Empowering AI Innovation: Meta's Releases for Multi-Modal Processing and Music Generation

Published on Thu Jun 20 2024

Meta unveils five AI models for multi-modal processing, music generation, and more

Meta has unveiled five major new AI models and research projects, including multi-modal systems that can process both text and images, next-generation language models, music generation, AI speech detection, and efforts to improve diversity in AI systems. The releases come from Meta’s Fundamental AI Research (FAIR) team, which has focused on advancing AI through open research and collaboration for more than a decade. As AI innovation accelerates, Meta believes that working with the global community is crucial.

By publicly sharing this research, Meta hopes to inspire further iterations and ultimately help advance AI in a responsible way. Among the releases are key components of Meta’s ‘Chameleon’ models, made available under a research license. Chameleon is a family of multi-modal models that can understand and generate both text and images simultaneously, unlike most large language models, which are typically unimodal.

"Just as humans can process the words and images simultaneously, Chameleon can process and deliver both image and text at the same time," explained Meta. "Chameleon can take any combination of text and images as input and also output any combination of text and images." Potential use cases are virtually limitless from generating creative captions to prompting new scenes with text and images.

Key Components of Meta’s AI Models

Meta has also released, under a non-commercial research license, pretrained code-completion models that use ‘multi-token prediction’. Traditional language-model training is inefficient because the model learns to predict only the single next word; multi-token models are trained to predict several future words at once, which can make training faster.
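A minimal PyTorch sketch of the idea follows, assuming a shared trunk with one extra output head per future offset. The architecture, dimensions, and head count are illustrative assumptions, not the configuration of Meta's released models.

```python
import torch
import torch.nn as nn

# Sketch of multi-token prediction: one shared trunk, several output heads,
# where head k is trained to predict the token k+1 positions ahead.

class MultiTokenLM(nn.Module):
    def __init__(self, vocab_size=32_000, d_model=512, n_future=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # One linear head per future offset.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, tokens):                       # tokens: (batch, seq)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.trunk(self.embed(tokens), mask=causal)
        return [head(h) for head in self.heads]      # n_future logit tensors

def multi_token_loss(model, tokens):
    """Sum cross-entropy over all future offsets, not just the next token."""
    loss = 0.0
    for k, logits in enumerate(model(tokens), start=1):
        # The k-th head at position t is trained to predict token t+k.
        pred = logits[:, :-k, :].reshape(-1, logits.size(-1))
        target = tokens[:, k:].reshape(-1)
        loss = loss + nn.functional.cross_entropy(pred, target)
    return loss

model = MultiTokenLM()
batch = torch.randint(0, 32_000, (2, 16))
print(multi_token_loss(model, batch).item())
```

Summing the per-offset losses gives the model a denser training signal from each example, which is the efficiency gain the quote below alludes to.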

"While [the one-word] approach is simple and scalable, it’s also inefficient. It requires several orders of magnitude more text than what children need to learn the same degree of language fluency," said Meta.

On the creative side, Meta’s JASCO generates music clips from text while affording greater control by accepting additional inputs such as chords and beats.
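The article does not describe JASCO's interface, so the sketch below only illustrates the kind of conditioning it mentions: a text prompt plus symbolic controls such as timed chords and a tempo. Every class, field, and call here is hypothetical, not Meta's API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of JASCO-style conditioning inputs: free-form text plus
# symbolic musical controls. Names and structure are invented for illustration.

@dataclass
class Chord:
    symbol: str        # e.g. "Am", "F", "C", "G"
    start_sec: float   # when the chord begins within the clip

@dataclass
class MusicPrompt:
    text: str                                      # style description
    chords: list[Chord] = field(default_factory=list)
    bpm: float | None = None                       # tempo driving the beat grid
    duration_sec: float = 10.0

prompt = MusicPrompt(
    text="80s synth-pop, dreamy pads, punchy drums",
    chords=[Chord("Am", 0.0), Chord("F", 2.5), Chord("C", 5.0), Chord("G", 7.5)],
    bpm=96,
)
# A JASCO-like model would condition on all of these jointly:
# audio = model.generate(prompt)   # hypothetical call
print(prompt)
```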

Detecting AI-Generated Speech with AudioSeal

Meta claims AudioSeal is the first audio watermarking system designed specifically to detect AI-generated speech. It can pinpoint the AI-generated segments within longer audio clips up to 485x faster than previous detection methods.
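Pinpointing segments implies frame-level rather than clip-level detection. Below is a hedged sketch of just the post-processing step: given per-frame watermark probabilities from some detector (a stand-in here, not the actual AudioSeal model), contiguous high-scoring frames are merged into time spans.

```python
import numpy as np

# Sketch of localized detection: a detector scores each short frame for the
# presence of a watermark, and runs of high-scoring frames are reported as
# AI-generated segments. Frame length and threshold are illustrative.

FRAME_SEC = 0.02  # 20 ms frames (assumed)

def detect_segments(frame_scores: np.ndarray, threshold: float = 0.5):
    """Turn per-frame watermark probabilities into (start, end) time spans."""
    flags = frame_scores > threshold
    segments, start = [], None
    for i, flagged in enumerate(flags):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            segments.append((start * FRAME_SEC, i * FRAME_SEC))
            start = None
    if start is not None:
        segments.append((start * FRAME_SEC, len(flags) * FRAME_SEC))
    return segments

# Fake detector output: frames 100-200 look watermarked.
scores = np.zeros(500)
scores[100:200] = 0.9
print(detect_segments(scores))   # -> [(2.0, 4.0)]
```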

Improving Diversity in AI Systems

Another important release aims to improve the diversity of text-to-image models, which often exhibit geographical and cultural biases. Meta developed automatic indicators to evaluate potential geographical disparities and conducted a large-scale study, collecting more than 65,000 annotations, to understand how people around the world perceive geographic representation. "This enables more diversity and better representation in AI-generated images," said Meta.
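As a rough illustration of what an "automatic indicator" of geographic disparity could look like, the sketch below scores generated images per region with made-up numbers and reports the worst-versus-best gap. The metric and the data are assumptions for illustration only, not Meta's methodology.

```python
# Illustrative disparity indicator: compare a per-region quality score and
# measure its spread. All numbers below are invented placeholders.

region_scores = {          # hypothetical mean representation score per region
    "Africa": 0.61,
    "Asia": 0.72,
    "Europe": 0.86,
    "North America": 0.88,
    "South America": 0.70,
    "Oceania": 0.79,
}

values = list(region_scores.values())
disparity = max(values) - min(values)   # simple worst-vs-best gap
print(f"disparity indicator: {disparity:.2f}")
```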

Public Collaboration and Innovation

By publicly sharing these groundbreaking models, Meta says it hopes to foster collaboration and drive innovation within the AI community.

Source: Artificial Intelligence News