The Rise of Multimodal Large Language Models in AI

Published On Mon Jul 01 2024

Multimodal Large Language Models (MLLMs) transforming Computer Vision

Learn about the Multimodal Large Language Models (MLLMs) that are redefining and transforming Computer Vision. This article introduces what a Multimodal Large Language Model (MLLM) is, explores their applications using challenging prompts, and covers the top models reshaping Computer Vision as we speak.

What are Multimodal Large Language Models?

In layman's terms, a Multimodal Large Language Model (MLLM) is a model that merges the reasoning capabilities of Large Language Models (LLMs), such as GPT-3 or LLaMA-3, with the ability to receive, reason over, and generate multimodal information. Figure 1 illustrates a multimodal AI system in healthcare. It receives two inputs: a medical image and a text query: "Is pleural effusion present in this image?". The system's output is an answer (i.e., a prediction) to the given query.
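To make this concrete, here is a minimal sketch of how such an image-plus-question query can be run against an open MLLM. It assumes the Hugging Face transformers library and the llava-hf/llava-1.5-7b-hf checkpoint (chosen only as an example, not necessarily the system in Figure 1), and the image path is hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Load a general-purpose open MLLM (LLaVA 1.5); a medical-domain model would
# be preferable for a real radiology use case.
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Two inputs: a medical image (hypothetical path) and a text query.
image = Image.open("chest_xray.png")
prompt = "USER: <image>\nIs pleural effusion present in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)

# The decoded text is the model's answer (i.e., its prediction) to the query.
print(processor.decode(output_ids[0], skip_special_tokens=True))
```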

Transformation in Artificial Intelligence

Over the past few years, there has been a significant transformation in Artificial Intelligence, largely driven by the rise of Transformers in language models. One of the earliest examples of this shift in vision was the Vision Transformer (ViT), which splits an image into multiple patches and treats each patch as an individual visual token in the input representation. With the rise of Large Language Models (LLMs), a new type of generative model, Multimodal Large Language Models (MLLMs) naturally emerged.
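As a rough illustration of that idea, the sketch below shows the usual ViT-style "patchify" step in PyTorch: a strided convolution turns a 224x224 image into a sequence of patch embeddings, i.e., visual tokens. The sizes are illustrative, and this is not a full ViT implementation.

```python
import torch
import torch.nn as nn

# Toy example: a 224x224 RGB image is split into 16x16 patches, each projected
# to an embedding, so the image becomes a sequence of visual tokens.
image = torch.randn(1, 3, 224, 224)          # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# A strided convolution computes one embedding per non-overlapping patch.
to_patches = nn.Conv2d(3, embed_dim, kernel_size=patch_size, stride=patch_size)

tokens = to_patches(image)                   # (1, 768, 14, 14)
tokens = tokens.flatten(2).transpose(1, 2)   # (1, 196, 768): 196 visual tokens
print(tokens.shape)
```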

Vision Language Models (VLMs)

Vision Language Models (VLMs) are a specialized category of multimodal models that take text and image inputs and generate text outputs. The main differences are that MLLMs can work with additional modalities beyond the text and images VLMs are limited to, and that VLMs tend to have weaker reasoning capabilities.

Performance Testing of MLLMs

Instead of providing a list of the different use cases where these models excel, we spun up a couple of GPUs to test three of the top MLLMs using challenging queries. Figure 4 shows how these three top models performed when given an image and a challenging prompt that asked them to count hard hats.
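For reference, a counting query like the one in Figure 4 can be issued along these lines. This is a sketch using the OpenAI Python SDK with GPT-4o purely as an example of a frontier MLLM; the image file name is hypothetical, and the exact models and prompt wording in our test may differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Encode the construction-site image (hypothetical local file).
with open("construction_site.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# A counting prompt similar in spirit to the one shown in Figure 4.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Count the hard hats in this image. Answer with a single number."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```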


Challenges and Future Prospects

Multimodal Large Language Models (MLLMs) perform well on average, but they aren't yet ready to solve computer vision tasks in more demanding use cases. Even a YOLOv8 model does better on such specific (niche) tasks. Is fine-tuning MLLMs the way to go instead?
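For comparison, the specialized-detector baseline mentioned above would look roughly like this with the ultralytics package. The hard-hat checkpoint name is hypothetical, since the stock COCO weights do not include a hard-hat class and a model fine-tuned on a hard-hat dataset is assumed.

```python
from ultralytics import YOLO

# Hypothetical hard-hat detector: a YOLOv8 checkpoint fine-tuned on a
# hard-hat dataset (the default COCO weights have no such class).
model = YOLO("hardhat_yolov8n.pt")

results = model("construction_site.jpg")   # run inference on the same image
n_hard_hats = len(results[0].boxes)        # one box per detected hard hat
print(f"Hard hats detected: {n_hard_hats}")
```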

Impact of MLLMs on Computer Vision

Multimodal models are definitely transforming computer vision. As an ML/MLOps Engineer, how can you best leverage them when building robust AI pipelines? Moreover, how do these models, some of which are also known as foundation models, impact a traditional computer vision pipeline? Learn more about the cutting edge of multimodality and foundation models in our brand-new CVPR 2024 series: