Chatbot 2.0: MiniGPT-4 Paves the Way

Published on May 8, 2023

MiniGPT-4: Small Chatbot, Large Vision-Language Understanding

ChatGPT made a big impact when it was released, but it was limited to text. GPT-4 promised to address this limitation by accepting images as input and producing detailed answers about them. However, OpenAI has not yet released this capability to the public.

MiniGPT-4 starts from the principle that a model capable of interacting with both text and images does not have to be disproportionately large. Its creators built on previous research, using Vicuna as the language model and a pretrained vision transformer to process images and extract meaningful features; crucially, both of these large components stay frozen, and only a small projection layer that aligns them is trained. They trained the model on 5 million image-text pairs from the LAION dataset.
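To make the idea concrete, here is a minimal PyTorch sketch of this setup. It is an illustration rather than the official implementation: the modules and dimensions are placeholders standing in for the real ViT and Vicuna weights, and the point is simply that the vision side stays frozen while a single linear projection maps image features into the language model's embedding space.

```python
import torch
import torch.nn as nn

class MiniGPT4Sketch(nn.Module):
    """Illustrative sketch, not the official implementation: a frozen
    vision encoder feeds a single trainable linear projection that
    aligns image features with the (frozen) LLM's embedding space."""

    def __init__(self, vision_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Stand-in for the pretrained vision stack; in the real model
        # this produces visual feature tokens from an image.
        self.vision_encoder = nn.Identity()
        for p in self.vision_encoder.parameters():
            p.requires_grad = False  # the vision side is kept frozen

        # The only trainable component: project visual features into
        # the language model's token-embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_visual_tokens, vision_dim)
        visual_tokens = self.proj(self.vision_encoder(image_features))
        # These projected tokens would be prepended to the text
        # embeddings and passed through the frozen Vicuna LLM (omitted).
        return visual_tokens

model = MiniGPT4Sketch()
dummy = torch.randn(1, 32, 768)   # one image, 32 visual feature tokens
print(model(dummy).shape)         # torch.Size([1, 32, 4096])
```

Training only this small projection layer, rather than the full model, is what keeps the approach cheap.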

The resulting model shows capabilities similar to GPT-4, but there are still some limitations. For example, MiniGPT-4 struggles to read fine-grained text within images and to localize objects spatially. However, the model could be improved with more image-text data or by swapping in a stronger visual perception model.

Despite its limitations, MiniGPT-4 is an interesting model that could be improved with more training. The creators made a clever choice in building on the BLIP-2 model and adding an extra instruction-tuning step. In principle, the model can run inference on a single standard GPU. Moreover, they trained it for only 10 hours on 4 A100 GPUs, making it one of the most efficient open-source visual language models available.
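As a rough illustration of that instruction-tuning step, the sketch below formats a curated image-description pair into a conversational prompt. The template string mimics the conversational style reported for MiniGPT-4, but the exact wording and the helper function are illustrative assumptions, not the project's actual code.

```python
# Hypothetical sketch of the second-stage instruction-tuning data
# format; the template mimics the conversational style reported for
# MiniGPT-4 and is not copied from the official code.
PROMPT_TEMPLATE = "###Human: <Img><ImageHere></Img> {instruction} ###Assistant: "

def build_training_example(instruction: str, detailed_caption: str) -> dict:
    """Pair an instruction-wrapped image prompt with the curated,
    detailed caption used as the training target. <ImageHere> marks
    where the projected visual tokens are spliced in."""
    return {
        "prompt": PROMPT_TEMPLATE.format(instruction=instruction),
        "target": detailed_caption,
    }

example = build_training_example(
    "Describe this image in detail.",
    "A golden retriever rests on a sunlit wooden porch beside a red ball.",
)
print(example["prompt"])
# ###Human: <Img><ImageHere></Img> Describe this image in detail. ###Assistant:
```

Fine-tuning the projection layer on a small set of such detailed, conversation-style examples is what turns the aligned model into a usable chatbot.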

If you're interested, you can try the MiniGPT-4 demo yourself; the code, model, and dataset are all publicly available. MiniGPT-4 shows that adapting already-trained models can yield impressive results.