NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick
The newest generation of the popular Llama AI models, Llama 4 Scout and Llama 4 Maverick, has arrived, accelerated by NVIDIA's open-source software. These models can achieve over 40K output tokens per second on NVIDIA Blackwell B200 GPUs. They are also available to try as NVIDIA NIM microservices.
Natively Multimodal and Multilingual
The Llama 4 models are now natively multimodal and multilingual and use a mixture-of-experts (MoE) architecture, in which a router activates only a small subset of expert subnetworks for each token, so far fewer parameters run per forward pass than the total parameter count suggests. This design delivers a variety of multimodal capabilities while advancing scale, speed, and efficiency, enabling the creation of more personalized experiences.
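To make the MoE idea concrete, the toy PyTorch layer below routes each token to its top-k experts, so only a fraction of the layer's weights is exercised per token. This is an illustrative sketch, not Llama 4's actual implementation; the dimensions, expert count, and routing details are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy top-k mixture-of-experts layer: a learned router sends each
    token to k of n_experts feed-forward networks, so only a small
    fraction of the layer's parameters is active per token."""
    def __init__(self, d_model=64, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                          # x: [tokens, d_model]
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # mixing weights per token
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e           # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
print(MoELayer()(tokens).shape)                    # torch.Size([8, 64])
```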
Llama 4 Scout
Llama 4 Scout is a 109B-parameter model with 16 experts (17B parameters active per token) and a 10M-token context window, optimized and quantized to int4 so it fits on a single NVIDIA H100 GPU. This configuration enables use cases such as multi-document summarization, parsing extensive user activity for personalized tasks, and reasoning over vast codebases.
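A quick back-of-the-envelope calculation shows why int4 makes single-GPU deployment plausible: at 4 bits (0.5 bytes) per parameter, the weights alone occupy roughly 55 GB, well under the H100's 80 GB. Note this counts weights only; activations and the KV cache, which grow with context length, are extra.

```python
params = 109e9                 # Llama 4 Scout total parameter count
bytes_per_param = 0.5          # int4 = 4 bits = 0.5 bytes
weight_gb = params * bytes_per_param / 1e9
print(f"~{weight_gb:.1f} GB of int4 weights vs. 80 GB of H100 memory")
# ~54.5 GB of int4 weights vs. 80 GB of H100 memory
```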

Llama 4 Maverick
Llama 4 Maverick is a 400B-parameter model with 128 experts (17B parameters active per token) and a 1M-token context length. This model excels at high-performance image and text understanding.
NVIDIA Optimizations
NVIDIA has optimized both Llama 4 Scout and Llama 4 Maverick models for NVIDIA TensorRT-LLM. TensorRT-LLM is an open-source library designed to accelerate LLM inference performance for the latest foundation models on NVIDIA GPUs.
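As a minimal sketch of what running these models through TensorRT-LLM can look like, the snippet below uses the library's high-level Python LLM API. The Hugging Face checkpoint ID is an assumption, and exact model support depends on your TensorRT-LLM version.

```python
from tensorrt_llm import LLM, SamplingParams

# Checkpoint ID is assumed; substitute the Llama 4 model path or
# pre-built engine that your TensorRT-LLM version supports.
llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
params = SamplingParams(max_tokens=128, temperature=0.7)

outputs = llm.generate(["Explain mixture-of-experts in two sentences."], params)
print(outputs[0].outputs[0].text)
```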

TensorRT Model Optimizer can be used to quantize bfloat16 models with the latest algorithmic model optimizations and quantization techniques, accelerating inference with Blackwell FP4 Tensor Core performance without compromising model accuracy.
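Below is a hedged sketch of what post-training FP4 quantization with TensorRT Model Optimizer (modelopt) can look like. The NVFP4_DEFAULT_CFG config name and calibration flow follow recent ModelOpt releases, but check the library documentation for the options supported in your version; load_bf16_model and calibration_batches are placeholders for your own loader and data.

```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Feed a small set of representative prompts through the model so
    # ModelOpt can collect activation statistics for calibration.
    for batch in calibration_batches:   # placeholder: your own dataloader
        model(batch)

model = load_bf16_model()               # placeholder: load the BF16 checkpoint
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=forward_loop)
```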
Performance Leaps with Blackwell B200 GPU
The Blackwell B200 GPU delivers significant performance leaps through architectural innovations, including a second-generation Transformer Engine, fifth-generation NVLink, and FP8, FP6, and FP4 precision. For Llama 4, these enhancements provide 3.4x higher throughput and 2.6x better cost per token compared with NVIDIA H200.
Open Source Collaboration
NVIDIA and Meta have a history of collaborating to advance open models. NVIDIA actively contributes to open-source software, helping the community work efficiently, tackle shared challenges, and improve performance while reducing costs.
Fine-Tuning with NVIDIA NeMo
Fine-tuning the Llama models is seamless with NVIDIA NeMo, an end-to-end framework for customizing large language models (LLMs) with enterprise data. NeMo Curator helps curate high-quality datasets for pretraining or fine-tuning, while NeMo supports efficient fine-tuning with LoRA and other parameter-efficient fine-tuning (PEFT) techniques, as well as full-parameter tuning, as sketched below.
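To illustrate the LoRA technique itself, here is a minimal sketch using the Hugging Face PEFT library rather than NeMo's own recipes, so treat it as a sketch of the concept, not NeMo's API. LoRA freezes the base weights and trains small low-rank adapter matrices on top of selected projection layers.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Checkpoint ID and target module names are assumptions for illustration.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, config)
model.print_trainable_parameters()  # adapters are a tiny fraction of all params
```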

NeMo Evaluator helps assess model performance, with support for industry benchmarks as well as custom test sets specific to your use case.
Deployment with NVIDIA NIM
The Llama 4 models will be packaged as NVIDIA NIM microservices, simplifying deployment on any GPU-accelerated infrastructure with flexibility, data privacy, and enterprise-grade security. NIM supports industry-standard APIs for quick deployment and scalability across clouds, data centers, and edge environments.
Experiment with your data and build a proof of concept by trying the Llama 4 NIM microservices provided by NVIDIA; a minimal client call is sketched below.
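Because a NIM microservice exposes an OpenAI-compatible API, a proof of concept can be a few lines of client code. The endpoint and model ID below are assumptions; use the values your deployment reports.

```python
from openai import OpenAI

# Base URL and model ID are assumptions; a self-hosted NIM typically
# serves an OpenAI-compatible API on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    max_tokens=256,
)
print(resp.choices[0].message.content)
```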