NVIDIA Speeds Up Meta’s Llama 4 AI Models With Blackwell GPUs, Hitting 40,000 Tokens Per Second

April 8, 2025


NVIDIA (NVDA, Financials) said it boosted the speed and efficiency of Meta Platforms’ (META, Financials) latest Llama 4 artificial intelligence models by running them on its new Blackwell B200 graphics processors, reaching output speeds of over 40,000 tokens per second.

The two models, Llama 4 Scout and Llama 4 Maverick, are available as NVIDIA NIM microservices and are built with a mixture-of-experts design to handle multilingual and multimodal tasks. Llama 4 Scout has 109 billion parameters, with 17 billion active per token, and can handle up to 10 million tokens in a single context. It’s built to run efficiently on a single NVIDIA H100 GPU and is suited for summarizing documents, analyzing user activity, and reading large codebases.

Llama 4 Maverick is a larger model with 400 billion parameters, 17 billion of which are active per token across 128 experts, and accepts up to 1 million context tokens. It’s designed for tasks that require understanding both images and text.
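
A mixture-of-experts model of this kind routes each token to a small subset of specialist sub-networks, which is how a 109-billion-parameter model can activate only about 17 billion parameters per token. The sketch below shows the routing idea in miniature with PyTorch; the layer sizes, expert count, and top-k value are illustrative, not Llama 4’s actual configuration.

```python
# Minimal mixture-of-experts routing sketch (illustrative only, not Llama 4's
# actual implementation). A router scores each token against every expert and
# only the top-k experts run for that token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        weights = F.softmax(self.router(x), dim=-1)       # (tokens, num_experts)
        topw, topi = weights.topk(self.top_k, dim=-1)     # pick k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topi[:, k] == e                    # tokens routed to expert e
                if mask.any():
                    out[mask] += topw[mask, k : k + 1] * expert(x[mask])
        return out

layer = TinyMoELayer(dim=64, num_experts=8, top_k=2)
tokens = torch.randn(16, 64)
print(layer(tokens).shape)  # torch.Size([16, 64])
```

Because only the selected experts run for each token, compute per token scales with the active parameter count rather than the full model size.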

Both models are tuned for NVIDIA’s open-source TensorRT-LLM library, which speeds up inference for large models. NVIDIA said its TensorRT Model Optimizer lets developers restructure and compress models using bfloat16 and FP4 precision without hurting accuracy.
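
As an illustration of what that quantization workflow can look like, the sketch below runs post-training quantization on a Hugging Face checkpoint with the nvidia-modelopt package. The model ID, config name, and calibration prompt are assumptions; exact config names (including the FP4 variants) vary by Model Optimizer release, so check them against the installed version.

```python
# Sketch of post-training quantization with NVIDIA's TensorRT Model Optimizer
# (pip package `nvidia-modelopt`). Config names and availability are
# version-dependent assumptions to verify against your release.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed Hugging Face ID
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration pass: run a few representative prompts so the quantizer
    # can observe activation ranges before picking scales.
    batch = tokenizer(["Summarize this document:"], return_tensors="pt")
    m(**batch)

# Quantize to FP8 here; Blackwell-era releases also ship FP4-style configs,
# whose names vary by version.
quantized = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=forward_loop)
```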

On the Blackwell B200 GPU, Llama 4 Scout delivers over 40,000 tokens per second, while Maverick tops 30,000. The company said this represents a 3.4x boost in throughput and 2.6x better cost efficiency compared to its previous H200 GPU, helped by Blackwell’s upgraded architecture and support for precision formats like FP8 and FP4.
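
As a quick back-of-envelope check (our arithmetic, not a figure NVIDIA published), the stated 3.4x gain implies a prior H200 baseline of roughly 12,000 tokens per second for Scout:

```python
# Derive the implied H200 baseline from the B200 figure and the stated
# 3.4x throughput gain. These are inferred, not published, numbers.
b200_tokens_per_s = 40_000
speedup = 3.4
print(f"Implied H200 baseline: ~{b200_tokens_per_s / speedup:,.0f} tokens/s")  # ~11,765
```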

NVIDIA said its work with Meta continues a broader effort to support open-source AI tools. The Llama 4 models can be customized with the NVIDIA NeMo framework, which helps companies prepare datasets, fine-tune models with parameter-efficient (PEFT) techniques such as LoRA, and test performance using custom or standard benchmarks.
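
For a concrete sense of what LoRA-style fine-tuning involves, here is a minimal sketch using the Hugging Face peft library rather than NeMo itself; the model ID and target module names are assumptions to adapt to the actual checkpoint.

```python
# Illustration of LoRA fine-tuning with the Hugging Face `peft` library —
# the same technique the article names, though not NeMo's own API.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-4-Scout-17B-16E-Instruct"  # assumed Hugging Face ID
)

# LoRA injects small low-rank adapter matrices into selected projections and
# trains only those, leaving the base weights frozen.
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # module names vary by architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a tiny fraction of the 109B total
```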

To help businesses get started, the models will be delivered as NIM microservices that can run on any GPU-accelerated system. These services support standard APIs and are built to scale across cloud, data center, and edge setups with privacy and security features.
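
In practice, NIM endpoints expose an OpenAI-compatible API, so a deployed Llama 4 service can be called with standard client libraries. Below is a minimal sketch assuming a local deployment; the URL, port, and model name are placeholders for whatever your deployment reports.

```python
# Sketch of calling a deployed Llama 4 NIM endpoint through its
# OpenAI-compatible API. Endpoint and model name are assumptions.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local NIM endpoint
    api_key="not-used",                   # local NIMs typically ignore the key
)

response = client.chat.completions.create(
    model="meta/llama-4-scout-17b-16e-instruct",  # assumed NIM model name
    messages=[{"role": "user", "content": "Summarize this codebase layout."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```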

This article first appeared on GuruFocus.
