Meta has released Llama 4 Scout and Llama 4 Maverick, two open-weight frontier large language models that introduce major architectural changes while expanding the company’s presence across consumer apps and cloud platforms.
Both models are designed with a native multimodal structure and a sparse mixture-of-experts (MoE) system, with Scout optimized for single-GPU use and Maverick targeting larger enterprise workloads.
The company has also revealed a 2-trillion parameter teacher model—Llama 4 Behemoth—currently still in training, and a multimodal vision model, Llama 4-V, to follow later.
While Scout is built to fit on a single H100 GPU via int4 quantization, it still offers a best-in-class context window of 10 million tokens, up from the 128K context of Llama 3. It features 17 billion active parameters with 16 experts and 109 billion total parameters.
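A rough back-of-the-envelope check (not an official Meta figure) shows why that single-GPU fit is plausible: at 4 bits per weight, the full parameter set occupies roughly 55 GB, leaving headroom within an 80 GB H100, although KV cache, activations, and framework overhead are ignored here.

```python
# Back-of-the-envelope check: do 109B int4-quantized weights fit in an 80 GB H100?
# Assumes 4 bits (0.5 bytes) per parameter and ignores KV cache, activations,
# and runtime overhead, so the real footprint will be somewhat higher.
TOTAL_PARAMS = 109e9          # Llama 4 Scout total parameters
BYTES_PER_PARAM_INT4 = 0.5    # 4-bit weights
H100_MEMORY_GB = 80           # HBM on a single H100

weight_gb = TOTAL_PARAMS * BYTES_PER_PARAM_INT4 / 1e9
print(f"Approx. weight memory: {weight_gb:.1f} GB")        # ~54.5 GB
print(f"Fits on one H100: {weight_gb < H100_MEMORY_GB}")   # True, with headroom
```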
Maverick shares the same active parameter count but scales the MoE setup to 128 experts and 400 billion total parameters, enabling more sophisticated reasoning and image-understanding tasks. Both models process images and text jointly through early fusion, a method in which image and text tokens are embedded into a single shared backbone during pretraining.
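Meta has not published the fusion module itself, but the general early-fusion pattern can be sketched as follows: projected image-patch embeddings and text-token embeddings are concatenated into one sequence that flows through a shared transformer backbone. All dimensions and layer choices below are illustrative assumptions, not the Llama 4 architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionBackbone(nn.Module):
    """Toy early-fusion backbone: image patches and text tokens share one transformer."""
    def __init__(self, d_model=1024, vocab_size=32000, patch_dim=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.image_proj = nn.Linear(patch_dim, d_model)   # map vision features into the text width
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, text_ids, image_patches):
        text_tokens = self.text_embed(text_ids)                 # (B, T_text, d_model)
        image_tokens = self.image_proj(image_patches)           # (B, T_img, d_model)
        fused = torch.cat([image_tokens, text_tokens], dim=1)   # one joint sequence
        return self.backbone(fused)                             # both modalities attend to each other

# Example: 16 text tokens fused with 9 image patches into one 25-token sequence.
model = EarlyFusionBackbone()
out = model(torch.randint(0, 32000, (1, 16)), torch.randn(1, 9, 768))
print(out.shape)  # torch.Size([1, 25, 1024])
```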
As part of Meta’s system-level design, models were trained on up to 48 images per example, with Scout tested post-training on as many as eight. This visual grounding enables capabilities such as object localization and improved alignment between image content and language prompts. According to Meta, “Llama 4 Scout is best-in-class on image grounding, able to align user prompts with relevant visual concepts and anchor model responses to regions in the image.”
Benchmark Performance: Scout, Maverick, and Behemoth
Llama 4 Maverick is positioned by Meta as a high-performance multimodal assistant, and internal benchmarks reflect that claim. On visual reasoning tasks, it achieves 90.0 on ChartQA and 94.4 on DocVQA, outperforming both GPT-4o and Gemini 2.0 Flash. It also logs 73.7 on MathVista and 80.5 on MMLU Pro, indicating strong general reasoning capabilities.
In programming tasks, Maverick scores 43.4 on LiveCodeBench, placing it ahead of GPT-4o and Gemini 2.0 Flash and just below DeepSeek v3.1. Its assistant performance is reinforced by an Elo score of 1417 on LMArena. For cost efficiency, Meta estimates inference costs of $0.19–$0.49 per million tokens under a 3:1 input-to-output blend.
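The blended figure is straightforward to reproduce: with three input tokens for every output token, the blend is simply a weighted average of the two per-direction prices. The prices in the sketch below are placeholders chosen for illustration, not Meta's published numbers.

```python
# Hypothetical per-million-token prices, chosen only to illustrate the 3:1 blend math.
input_price = 0.15   # $/M input tokens (assumed)
output_price = 0.60  # $/M output tokens (assumed)

# A 3:1 blend means three input tokens for every output token.
blended = (3 * input_price + 1 * output_price) / 4
print(f"Blended cost: ${blended:.2f} per million tokens")  # $0.26 with these assumptions
```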
Llama 4 Scout, while smaller in scale, holds its own among models in its class. It scores 88.8 on ChartQA and 94.4 on DocVQA, matching Maverick on the latter, and reaches 74.3 on MMLU Pro. These results highlight its effectiveness in visual and reasoning benchmarks, particularly for lightweight or single-GPU deployments.
That score parity with larger models on image tasks points to strong design optimizations, especially for use cases that require context-rich multimodal understanding without heavy infrastructure overhead.
Llama 4 Behemoth remains unreleased but served as the teacher model for codistillation of Maverick and Scout. With 288 billion active parameters and nearly 2 trillion total, its performance places it in the upper echelon of current LLMs. Meta reports benchmark scores of 95.0 on MATH-500, 82.2 on MMLU Pro, 73.7 on GPQA Diamond, and 85.8 on Multilingual MMLU.
These scores indicate that Behemoth surpasses Claude Sonnet 3.7, Gemini 2.0 Pro, and GPT-4.5 in STEM and multilingual reasoning tasks, reinforcing its role as the foundation for the smaller Llama 4 models.
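Meta has said the distillation loss dynamically weights soft and hard targets during training, but has not published the details of the codistillation used for the smaller models. The following is therefore only a generic knowledge-distillation sketch: the student is trained against a blend of the teacher's softened logits and the ground-truth labels, with a fixed blend weight standing in for Meta's dynamic schedule.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Generic KD loss: alpha * soft-target KL + (1 - alpha) * hard-target CE.

    Meta describes dynamically weighting soft and hard targets through training;
    here alpha is a fixed hyperparameter purely for illustration.
    """
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```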
Training Strategy and Novel Architectures
Llama 4 marks Meta’s first use of MoE layers interspersed with dense layers in production models. Only a small fraction of the parameters are activated per token, improving efficiency without significantly affecting quality. Each Maverick token is routed to one of 128 experts plus a shared expert, with all experts loaded in memory but selectively activated during inference.
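Meta's production kernels are not public; the sketch below only illustrates the routing pattern described above, in which every token always passes through a shared expert and is additionally sent to a single routed expert chosen by a learned gate. Sizes, expert count, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Toy sketch of top-1 routing with an always-on shared expert."""
    def __init__(self, d_model=512, d_ff=2048, num_experts=128):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )

    def forward(self, x):                        # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.max(dim=-1)           # pick one routed expert per token
        routed = torch.zeros_like(x)
        for e in idx.unique().tolist():          # only the selected experts do any work
            mask = idx == e
            routed[mask] = weight[mask].unsqueeze(-1) * self.experts[e](x[mask])
        # Every token also flows through the shared expert.
        return self.shared_expert(x) + routed

# Example: 8 tokens in, 8 tokens out, with only a handful of experts activated.
print(MoELayer()(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```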
Meta also implemented a novel positional-encoding approach called iRoPE, which interleaves attention layers that use no positional embeddings among layers that use rotary position embeddings (RoPE), improving long-context generalization. “We call this the iRoPE architecture, where ‘i’ stands for ‘interleaved’ attention layers, highlighting the long-term goal of supporting ‘infinite’ context length.”
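Beyond that quote, Meta has not published the exact layer pattern, so the following is only a toy illustration of the interleaving idea: most attention layers use RoPE, while every Nth layer drops positional embeddings entirely. The interval used here is an assumed value, not Meta's.

```python
# Illustrative layer schedule for an "interleaved" RoPE architecture.
# Assumption: every 4th attention layer skips positional embeddings entirely,
# acting as a position-free layer, while the rest use standard RoPE.
NUM_LAYERS = 48
NOPE_INTERVAL = 4  # assumed interval, not Meta's published value

layer_schedule = [
    "nope" if (i + 1) % NOPE_INTERVAL == 0 else "rope"
    for i in range(NUM_LAYERS)
]
print(layer_schedule[:8])  # ['rope', 'rope', 'rope', 'nope', 'rope', 'rope', 'rope', 'nope']
```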
Scout and Maverick were both pre- and post-trained with 256K context windows to improve adaptation to longer sequences. The company used FP8 precision for training to increase throughput, achieving 390 TFLOPs per GPU during Behemoth’s pretraining across 32K GPUs. MetaP, a system for dynamically scaling initialization and learning rates, was used to generalize hyperparameter tuning across varying model sizes and batch configurations.
Cloud Availability and Licensing Changes
Meta is making Llama 4 Scout and Llama 4 Maverick available for download on llama.com and Hugging Face. For the launch, Meta partnered with major cloud providers to expedite adoption. AWS has already added Llama 4 Scout and Llama 4 Maverick to Amazon SageMaker JumpStart, with Bedrock support expected soon. Simultaneously, Microsoft rolled out support through Azure AI Foundry and Azure Databricks.
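As one concrete illustration of the cloud path, a SageMaker JumpStart deployment might look like the sketch below. The JumpStart model ID is an assumed placeholder to be confirmed in the catalog, and deploying the endpoint creates real infrastructure that incurs AWS charges.

```python
from sagemaker.jumpstart.model import JumpStartModel

# The model ID below is an assumed placeholder; look up the exact Llama 4 entry
# in the SageMaker JumpStart catalog before deploying.
model = JumpStartModel(model_id="meta-textgeneration-llama-4-scout-17b-16e-instruct")

# Deploying creates a live endpoint and incurs charges; accept_eula acknowledges Meta's license.
predictor = model.deploy(accept_eula=True)

response = predictor.predict({"inputs": "Summarize the Llama 4 lineup in one sentence."})
print(response)

predictor.delete_endpoint()  # clean up when finished
```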
These integrations provide developers with direct access to preconfigured APIs for fine-tuning and inference, reducing time-to-deployment in production environments.
Licensing has also shifted. Unlike the original LLaMA release, which was restricted to non-commercial research, the new models ship under a custom license that permits commercial use. Meta describes it as flexible, though it stops short of full open-source status.
System-Level Safety and Bias Reduction
Alongside its model improvements, Meta emphasized a suite of safeguards. Llama Guard, an input/output classifier based on a risk taxonomy from MLCommons, is included to detect harmful content. Prompt Guard, trained on a wide range of attack types, is designed to catch jailbreak attempts and prompt injections. CyberSecEval helps developers test AI models against cybersecurity threats.
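To show how such a classifier typically sits in front of a model, the sketch below screens a user prompt with a Llama Guard-style checkpoint before it would reach the assistant. The checkpoint name and the "safe"/"unsafe" output convention are assumptions carried over from earlier Llama Guard releases, not details of the Llama 4 launch.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumed checkpoint; earlier Llama Guard releases follow this naming pattern on Hugging Face.
guard_id = "meta-llama/Llama-Guard-3-8B"
tokenizer = AutoTokenizer.from_pretrained(guard_id)
guard = AutoModelForCausalLM.from_pretrained(guard_id, device_map="auto", torch_dtype="auto")

def is_safe(user_message: str) -> bool:
    # The chat template formats the conversation into the guard model's moderation prompt.
    chat = [{"role": "user", "content": user_message}]
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    output = guard.generate(input_ids, max_new_tokens=20)
    verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return verdict.strip().startswith("safe")  # "unsafe" verdicts also list a hazard category

print(is_safe("How do I make a birthday cake?"))
```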
Meta also introduced a new red-teaming framework called GOAT—Generative Offensive Agent Testing. This tool simulates multi-turn conversations with medium-skilled adversarial actors, helping Meta increase testing coverage and uncover vulnerabilities more efficiently.
Bias remains a core concern. In tests on politically charged topics, refusal rates in Llama 4 have dropped to under 2%—down from 7% in Llama 3.3. Unequal response refusals across ideologies now fall below 1%. Meta says it is working toward models that can represent diverse viewpoints without imposing a stance.
Ecosystem Integration and Future Roadmap
Llama 4 Scout and Maverick are already live in Meta AI features across WhatsApp, Messenger, Instagram Direct, and the web interface. These integrations offer a broad testbed to evaluate performance in the wild, while simultaneously exposing the models to vast user input streams that could inform future improvements.
Looking ahead, Meta is set to showcase more details at LlamaCon on April 29. Topics will include further scaling of the Behemoth model and the introduction of Llama 4-V, a fully multimodal vision-language model capable of handling both static and temporal visual inputs. The announcement underscores Meta’s aim to deliver systems that are not just linguistically competent, but also capable of high-fidelity multimodal reasoning.
Meta’s position in the open-weight ecosystem remains nuanced. The Llama 4 models aren’t fully open-source, but they offer a degree of transparency and flexibility that sits between purely closed systems and community-driven models. Their deployment across billions of endpoints—from cloud APIs to messaging apps—could shape developer expectations around scale, performance, and responsible usage in the months ahead.