Meta AI Releases Llama Guard 3-1B-INT4: A Compact and High ...
Generative AI systems have revolutionized human interactions with technology, providing cutting-edge capabilities in natural language processing and content generation. However, these systems come with significant risks, especially in producing unsafe or policy-violating content. To tackle this challenge, advanced moderation tools are essential to ensure that the outputs are safe and compliant with ethical standards. These tools need to be efficient and effective, especially for deployment on hardware with limited resources, such as mobile devices.
Challenges in Safety Moderation Models
One of the primary challenges in deploying safety moderation models is their size and computational requirements. While large language models (LLMs) are powerful and accurate, they demand significant memory and processing power, making them unsuitable for devices with constrained hardware. Deploying such models can result in runtime bottlenecks or failures on devices with limited DRAM, severely restricting their usability. To address this issue, researchers have been working on compressing LLMs without compromising performance.
Introducing Llama Guard 3-1B-INT4
Meta introduced Llama Guard 3-1B-INT4, a safety moderation model designed to overcome these challenges. This model, unveiled during Meta Connect 2024, is just 440 MB in size, significantly smaller than its predecessor, Llama Guard 3-1B. The reduction in size was achieved through advanced compression techniques such as decoder block pruning, neuron-level pruning, and quantization-aware training. By distilling from the larger Llama Guard 3-8B model, the researchers were able to maintain quality after compression. Notably, the model delivers a throughput of at least 30 tokens per second, with a time-to-first-token of less than 2.5 seconds on a standard Android mobile CPU.
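To make the workflow concrete, the sketch below shows how a Llama Guard-style checkpoint is typically queried as a conversation-moderation classifier through Hugging Face transformers. The Hub model ID, prompt format, and loading path are assumptions based on how other Llama Guard checkpoints are published, not a confirmed API for this release; the INT4 variant targets on-device runtimes and may ship in a runtime-specific format instead.

```python
# Hedged sketch: querying a Llama Guard-style moderation model via Hugging Face
# transformers. The Hub ID below is hypothetical for the INT4 release, which
# targets on-device runtimes and may not load this way.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-Guard-3-1B-INT4"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Llama Guard classifies a conversation as "safe" or "unsafe" (plus violated
# policy categories); the chat template builds the moderation prompt.
conversation = [{"role": "user", "content": "How do I hotwire a car?"}]
input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=20)

verdict = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(verdict)  # e.g. "unsafe" followed by a category code when a policy is violated
```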

Technical Advancements in Llama Guard 3-1B-INT4
Several methodologies underpin the technical advancements in Llama Guard 3-1B-INT4. Pruning reduced the number of decoder blocks and the MLP hidden dimension, bringing the parameter count down to 1.1 billion. Quantization further compressed the model by lowering the precision of its weights and activations, significantly cutting its size while maintaining compatibility with existing interfaces. Together, these optimizations make the model deployable on mobile devices without compromising its moderation quality.
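The following is an illustrative sketch, not Meta's actual code, of the two compression steps described above: dropping decoder blocks and simulating 4-bit weight precision with the kind of symmetric fake quantization used during quantization-aware training. The layer layout assumed is the Hugging Face Llama structure, and the number of blocks kept is an arbitrary placeholder.

```python
# Illustrative sketch of block-level pruning and INT4 fake quantization in
# plain PyTorch; hyperparameters and layer layout are assumptions.
import torch
import torch.nn as nn


def prune_decoder_blocks(model: nn.Module, keep: int) -> None:
    """Keep only the first `keep` decoder blocks (block-level pruning).

    Assumes the Hugging Face Llama layout, where decoder blocks live in
    `model.model.layers`.
    """
    model.model.layers = nn.ModuleList(list(model.model.layers)[:keep])
    model.config.num_hidden_layers = keep


def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor INT4 fake quantization: quantize, then dequantize.

    During quantization-aware training this runs in the forward pass so the
    network learns to tolerate 4-bit weight precision.
    """
    qmax = 7  # the signed 4-bit range is [-8, 7]
    scale = w.abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(w / scale), -8, qmax)
    return q * scale
```

In the pipeline the article describes, pruning of this kind is followed by distillation from the larger Llama Guard 3-8B model to recover the quality lost to compression.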
Performance and Results
The performance of Llama Guard 3-1B-INT4 highlights its effectiveness. It achieves an F1 score of 0.904 for English content, outperforming its larger counterpart, Llama Guard 3-1B. In multilingual capabilities, the model performs on par with or better than larger models in several non-English languages. When compared to GPT-4, Llama Guard 3-1B-INT4 demonstrates superior safety moderation scores in multiple languages. The reduced size and optimized performance make it a practical solution for mobile deployment, as evidenced by successful testing on a Moto-Razor phone.
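For readers unfamiliar with the metric, the F1 scores above are standard binary classification scores over safe/unsafe verdicts. Below is a minimal sketch of how such a score is computed, using invented toy labels rather than the benchmark data.

```python
# Toy illustration of an F1 computation over safe/unsafe verdicts; the labels
# are invented for the example and unrelated to the reported benchmarks.
from sklearn.metrics import f1_score

human_labels = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = unsafe, 0 = safe
model_preds  = [1, 0, 1, 0, 0, 0, 1, 1]

print(f"F1 = {f1_score(human_labels, model_preds):.3f}")
```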
Conclusion
Llama Guard 3-1B-INT4 represents a significant advancement in safety moderation for generative AI. It addresses challenges of size, efficiency, and performance, providing a compact yet robust model for mobile deployment. Through innovative compression techniques and careful fine-tuning, the researchers have developed a scalable and reliable tool that sets the stage for safer AI systems across a range of applications.
For more information, see the paper and code.