Hugging Face claims world’s smallest vision language models
The models are designed to run on constrained devices with less than 1GB of RAM
Hugging Face has introduced two new models in its SmolVLM series, which it claims are the smallest vision language models (VLMs) to date.
The models, SmolVLM-256M and SmolVLM-500M, are designed to deliver multimodal performance, including tasks such as image captioning, document Q&A and basic visual reasoning, while using significantly fewer computational resources than their predecessors.
According to Hugging Face, the 256M model, with just 256 million parameters, can run on constrained devices such as laptops with less than 1GB of RAM. The company says the models are also suited for developers looking to process large amounts of data cost-effectively.
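For context, loading and querying a model of this size follows the standard transformers workflow. The sketch below is illustrative only: the Hub repository id, the instruction-tuned variant name and the chat-style prompt format are assumptions based on Hugging Face's usual conventions, not details confirmed in this article.

import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "HuggingFaceTB/SmolVLM-256M-Instruct"  # assumed repository id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

# Build a chat-style prompt pairing one image with a text instruction.
image = Image.open("photo.jpg")
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image briefly."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")

# Generate a short caption; at this scale, inference stays practical on CPU.
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])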
The release marks a shift towards making AI tools accessible on lower-specification hardware. Hugging Face says the 256M model, which is being positioned as the smallest VLM ever created, offers performance comparable to much larger models released just over a year ago. The 500M model provides additional performance for more demanding tasks while maintaining a compact design.
To achieve this smaller scale, the company implemented several changes. The models pair the language backbone with a vision encoder of just 93 million parameters, replacing the 400-million-parameter SigLIP 400M SO encoder used in earlier SmolVLM releases.
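Though not an official API, one quick way to sanity-check that figure is to sum parameter counts over the encoder's weights. The sketch below reuses the assumed Hub id from above and matches parameter names on the substring "vision", a naming heuristic rather than a documented attribute path.

from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceTB/SmolVLM-256M-Instruct")

# Heuristic: treat every parameter whose name mentions "vision" as part of
# the vision encoder, then report the total in millions.
vision_params = sum(p.numel() for n, p in model.named_parameters() if "vision" in n)
print(f"Vision encoder parameters: ~{vision_params / 1e6:.0f}M")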
Despite the encoder's smaller size, Hugging Face claims it can process images at a higher resolution, improving visual understanding without increasing computational demands. The company also introduced special tokens for sub-image separators, which it says makes processing more efficient and training more stable; a short demonstration of the idea follows below. Updates to the training data balance increased the focus on document understanding and image captioning, which Hugging Face suggests improves performance on those tasks.
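The benefit of dedicated separator tokens is straightforward to show with any tokenizer: a marker such as "<row_1_col_1>" fragments into several tokens when treated as plain text, but occupies a single position once registered as a special token. The sketch below uses the gpt2 tokenizer and an assumed separator string purely for illustration; the article does not specify SmolVLM's actual markers.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any tokenizer works for this demo
sep = "<row_1_col_1>"  # assumed sub-image separator string

# Treated as ordinary text, the marker splits into many tokens, and that cost
# is paid once per image tile in every prompt.
print(len(tok(sep)["input_ids"]))

# Registered as a special token, the same marker costs exactly one position.
tok.add_special_tokens({"additional_special_tokens": [sep]})
print(len(tok(sep)["input_ids"]))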
The new models are part of Hugging Face’s broader SmolVLM and SmolLM series, which now includes a range of smaller-scale language and vision-language models. Hugging Face says these smaller models aim to address the needs of developers working on constrained hardware or large-scale data processing, offering a trade-off between size and performance.