Meta’s On-Device AI Vision: From Chatbots to Personal Agents
May 19, 2026
According to a keynote speech delivered at the Embedded Vision Summit by Vikas Chandra, senior director at Meta Reality Labs, Meta is working to bring advanced perception and agentic intelligence to personal devices and wearables despite their limited hardware.
Shifting from Chatbots to Personal Agents
Chandra argued that the industry will move away from chatbots, though not entirely, toward personal agents that are always available, always private, and always ready. These agents would draw on the contextual signals already available on smartphones and wearables, such as location, weather, and health data like heart rate.
Key Technical Challenges
Running sophisticated multimodal perception on a smartphone or smart glasses remains a significant technical hurdle. The most critical limitation today is memory bandwidth, Chandra said, as an agent must maintain persistent context while operating under strict compute, memory, and power budgets.
Chandra stated that the most important work over the next five to ten years will focus not on model size but on creating AI that understands more context about the user to provide daily assistance.
Four Areas of Breakthrough
Designing capable models for smartphone hardware requires advances in four areas: quantization (reducing model weight size), architecture (model shape), runtime optimization (eliminating wasted computation), and vision (efficient multimodal perception).
Quantization
For on-device models, quantization is essential. Meta’s ParetoQ, presented at NeurIPS 2025, demonstrated that for a fixed memory budget, larger models with more parameters at lower precision outperform smaller models at higher precision. The team also found that below 2 or 3 bits, the way models learn changes, though the reason is still under investigation.
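The fixed-budget trade-off comes down to simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below works that out for a hypothetical 500 MB weight budget. The budget and the function names are illustrative assumptions, not figures from the talk or the ParetoQ paper, and the estimate ignores quantization scales and activations.

```python
def weight_memory_bytes(num_params: int, bits_per_weight: float) -> float:
    """Approximate weight storage, ignoring scales, zero points, and activations."""
    return num_params * bits_per_weight / 8


def max_params_for_budget(budget_bytes: int, bits_per_weight: int) -> int:
    """Largest parameter count whose weights fit in the budget at this precision."""
    return int(budget_bytes * 8 // bits_per_weight)


# Hypothetical 500 MB weight budget (an illustrative number, not from the talk).
BUDGET_BYTES = 500 * 10**6

params_by_precision = {
    bits: max_params_for_budget(BUDGET_BYTES, bits) for bits in (16, 8, 4, 2)
}

for bits, params in params_by_precision.items():
    print(f"{bits:>2}-bit weights: up to {params / 1e6:,.0f}M parameters")
```

Halving the precision doubles the number of parameters that fit in the same memory, which is the budget within which the larger, lower-precision model competes against the smaller, higher-precision one.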
Outliers pose a major problem for extreme quantization. Meta’s SpinQuant, published at ICLR 2025, learned a smoothing technique during training that allowed quantization below 4 bits without accuracy loss.
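To see why outliers hurt, consider a toy example. A single large value stretches the quantization range so far that every small value rounds to zero. Rotating the vector before quantizing spreads the outlier across all coordinates. SpinQuant, as published, learns its rotation matrices; the sketch below swaps in a fixed Walsh-Hadamard rotation as a stand-in, and the vector and helper names are illustrative assumptions.

```python
import math


def hadamard_rotate(values):
    """Normalized Walsh-Hadamard transform (length must be a power of two).

    The transform is orthogonal and its own inverse.
    """
    out = list(values)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [v / norm for v in out]


def fake_quantize(values, bits=4):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical activation vector: seven small values and one outlier.
activations = [0.5, -0.5, 0.5, 0.5, -0.5, 0.5, -0.5, 8.0]

# Quantize directly: the outlier sets the scale, so small values round to zero.
plain_error = squared_error(activations, fake_quantize(activations))

# Rotate, quantize, rotate back: the outlier's energy is spread over all slots.
rotated = hadamard_rotate(activations)
restored = hadamard_rotate(fake_quantize(rotated))
rotated_error = squared_error(activations, restored)

print(f"error without rotation: {plain_error:.3f}")
print(f"error with rotation:    {rotated_error:.3f}")
```

Because the rotation is orthogonal, it can be undone exactly after quantization, so the only cost is the extra transform.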
Architecture
Chandra explained that model shape (the width of each layer versus the number of layers) can be optimized for on-device use. MobileLLM, presented at ICML 2024, showed that for a fixed memory budget, a tall and narrow architecture with more, smaller layers works better than a wide and shallow one. The team also applied block-wise weight sharing and embedding sharing to reduce parameter counts, resulting in a family of sub-1B models that performed well on specific tasks.
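A rough parameter count shows how two very different shapes can land on the same budget, and what embedding sharing saves. The sketch uses the common rule of thumb of about 12 × d² parameters per transformer block; that approximation, the vocabulary size, and the two shapes are assumptions for illustration, not MobileLLM's actual configurations.

```python
def transformer_params(d_model: int, n_layers: int, vocab_size: int,
                       tie_embeddings: bool) -> int:
    """Rough parameter count for a decoder-only transformer.

    Assumes about 12 * d_model**2 parameters per block (attention plus a
    4x-wide MLP); real architectures differ in the details.
    """
    block = 12 * d_model ** 2
    embedding = vocab_size * d_model
    # Tied (shared) embeddings reuse one matrix for input and output.
    embedding_total = embedding if tie_embeddings else 2 * embedding
    return n_layers * block + embedding_total


VOCAB = 32_000  # hypothetical vocabulary size

deep_thin = transformer_params(512, 32, VOCAB, tie_embeddings=True)
wide_shallow = transformer_params(1024, 8, VOCAB, tie_embeddings=True)
deep_thin_untied = transformer_params(512, 32, VOCAB, tie_embeddings=False)

print(f"deep and thin (tied):    {deep_thin:,}")
print(f"wide and shallow (tied): {wide_shallow:,}")
print(f"saved by tying (d=512):  {deep_thin_untied - deep_thin:,}")
```

Both shapes spend the same number of parameters on transformer blocks; the narrower model also spends less on embeddings, which matter more as the model shrinks.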
More recent work, MobileLLM-Flash, presented at the 2026 Annual Meeting of the Association for Computational Linguistics, scaled the same model to 1B parameters and outperformed comparable models on various benchmarks. Increasing the context window and finetuning improved generalization for real use cases.
Runtime Optimization
Meta experimented with hardware-in-the-loop training, which uses a loss function on every forward pass to simulate results for specific hardware and optimize for latency. The technique reduced overall latency by half.
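The article does not say how the latency term is computed. One common formulation in hardware-aware training, used here purely as an assumption and not as a description of Meta's method, adds an expected-latency penalty to the task loss, using per-operator timings measured on the target device. All names and numbers below are hypothetical.

```python
def expected_latency_ms(choice_probs: dict, latency_table_ms: dict) -> float:
    """Expected latency when each candidate operator is picked with some probability."""
    return sum(prob * latency_table_ms[name] for name, prob in choice_probs.items())


def latency_aware_loss(task_loss: float, choice_probs: dict,
                       latency_table_ms: dict, latency_weight: float) -> float:
    """Task loss plus a weighted latency penalty (illustrative formulation)."""
    return task_loss + latency_weight * expected_latency_ms(choice_probs, latency_table_ms)


# Hypothetical timings for two layer variants, as if measured on the target chip.
latency_table_ms = {"narrow_layer": 2.0, "wide_layer": 5.0}

# Hypothetical probabilities the training procedure currently assigns to each variant.
choice_probs = {"narrow_layer": 0.5, "wide_layer": 0.5}

loss = latency_aware_loss(task_loss=1.0, choice_probs=choice_probs,
                          latency_table_ms=latency_table_ms, latency_weight=0.1)
print(f"combined loss: {loss:.2f}")
```

With a penalty like this, shifting probability toward the faster variant lowers the loss unless it costs too much accuracy, so the optimizer trades the two off on every step.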
A reasoning model based on this work, MobileLLM-R1, was presented at ICLR 2026. Chandra noted that this work challenges the assumption that serious reasoning requires massive cloud models.
Speculative Decoding for Speed
To make on-device agents feel responsive, Meta employed speculative decoding. Instead of one large model generating every token sequentially, one or more smaller draft models propose several tokens ahead, and the target model verifies them together, discarding the incorrect ones. This approach reduces latency by a factor of two or three, Chandra said.
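The propose-and-verify loop can be sketched with toy models. Here a small draft model guesses several tokens ahead, and the target model checks them, keeping the matching prefix and substituting its own token at the first mismatch. This is a minimal greedy variant under stated assumptions: the two "models" are hypothetical stand-in functions, and a real system would verify all proposed tokens in one batched forward pass rather than a Python loop.

```python
from typing import Callable, List, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def greedy_decode(target: NextToken, prompt: Sequence[int], num_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(num_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken, prompt: Sequence[int],
                       num_new_tokens: int, draft_len: int = 4) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify rounds."""
    tokens = list(prompt)
    goal = len(tokens) + num_new_tokens
    verify_rounds = 0
    while len(tokens) < goal:
        # 1. The cheap draft model proposes draft_len tokens.
        proposal: List[int] = []
        for _ in range(draft_len):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks the proposal (one batched pass in a real system).
        verify_rounds += 1
        accepted: List[int] = []
        for guess in proposal:
            correct = target(tokens + accepted)
            accepted.append(correct)  # always keep the target's own token
            if guess != correct:
                break  # discard the rest of the draft after a mismatch
        tokens.extend(accepted)
    return tokens[:goal], verify_rounds


def toy_target(context: Sequence[int]) -> int:
    """Hypothetical target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


def toy_draft(context: Sequence[int]) -> int:
    """Hypothetical draft model: agrees with the target except after a 4."""
    return 0 if context[-1] == 4 else (context[-1] + 1) % 10


baseline = greedy_decode(toy_target, [0], 12)
output, rounds = speculative_decode(toy_target, toy_draft, [0], 12)
print(output == baseline, rounds)
```

The output always matches what the target model would have produced on its own; the saving comes from needing fewer target passes whenever the draft guesses well.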
Vision and Video Understanding
Chandra called vision the most expensive sensing modality. Meta refined its SAM foundation model into EfficientSAM, then compressed and distilled it to SqueezeSAM. Further optimizations produced EdgeTAM for segmentation and tracking across multiple video frames on smartphone hardware, running at 16 fps on an iPhone 15 Pro Max.
While working on efficient video handling, the team found that many frames add no new information. LongVU, presented at ICML 2025, exploits this redundancy to reduce token cost by an order of magnitude, enabling video understanding on edge processors. Another model, VideoAuto-R1, presented at CVPR 2026, reuses reasoning tokens for follow-up questions on the same video, avoiding repeated reasoning.
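The idea of skipping frames that add nothing new can be illustrated with a simple similarity filter: keep a frame only if its features differ enough from the last frame kept. This is a toy sketch, not LongVU's actual pipeline; the two-dimensional feature vectors, the threshold, and the function names are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_informative_frames(frame_features: Sequence[Sequence[float]],
                              threshold: float = 0.95) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    for index, features in enumerate(frame_features):
        if not kept or cosine_similarity(frame_features[kept[-1]], features) < threshold:
            kept.append(index)
    return kept


# Hypothetical per-frame feature vectors (a real system would use an image encoder).
frames = [
    [1.0, 0.0],    # scene A
    [1.0, 0.01],   # nearly identical to scene A
    [0.99, 0.02],  # nearly identical to scene A
    [0.0, 1.0],    # scene B
    [0.01, 1.0],   # nearly identical to scene B
    [1.0, 1.0],    # scene C
]

kept_indices = select_informative_frames(frames)
print(f"kept {len(kept_indices)} of {len(frames)} frames: {kept_indices}")
```

Only the kept frames would be tokenized and passed to the language model, so the token cost scales with how much the scene changes rather than with video length.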
Depth Estimation
DepthLM, a vision-language model, estimates object distance from simple 2D camera images, opening use cases in physical AI.
On-Device Agent as Orchestrator
Chandra described an on-device agent as an orchestrator that understands everything a user sees and perceives, proactively offering needed information. He stressed that models must be built from the ground up for their target devices, starting from hardware constraints and designing upward. The result, he said, will not be a chatbot or the largest model, but a smart, efficient, distributed model that tells the user’s story over the next decade.
