Building multimodal AI for Ray-Ban Meta glasses

March 5, 2025

Multimodal AI – models capable of processing different types of inputs like speech, text, and images – has been transforming user experiences in the wearables space.

With our Ray-Ban Meta glasses, multimodal AI helps the glasses see what the wearer is seeing. This means anyone wearing Ray-Ban Meta glasses can ask them questions about what they’re looking at, and the glasses can provide information about a landmark, translate text, and much more.

But what does it take to bring AI into a wearable device?

On this episode of the Meta Tech Podcast, meet Shane, a research scientist at Meta who has spent the last seven years focusing on computer vision and multimodal AI for wearables. Shane and his team have been behind cutting-edge AI research like AnyMAL, a unified language model that can reason over an array of input signals including text, audio, video, and even IMU motion sensor data.

Shane sits down with Pascal Hartig to share how his team is building foundational models for the Ray-Ban Meta glasses. They talk about the unique challenges of AI glasses and pushing the boundaries of AI-driven wearable technology.

Whether you’re an engineer, a tech enthusiast, or simply curious, this episode has something for everyone!

Download or listen to the episode below:


You can also find the episode wherever you get your podcasts.

The Meta Tech Podcast is a podcast brought to you by Meta, where we highlight the work Meta’s engineers are doing at every level – from low-level frameworks to end-user features.

Send us feedback on Instagram, Threads, or X.

And if you’re interested in learning more about career opportunities at Meta, visit the Meta Careers page.

Timestamps

  • Intro 0:06
  • OSS News 0:56
  • Introducing Shane 1:30
  • The role of research scientist over time 3:03
  • What’s Multi-Modal AI? 5:45
  • Applying Multi-Modal AI in Meta’s products 7:21
  • Acoustic modalities beyond speech 9:17
  • AnyMAL 12:23
  • Encoder zoos 13:53
  • Zero-shot performance 16:25
  • Iterating on models 17:28
  • LLM parameter size 19:29
  • How do we process a request from the glasses? 21:53
  • Processing moving images 23:44
  • Scaling to billions of users 26:01
  • Where lies the optimization potential? 28:12
  • Incorporating feedback 29:08
  • Open-source influence 31:30
  • Be My Eyes Program 33:57
  • Working with industry experts at Meta 36:18
  • Outro 38:55