Google DeepMind Unveils Gemini Robotics AI Models That Can Control Robots in Real-World Environments
Google DeepMind recently announced two new artificial intelligence (AI) models, Gemini Robotics and Gemini Robotics-ER. The models are designed to let robots perform a wide range of tasks in real-world settings, backed by advanced spatial reasoning capabilities.
The Development of Gemini Robotics
In a blog post, Carolina Parada, Senior Director and Head of Robotics at Google DeepMind, emphasized that AI must be capable of "embodied" reasoning to be truly useful in physical environments. Gemini Robotics, the first of the two models, is an advanced vision-language-action (VLA) model built on Gemini 2.0. It introduces a new output modality, physical actions, which allows it to control robots directly.
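The models themselves are not public, but the shape of a VLA's "physical actions" output can be sketched in a few lines of Python. The sketch below is purely illustrative: `Observation`, `Action`, and `vla_policy` are hypothetical stand-ins, not DeepMind's API, and the canned plan replaces what would be a forward pass of the actual model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """Camera frame plus the user's natural-language instruction."""
    image: bytes      # raw RGB frame from the robot's camera
    instruction: str  # e.g. "fold the paper into an origami fox"

@dataclass
class Action:
    """One step of the hypothetical 'physical actions' output:
    a target pose for the robot's end effector."""
    xyz: tuple          # target position in metres
    gripper_open: bool  # whether the gripper should be open

def vla_policy(obs: Observation) -> List[Action]:
    """Placeholder for the VLA model: in the real system this would be
    a forward pass of Gemini Robotics; here we return a canned plan so
    the example is runnable."""
    return [
        Action(xyz=(0.40, 0.05, 0.20), gripper_open=True),   # reach
        Action(xyz=(0.40, 0.05, 0.08), gripper_open=False),  # grasp
        Action(xyz=(0.40, 0.05, 0.25), gripper_open=False),  # lift
    ]

if __name__ == "__main__":
    obs = Observation(image=b"", instruction="pick up the banana")
    for step in vla_policy(obs):
        print(step)  # a real robot driver would execute each pose
```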
DeepMind highlighted three capabilities an AI model for robotics must have to be useful in the physical world: generality, interactivity, and dexterity. Gemini Robotics generalizes well, adapting to new objects, diverse instructions, and unseen environments; in DeepMind's internal testing, it showed significant gains on generalization benchmarks.
The interactivity of Gemini Robotics is powered by Gemini 2.0, enabling the model to understand and respond to commands phrased in everyday, conversational language, across multiple languages. The model also continuously monitors its surroundings, detects changes to the environment or the instructions, and adjusts its actions accordingly.
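As a rough illustration of that monitor-and-replan behavior, the sketch below runs a closed loop over scripted observations. The names `observe` and `plan` are hypothetical stubs, not anything from Gemini Robotics; a real system would replan from live perception rather than canned frames.

```python
# Scripted observations standing in for a camera feed; the third frame
# simulates someone moving the mug mid-task.
frames = iter(["mug on left", "mug on left", "mug moved right",
               "mug moved right", "mug moved right"])

def observe() -> str:
    return next(frames, "mug moved right")  # fall back once frames run out

def plan(scene: str) -> list[str]:
    """Stand-in for the model's planner: maps a scene to action steps."""
    return [f"reach ({scene})", f"grasp ({scene})", f"lift ({scene})"]

scene = observe()
steps = plan(scene)
while steps:
    latest = observe()
    if latest != scene:  # the world changed underneath us
        print("scene changed -> replanning")
        scene, steps = latest, plan(latest)
    print("executing:", steps.pop(0))
```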
Introducing Gemini Robotics-ER
The second AI model, Gemini Robotics-ER (short for embodied reasoning), is also a vision-language model, focused on spatial reasoning in real-world scenarios. Leveraging Gemini 2.0's coding and 3D detection capabilities, it understands spatial relationships and produces the precise movements needed to manipulate objects effectively.
For instance, when shown a coffee mug, Gemini Robotics-ER can determine an appropriate grasp for picking it up safely and a suitable trajectory for approaching it. The model covers the full chain of steps needed to control a robot in a physical environment: perception, state estimation, spatial understanding, planning, and code generation.
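To make that pipeline concrete, here is a minimal sketch of the mug example under stated assumptions: `detect_mug`, `grasp_point`, and `approach_trajectory` are invented placeholders for the model's perception, spatial-reasoning, and planning steps, and all the numbers are arbitrary.

```python
import math
from dataclasses import dataclass

@dataclass
class Pose:
    x: float
    y: float
    z: float
    yaw: float  # object orientation in the robot's frame

def detect_mug(image: bytes) -> Pose:
    """Stand-in for 3D detection: the real model would localise the
    mug from camera input; here we return a fixed pose."""
    return Pose(x=0.45, y=-0.10, z=0.12, yaw=math.pi / 2)

def grasp_point(mug: Pose) -> Pose:
    """Spatial-reasoning step: aim a two-finger grasp at the handle,
    which sits a few centimetres off the mug's centre along its yaw."""
    return Pose(
        x=mug.x + 0.04 * math.cos(mug.yaw),
        y=mug.y + 0.04 * math.sin(mug.yaw),
        z=mug.z,
        yaw=mug.yaw,
    )

def approach_trajectory(target: Pose, clearance: float = 0.10) -> list[Pose]:
    """Planning step: approach from above to avoid knocking the mug over."""
    above = Pose(target.x, target.y, target.z + clearance, target.yaw)
    return [above, target]

if __name__ == "__main__":
    mug = detect_mug(image=b"")              # perception + state estimation
    target = grasp_point(mug)                # spatial understanding
    for wp in approach_trajectory(target):   # planning -> executable commands
        print(f"move_to(x={wp.x:.2f}, y={wp.y:.2f}, z={wp.z:.2f})")
```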
Notably, neither model is publicly available yet. DeepMind plans to integrate them into humanoid robots for further evaluation before any potential future release.
For more information, see the official DeepMind blog post or the Google DeepMind website.