Today, Google launched the preview of a new state-of-the-art robotics embodied reasoning model, Gemini Robotics-ER 1.5. If you’ve ever used Gemini Live with a camera, which allows the model to see what you see, imagine giving it a physical body. That’s how I understand the potential of this model.
Its capabilities range from pointing at and finding objects to planning trajectories and orchestrating long-horizon tasks.
While similar to other Gemini models, Gemini Robotics-ER 1.5 is purpose-built to enhance robotic perception and real-world interaction. It provides advanced reasoning to solve physical problems by interpreting complex visual data, performing spatial reasoning, and planning actions from natural language commands.
Google API Docs
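To get a feel for the pointing capability before strapping it to a drone, here's a rough sketch of how a single frame could be sent to the model using the google-genai Python SDK. The model ID and the exact output format are my assumptions based on the announcement; I haven't verified them against the preview yet.

# Sketch only: the model ID and the JSON output convention are assumptions.
from google import genai
from PIL import Image

client = genai.Client(api_key="YOUR_API_KEY")

# A still frame, e.g. grabbed from the drone's camera feed
frame = Image.open("living_room.jpg")

prompt = (
    "Point to every person and doorway in this image. "
    "Return a JSON list of objects, each with a 'label' and a normalized "
    "'point' as [y, x] in the range 0-1000."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed preview model ID
    contents=[frame, prompt],
)

# Expected shape: [{"label": "person", "point": [412, 507]}, ...]
print(response.text)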
Project Goals
Since I’m not planning to buy additional hardware, I’ll be using a couple of RoboMaster TT drones I already have to see if I can get them to navigate an environment autonomously to carry out instructions. An example would be to “go downstairs and tell me if you see a person”. This is something I’ve been thinking about for quite a while, and this new model seems like the perfect way to pull it off.
A stretch goal is to have each drone act as an agent performing coordinated tasks. Those tasks won’t be anything impressive and have yet to be determined, but the main goal is to learn and explore the possibilities. One example might be coordinating the two drones to check for a person downstairs, passing information between them using their dot matrix screens.
I’d also like to include the ability to talk to them to give them tasks to carry out, even if it’s through my laptop. There’s an ESP-32 attached to each, so I could try to add a mic and speaker, but that’s added weight and complexity for initial experimentation.
Drone API
The Google example for calling a custom robot API seems to be the best fit for my use case. It tells a robotic arm to put a blue block in an orange bowl.
prompt = f"""
You are a robotic arm with six degrees-of-freedom. You have the
following functions available to you:
def move(x, y, high):
  # moves the arm to the given coordinates. The boolean value 'high' set
  # to True means the robot arm should be lifted above the scene for
  # avoiding obstacles during motion. 'high' set to False means the robot
  # arm should have the gripper placed on the surface for interacting with
  # objects.

def setGripperState(opened):
  # Opens the gripper if opened set to true, otherwise closes the gripper

def returnToOrigin():
  # Returns the robot to an initial state. Should be called as a cleanup
  # operation.
The origin point for calculating the moves is at normalized point
y={robot_origin_y}, x={robot_origin_x}. Use this as the new (0,0) for
calculating moves, allowing x and y to be negative.
Perform a pick and place operation where you pick up the blue block at
normalized coordinates ({block_x}, {block_y}) (relative coordinates:
{block_relative_x}, {block_relative_y}) and place it into the orange
bowl at normalized coordinates ({bowl_x}, {bowl_y})
(relative coordinates: {bowl_relative_x}, {bowl_relative_y}).
Provide the sequence of function calls as a JSON list of objects, where
each object has a "function" key (the function name) and an "args" key
(a list of arguments for the function).
Also, include your reasoning before the JSON output.
For example:
Reasoning: To pick up the block, I will first move the arm to a high
position above the block, open the gripper, move down to the block,
close the gripper, lift the arm, move to a high position above the bowl,
move down to the bowl, open the gripper, and then lift the arm back to
a high position.
"""
Here’s an example of the drone’s API functions that could be called instead to allow the model to explore its environment:
# take off
flight.takeoff().wait_for_completed()
# fly up 100 cm (1 m)
flight.up(distance=100).wait_for_completed()
# fly forward 200 cm
flight.forward(distance=200).wait_for_completed()
# rotate clockwise 90°
flight.rotate(angle=90).wait_for_completed()
# land
flight.land().wait_for_completed()
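Wiring the two together, the idea is to describe these flight functions in the prompt the same way the arm example does, then dispatch whatever JSON the model returns onto the RoboMaster SDK. A rough sketch, assuming the model sticks to the same "function"/"args" format and using the extract_calls helper from above:

# Sketch: dispatches the model's JSON function calls onto the RoboMaster TT flight API.
from robomaster import robot

drone = robot.Drone()
drone.initialize()
flight = drone.flight

# Whitelist of actions the model is allowed to call, mirroring what the prompt describes.
ACTIONS = {
    "takeoff": lambda: flight.takeoff(),
    "up": lambda distance: flight.up(distance=distance),
    "forward": lambda distance: flight.forward(distance=distance),
    "rotate": lambda angle: flight.rotate(angle=angle),
    "land": lambda: flight.land(),
}

def execute(calls):
    # Run each {"function": ..., "args": [...]} object from the model in order,
    # blocking until the drone finishes each maneuver.
    for call in calls:
        action = ACTIONS[call["function"]]
        action(*call.get("args", [])).wait_for_completed()

# e.g. execute(extract_calls(response.text))
drone.close()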
I’ll be sharing how it goes and will add links to articles related to this project here.
Until then, you can learn all about the model in the Google Developers blog article that announced the release.