How Language Helps Robots Get a F3RM Grasp
Inspired by humans' ability to handle unfamiliar objects, a group from MIT designed Feature Fields for Robotic Manipulation (F3RM), a system that blends 2D images with foundation model features into 3D scenes to help robots identify and grasp nearby items. Watch this video to learn more.
Transcript
00:00:01 (air whooshing) (inspirational spacey music) - Today we're gonna look at a demo of our project Feature Fields for Robotic Manipulation. - The purpose of this project is to give these robots the ability to understand the world in 3D. - [William] We're going to get the robot to grasp these objects on the table using open-ended language. So what the robot's doing now,
00:00:31 is it's picked up this camera mounted on a selfie stick and it's going to take 50 RGB images of the scene. What this gives us is a holistic understanding of the entire scene. - [Ge] So a neural radiance field is a way to turn these images into a 3D model of the scene that can produce highly photorealistic renders of the scene. - And in addition to just the neural radiance fields, we also train a 3D feature field. And this 3D feature field lifts features
00:00:57 from 2D vision foundation models, which are trained on internet-scale datasets, into a 3D feature representation. - [Ge] The robot hasn't seen these objects before during training. In fact, the robot doesn't even know which object we're gonna ask it to pick up until we tell it, so this is what makes this task open-ended. - Here we've exported a point cloud of the scene using the neural radiance field we just trained.
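To make the feature-lifting step concrete, here is a minimal sketch of how 2D foundation-model features can be distilled into a 3D feature field trained alongside a NeRF: an extra feature head is alpha-composited along each ray with the same volume-rendering weights as color, then supervised against the 2D feature at the corresponding pixel. The network shape, feature dimension, and loss weighting below are illustrative assumptions, not the F3RM implementation.

```python
# Minimal sketch of distilling 2D foundation-model features into a 3D
# feature field alongside a NeRF. Architecture, feature dimension, and
# loss weighting are illustrative assumptions, not the F3RM codebase.
import torch
import torch.nn as nn

class FeatureField(nn.Module):
    """Maps a 3D point to (density, color, distilled feature)."""

    def __init__(self, feat_dim: int = 512, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Linear(hidden, 3)
        self.feature_head = nn.Linear(hidden, feat_dim)  # lifted 2D features

    def forward(self, xyz: torch.Tensor):
        h = self.trunk(xyz)
        sigma = torch.relu(self.density_head(h))   # volume density
        rgb = torch.sigmoid(self.color_head(h))    # color
        feat = self.feature_head(h)                # per-point feature
        return sigma, rgb, feat

def render_along_ray(sigma, values, deltas):
    """Standard volume rendering: alpha-composite `values` along one ray."""
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)                      # (S,)
    trans = torch.cumprod(torch.cat([torch.ones(1), 1 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * trans                                                   # (S,)
    return (weights.unsqueeze(-1) * values).sum(dim=0)

# Training step (sketch): the rendered color matches the photo, and the
# rendered feature matches the 2D foundation-model feature at that pixel.
field = FeatureField()
samples = torch.rand(64, 3)        # points sampled along one camera ray
deltas = torch.full((64,), 0.02)   # distances between consecutive samples
pixel_rgb = torch.rand(3)          # ground-truth pixel color
pixel_feat = torch.rand(512)       # e.g. a CLIP patch feature (placeholder)

sigma, rgb, feat = field(samples)
loss = nn.functional.mse_loss(render_along_ray(sigma, rgb, deltas), pixel_rgb) \
     + nn.functional.mse_loss(render_along_ray(sigma, feat, deltas), pixel_feat)
loss.backward()
```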
00:01:17 As you can see, we can go from 2D images to this 3D representation. So now let's ask the robot to grasp specific objects via language. Here you can see the grasps overlaid on top of that point cloud. And now we can ask the robot to find a motion plan to execute on the robot. Here's a visualization of what we expect the robot to do, which is pick up Baymax
00:01:36 and you can see the robot's finding a path to grasp Baymax, pick it, and then place it into the bin on the other side of the table. Now let's see the robot planning to grasp the bowl, and the important thing is that the robot has never seen how to grasp a bowl before. It's going to pick it and then place it into the bin. - [Ge] Now we're gonna test if the robot can pick up the spatula. The spatula's a pretty difficult object
00:01:57 because it's held at a very awkward angle and it's also very thin, so it's difficult to grasp. So what you just saw is the robot being able to pick up all of these objects, despite the fact that it has never seen these objects before. - And in the future, we really want to make this representation more real-time so we can enable robots to do more dynamic tasks in the household, in the factory, and in manufacturing environments.
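The open-ended language query shown in the demo can be sketched as a similarity search over the distilled features: embed the prompt with a vision-language model (CLIP is assumed here), score each point of the exported cloud against its feature, and treat the best-matching region as the grasp target. The `query_grasp_region` helper and the placeholder point cloud below are hypothetical, and F3RM's actual grasp-pose selection is more involved than this centroid heuristic.

```python
# Sketch of the language-query step: embed a text prompt, score every point
# in the exported cloud by cosine similarity with its distilled feature,
# and take the highest-scoring region as the grasp target. The point cloud
# and per-point features here are placeholders standing in for values read
# out of a trained feature field.
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def query_grasp_region(points, point_features, prompt, top_k=256):
    """Return the top_k points whose features best match the text prompt."""
    with torch.no_grad():
        text = model.encode_text(tokenizer([prompt]))        # (1, 512)
    text = torch.nn.functional.normalize(text, dim=-1)
    feats = torch.nn.functional.normalize(point_features, dim=-1)
    sims = (feats @ text.T).squeeze(-1)                      # (N,)
    idx = sims.topk(top_k).indices
    return points[idx], sims[idx]

# Usage with placeholder data standing in for the exported point cloud.
points = torch.rand(10_000, 3)
point_features = torch.rand(10_000, 512)
target_points, scores = query_grasp_region(points, point_features, "the spatula")
grasp_center = target_points.mean(dim=0)  # naive grasp point: region centroid
```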

