(Image: YouTube via University of Washington)

Researchers working on large artificial intelligence models like ChatGPT have vast swaths of internet text, photos, and videos to train systems. But roboticists training physical machines face barriers: Robot data is expensive, and because there aren’t fleets of robots roaming the world at large, there simply isn’t enough data easily available to make them perform well in dynamic environments, such as people’s homes.

Two new studies from University of Washington researchers introduce AI systems that use either video or photos to create simulations that can train robots to function in real settings. This could significantly lower the costs of training robots to function in complex settings.

In the first study, a user quickly scans a space with a smartphone to record its geometry. The system, called RialTo, can then create a “digital twin” simulation of the space, where the user can enter how different things function (opening a drawer, for instance). A robot can then virtually repeat motions in the simulation with slight variations to learn to do them effectively. In the second study, the team built a system called URDFormer, which takes images of real environments from the internet and quickly creates physically realistic simulation environments where robots can train.

“In a factory, for example, there’s a ton of repetition,” said URDFormer lead author Zoey Chen, a UW doctoral student at the Paul G. Allen School of Computer Science & Engineering. “The tasks might be hard to do, but once you program a robot, it can keep doing the task over and over and over. Whereas homes are unique and constantly changing. There’s a diversity of objects, of tasks, of floorplans, and of people moving through them. This is where AI becomes really useful to roboticists.”

Here is an exclusive Tech Briefs interview, edited for length and clarity, with Chen.

Tech Briefs: What was the biggest technical challenge you faced while developing URDFormer?

Chen: The biggest challenge was to come up with a framework that is general across diverse scenes and objects. We want one versatile framework that predicts structures not only for single articulated objects such as cabinets and ovens but also for entire scenes such as kitchens and bedrooms. I remember spending a lot of time thinking about what humans do when reasoning about spatial structures. For example, if I'm a 3D designer who wants to create a digital twin of a kitchen from an image, I would first draw the high-level layout (what objects are on which wall), then go into each individual object to fill in the details, such as how many drawers and doors there are and how they are connected. That inspired me to develop the current URDFormer framework.

Intuitively, the framework consists of a global network and a part network, which share the same architecture but are trained independently on different data. The global network is trained to recover the high-level layout of the scene from the image, and the part network is trained to fill in the details of each object.
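For readers who want a concrete picture, here is a minimal sketch in Python/PyTorch of the two-stage idea Chen describes: a global network predicts the scene-level layout from an image, and a part network with the same architecture predicts each object's internal structure (drawers, doors, joints). This is not the released URDFormer code; the class names, feature dimensions, and label counts are illustrative assumptions.

import torch
import torch.nn as nn

class StructurePredictor(nn.Module):
    """Shared architecture: an image plus 2D boxes -> per-box structure predictions."""
    def __init__(self, feat_dim=256, num_classes=16, num_joint_types=4):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a real image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.box_embed = nn.Linear(4, feat_dim)   # embed each normalized bounding box
        self.class_head = nn.Linear(feat_dim, num_classes)      # what each box is (drawer, door, ...)
        self.parent_head = nn.Linear(feat_dim, feat_dim)        # which node each box attaches to
        self.joint_head = nn.Linear(feat_dim, num_joint_types)  # prismatic, revolute, fixed, ...

    def forward(self, image, boxes):
        # image: (B, 3, H, W); boxes: (B, N, 4) normalized 2D bounding boxes
        img_feat = self.backbone(image).unsqueeze(1)   # (B, 1, D)
        tokens = self.box_embed(boxes) + img_feat      # (B, N, D) box tokens conditioned on the image
        parent_logits = tokens @ self.parent_head(tokens).transpose(1, 2)  # (B, N, N) attachment scores
        return self.class_head(tokens), parent_logits, self.joint_head(tokens)

# Same architecture, trained independently on different data, as described above:
global_net = StructurePredictor()  # scene image + object boxes -> high-level layout
part_net = StructurePredictor()    # object crop + part boxes -> drawers, doors, joints

image = torch.rand(1, 3, 224, 224)
boxes = torch.rand(1, 5, 4)
classes, parents, joints = global_net(image, boxes)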

Tech Briefs: How did this project come about? What was the catalyst for your work?

Chen: One of the biggest challenges of robotics is that we don’t have tons of useful real-world data to train robots, so many researchers have also been looking into training robots in simulation because it’s safe and fast. Well, it turns out that creating 3D assets and digital scenes in simulation can also take a lot of time. For example, it might take more than 900 hours for a professional 3D artist to create a digital environment of one apartment. To train a robot that generalizes, we need a lot of apartments and objects, which, as a result, can be really expensive.

At the same time, there are millions of images of objects and home designs on the internet. So, we said, 'Why don't we train a network that, given images of your home or images downloaded from the internet, predicts their digital copies at scale?' We want to predict fully articulated digital scenes so we can place a robot there and train it on different tasks really efficiently. That's what motivated us to work on this project. It's also funny that I had a similar idea in the first year of my Ph.D., which was to use generative models to create photorealistic data from a physics simulation and use that data to train robots to reason about articulation.

However, the generative models were not really powerful at that time, so the quality of the training data was not really great. I’m really glad that I have a chance to come back to give it another try, thanks to the progress in GenAI.

The biggest advantage of URDFormer is that it doesn’t require a video scan or a depth camera and directly works on a single image. This gives us a chance to generate a digital environment from the internet cheaply at scale.

Tech Briefs: Can you explain in simple terms how URDFormer works?

Chen: URDFormer takes an image of your home and predicts its corresponding digital version, which is fully interactive. You can load this into a physics simulator and train a robot to learn different tasks. To do this, the biggest question is where to get the data so that we can train a network that takes real-world images and predicts simulation-ready digital copies. Often, we can cheaply generate random objects and scenes in simulation, but the rendered images look synthetic.

We address this problem by leveraging a generative model to convert these synthetic images into corresponding photorealistic images. So now we have realistic images along with all the parameters of the scenes to train a network. Since the training data are photorealistic, the trained network can take real-world images and predict all the parameters, such as articulations and layouts.
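As a rough illustration of the data pipeline Chen outlines, the sketch below shows the flow: procedurally generate a random articulated scene whose parameters are known, render it, convert the synthetic render into a photorealistic image with a generative model, and pair the result with the ground-truth parameters as training data. This is not the paper's actual code; the simulator, renderer, and generative model are hypothetical stubs.

import random

def sample_random_scene():
    """Stand-in for procedurally generating a random articulated scene in simulation."""
    return {
        "cabinets": random.randint(1, 4),
        "drawers_per_cabinet": random.randint(1, 3),
        "joint_type": random.choice(["prismatic", "revolute"]),
    }

def render_synthetic(scene):
    """Stand-in for the simulator's renderer; returns a synthetic-looking image."""
    return f"synthetic_render_{scene['cabinets']}_cabinets"

def make_photorealistic(synthetic_image):
    """Stand-in for an image-to-image generative model that keeps the layout
    but replaces textures and lighting with realistic ones."""
    return synthetic_image.replace("synthetic_render", "photorealistic_image")

def build_dataset(num_examples=1000):
    dataset = []
    for _ in range(num_examples):
        scene = sample_random_scene()                        # ground-truth parameters are known
        realistic = make_photorealistic(render_synthetic(scene))
        dataset.append((realistic, scene))                   # realistic image paired with full labels
    return dataset

# These (photorealistic image, scene parameters) pairs supervise a network that,
# at test time, maps a single real photo to a fully articulated digital scene.
pairs = build_dataset(10)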

Tech Briefs: Do you have plans for further research/work/etc.?

Chen: One thing we found through the URDFormer experiments is that even if the predicted digital scenes slightly mismatch the real world, we can still train a robot by augmenting the prediction, such as by shifting the handle or switching the drawer shape. Intuitively, a robot that has seen similar kitchens will perform better than a robot that has seen a bunch of random and entirely different kitchens.

One interesting future research direction is to come up with a better data augmentation method that can cover the real-world distribution more efficiently. This includes more reasonable shape augmentation and common-sense physics parameters such as object mass and friction. We are also interested in how we can pretrain a generalizable robot policy first in many simulation environments, so that the robot can safely do basic tasks, and then fine-tune it using real-world demonstrations so it can perform tasks requiring higher precision.
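The augmentation idea Chen describes, perturbing a predicted scene's details and physics so a policy tolerates prediction errors, might look roughly like the following sketch. This is not code from the paper; the fields and randomization ranges are illustrative assumptions.

import copy
import random

def augment_scene(predicted_scene, num_variants=32):
    """Create many perturbed copies of a predicted scene for robust training."""
    variants = []
    for _ in range(num_variants):
        scene = copy.deepcopy(predicted_scene)
        for drawer in scene["drawers"]:
            drawer["handle_offset"] += random.uniform(-0.03, 0.03)      # shift the handle (meters)
            drawer["shape"] = random.choice(["bar", "knob", "recessed"])  # switch the drawer/handle shape
            drawer["mass"] *= random.uniform(0.5, 2.0)                   # randomize physics parameters
            drawer["friction"] = random.uniform(0.2, 1.0)
        variants.append(scene)
    return variants

predicted = {"drawers": [{"handle_offset": 0.0, "shape": "bar",
                          "mass": 2.0, "friction": 0.5}]}
training_scenes = augment_scene(predicted)   # train the policy across all perturbed copies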