(Image: YouTube via University of Washington)

Researchers working on large artificial intelligence models like ChatGPT have vast swaths of internet text, photos, and videos to train systems. But roboticists training physical machines face barriers: Robot data is expensive, and because there aren’t fleets of robots roaming the world at large, there simply isn’t enough data easily available to make them perform well in dynamic environments, such as people’s homes.

Two new studies from University of Washington researchers introduce AI systems that use either video or photos to create simulations that can train robots to function in real settings. This could significantly lower the costs of training robots to function in complex settings.

In the first study, a user quickly scans a space with a smartphone to record its geometry. The system, called RialTo, can then create a “digital twin” simulation of the space, where the user can enter how different things function (opening a drawer, for instance). A robot can then virtually repeat motions in the simulation with slight variations to learn to do them effectively. In the second study, the team built a system called URDFormer, which takes images of real environments from the internet and quickly creates physically realistic simulation environments where robots can train.

“In a factory, for example, there’s a ton of repetition,” said URDFormer lead author Zoey Chen, a UW doctoral student at the Paul G. Allen School of Computer Science & Engineering. “The tasks might be hard to do, but once you program a robot, it can keep doing the task over and over and over. Whereas homes are unique and constantly changing. There’s a diversity of objects, of tasks, of floorplans, and of people moving through them. This is where AI becomes really useful to roboticists.”

Here is an exclusive Tech Briefs interview, edited for length and clarity, with Chen.

Tech Briefs: What was the biggest technical challenge you faced while developing URDFormer?

Chen: The biggest challenge was to come up with a framework that is general across diverse scenes and objects. We want one versatile framework that predicts structures not only for single articulated objects such as cabinets and ovens but also for entire scenes such as kitchens and bedrooms. I remember spending a lot of time thinking about what humans do when reasoning about spatial structures. For example, if I'm a 3D designer who wants to create a digital twin of a kitchen from an image, I would first draw the high-level layout (what objects are on which wall), then go into each individual object to fill in the details, such as how many drawers and doors there are and how they are connected. That inspired me to develop the current URDFormer framework.

Intuitively, the framework consists of a global network and a part network, which share the same architecture but are trained independently on different data. The global network is trained to recover the high-level layout of the scene from the image, and the part network is trained to fill in the details of each object.
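For readers who want a concrete picture, here is a minimal sketch in Python/PyTorch of the two-stage idea Chen describes: a global network predicts the scene-level layout from an image, and a part network with the same architecture predicts each object's internal structure (drawers, doors, joints). This is not the released URDFormer code; the class names, feature dimensions, and label counts are illustrative assumptions.

import torch
import torch.nn as nn

class StructurePredictor(nn.Module):
    """Shared architecture: an image plus 2D boxes -> per-box structure predictions."""
    def __init__(self, feat_dim=256, num_classes=16, num_joint_types=4):
        super().__init__()
        self.backbone = nn.Sequential(            # stand-in for a real image encoder
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.box_embed = nn.Linear(4, feat_dim)   # embed each normalized bounding box
        self.class_head = nn.Linear(feat_dim, num_classes)      # what each box is (drawer, door, ...)
        self.parent_head = nn.Linear(feat_dim, feat_dim)        # which node each box attaches to
        self.joint_head = nn.Linear(feat_dim, num_joint_types)  # prismatic, revolute, fixed, ...

    def forward(self, image, boxes):
        # image: (B, 3, H, W); boxes: (B, N, 4) normalized 2D bounding boxes
        img_feat = self.backbone(image).unsqueeze(1)   # (B, 1, D)
        tokens = self.box_embed(boxes) + img_feat      # (B, N, D) box tokens conditioned on the image
        parent_logits = tokens @ self.parent_head(tokens).transpose(1, 2)  # (B, N, N) attachment scores
        return self.class_head(tokens), parent_logits, self.joint_head(tokens)

# Same architecture, trained independently on different data, as described above:
global_net = StructurePredictor()  # scene image + object boxes -> high-level layout
part_net = StructurePredictor()    # object crop + part boxes -> drawers, doors, joints

image = torch.rand(1, 3, 224, 224)
boxes = torch.rand(1, 5, 4)
classes, parents, joints = global_net(image, boxes)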

Tech Briefs: How did this project come about? What was the catalyst for your work?

Chen: One of the biggest challenges of robotics is that we don’t have tons of useful real-world data to train robots, so many researchers have also been looking into training robots in simulation because it’s safe and fast. Well, it turns out that creating 3D assets and digital scenes in simulation can also take a lot of time. For example, it might take more than 900 hours for a professional 3D artist to create a digital environment of one apartment. To train a robot that generalizes, we need a lot of apartments and objects, which, as a result, can be really expensive.

At the same time, there are millions of images of objects and home designs on the internet. So, we said, 'Why don't we train a network that, given images of your home or images downloaded from the internet, predicts their digital copies at scale?' We want to predict fully articulated digital scenes so we can place a robot there and train it on different tasks really efficiently. That's what motivated us to work on this project. It's also funny that I had a similar idea in the first year of my Ph.D., which was to use generative models to create photorealistic data from a physics simulation and use that data to train robots to reason about articulation.

However, the generative models were not really powerful at that time, so the quality of the training data was not really great. I’m really glad that I have a chance to come back to give it another try, thanks to the progress in GenAI.

The biggest advantage of URDFormer is that it doesn’t require a video scan or a depth camera and directly works on a single image. This gives us a chance to generate a digital environment from the internet cheaply at scale.

Tech Briefs: Can you explain in simple terms how URDFormer works?

Chen: URDFormer takes an image of your home and predicts its corresponding digital version, which is fully interactive. You can load this into a physics simulator and train a robot to learn different tasks. To do this, the biggest question is where to get the data so that we can train a network that takes real-world images and predicts simulation-ready digital copies. Often, we can cheaply generate random objects and scenes in simulation, but the rendered images look synthetic.

We address this problem by leveraging a generative model to convert these synthetic images into corresponding photorealistic images. So now we have realistic images along with all the parameters of the scenes to train a network. Since the training data are photorealistic, the trained network can take real-world images and predict all the parameters, such as articulations and layouts.
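As a rough illustration of the data pipeline Chen outlines, the sketch below shows the flow: procedurally generate a random articulated scene whose parameters are known, render it, convert the synthetic render into a photorealistic image with a generative model, and pair the result with the ground-truth parameters as training data. This is not the paper's actual code; the simulator, renderer, and generative model are hypothetical stubs.

import random

def sample_random_scene():
    """Stand-in for procedurally generating a random articulated scene in simulation."""
    return {
        "cabinets": random.randint(1, 4),
        "drawers_per_cabinet": random.randint(1, 3),
        "joint_type": random.choice(["prismatic", "revolute"]),
    }

def render_synthetic(scene):
    """Stand-in for the simulator's renderer; returns a synthetic-looking image."""
    return f"synthetic_render_{scene['cabinets']}_cabinets"

def make_photorealistic(synthetic_image):
    """Stand-in for an image-to-image generative model that keeps the layout
    but replaces textures and lighting with realistic ones."""
    return synthetic_image.replace("synthetic_render", "photorealistic_image")

def build_dataset(num_examples=1000):
    dataset = []
    for _ in range(num_examples):
        scene = sample_random_scene()                        # ground-truth parameters are known
        realistic = make_photorealistic(render_synthetic(scene))
        dataset.append((realistic, scene))                   # realistic image paired with full labels
    return dataset

# These (photorealistic image, scene parameters) pairs supervise a network that,
# at test time, maps a single real photo to a fully articulated digital scene.
pairs = build_dataset(10)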

Tech Briefs: Do you have plans for further research/work/etc.?

Chen: One thing we found through the URDFormer experiments is that even if the predicted digital scenes slightly mismatch the real world, we can still train a robot by augmenting the prediction, such as by shifting the handle or switching the drawer shape. Intuitively, a robot that has seen similar kitchens will perform better than a robot that has seen a bunch of random and entirely different kitchens.

One interesting future research direction is to come up with a better data augmentation method that can cover the real-world distribution more efficiently. This includes more reasonable shape augmentation and common-sense physics parameters such as object mass and friction. We are also interested in how we can pretrain a generalizable robot policy first in many simulation environments, so that the robot can safely do basic tasks, and then fine-tune it using real-world demonstrations so it can perform tasks requiring higher precision.
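The augmentation idea Chen describes, perturbing a predicted scene's details and physics so a policy tolerates prediction errors, might look roughly like the following sketch. This is not code from the paper; the fields and randomization ranges are illustrative assumptions.

import copy
import random

def augment_scene(predicted_scene, num_variants=32):
    """Create many perturbed copies of a predicted scene for robust training."""
    variants = []
    for _ in range(num_variants):
        scene = copy.deepcopy(predicted_scene)
        for drawer in scene["drawers"]:
            drawer["handle_offset"] += random.uniform(-0.03, 0.03)      # shift the handle (meters)
            drawer["shape"] = random.choice(["bar", "knob", "recessed"])  # switch the drawer/handle shape
            drawer["mass"] *= random.uniform(0.5, 2.0)                   # randomize physics parameters
            drawer["friction"] = random.uniform(0.2, 1.0)
        variants.append(scene)
    return variants

predicted = {"drawers": [{"handle_offset": 0.0, "shape": "bar",
                          "mass": 2.0, "friction": 0.5}]}
training_scenes = augment_scene(predicted)   # train the policy across all perturbed copies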