In “Star Trek: The Next Generation,” Captain Picard and the crew of the U.S.S. Enterprise leverage the holodeck, an empty room capable of generating 3D environments, to prepare for missions and to entertain themselves, simulating everything from lush jungles to the London of Sherlock Holmes. Deeply immersive and fully interactive, holodeck-created environments are infinitely customizable, using nothing but language: the crew has only to ask the computer to generate an environment, and that space appears in the holodeck.
Today, virtual interactive environments are also used to train robots prior to real-world deployment in a process called “Sim2Real.” However, virtual interactive environments have been in surprisingly short supply.
“Artists manually create these environments,” said Yue Yang, a doctoral student in the labs of Mark Yatskar and Chris Callison-Burch, Assistant and Associate Professors in Computer and Information Science (CIS) at the University of Pennsylvania, respectively. “Those artists could spend a week building a single environment,” Yang added, noting all the decisions involved, from the layout of the space to the placement of objects to the colors employed in rendering.
That paucity of virtual environments is a problem if you want to train robots to navigate the real world with all its complexities. Neural networks, the systems powering today’s AI revolution, require massive amounts of data, which in this case means simulations of the physical world.
“Generative AI systems like ChatGPT are trained on trillions of words, and image generators like Midjourney and DALL-E are trained on billions of images,” said Callison-Burch. “We only have a fraction of that amount of 3D environments for training so-called ‘embodied AI.’ If we want to use generative AI techniques to develop robots that can safely navigate in real-world environments, then we will need to create millions or billions of simulated environments.”
Enter Holodeck, a system for generating interactive 3D environments co-created by Callison-Burch, Yatskar, Yang and Lingjie Liu, Aravind K. Joshi Assistant Professor in CIS, along with collaborators at Stanford, the University of Washington, and the Allen Institute for Artificial Intelligence (AI2). Named for its “Star Trek” forebear, Holodeck generates a virtually limitless range of indoor environments, using AI to interpret users’ requests.
“We can use language to control it,” said Yang. “You can easily describe whatever environments you want and train the embodied AI agents.”
Holodeck leverages the knowledge embedded in large language models (LLMs), the systems underlying ChatGPT and other chatbots.
“Language is a very concise representation of the entire world,” said Yang. Indeed, LLMs turn out to have a surprisingly high degree of knowledge about the design of spaces, thanks to the vast amounts of text they ingest during training. In essence, Holodeck works by engaging an LLM in conversation, using a carefully structured series of hidden queries to break down user requests into specific parameters.
Here is an exclusive Tech Briefs interview, edited for length and clarity, with Yang and Callison-Burch.
Tech Briefs: What was the biggest technical challenge you faced while developing Holodeck?
Callison-Burch: “Holodeck” is a name that's borrowed from “Star Trek: The Next Generation” with the idea that you can create this immersive 3D environment in the show just by talking to the computer and describing what you want. That feels like what we were doing.
One technical challenge came from the success we've seen in generative AI at creating text, images, and now even music and voices. So, the goal was: how do we leverage some of the information in these systems to create realistic 3D environments? We're not generating the 3D images from scratch, like you are in the image-generation programs. Instead, what we're doing is selecting from this huge collection of a million 3D objects called the Objaverse, and then we’re using the knowledge that's present in large language models to arrange and position them in a way that's realistic.
The application that we were focused on for this paper was creating realistic indoor simulated environments to train robots. The past work at AI2 had shown that if you first trained robots through reinforcement learning in a simulated environment, then when you deploy them in real life, they do better than if you don't pretrain them. And so by using this methodology we were able to create a much more diverse set of indoor environments for this simulated-to-real transition for the robot training.
Yang: The biggest challenge was how to position the objects, how to arrange them coherently in a room. This required a lot of human knowledge. For example, you have a dining room table and a lot of chairs. How should you place them? You should arrange the chairs around the table.
It is hard for an AI model to learn this ability, but we can leverage knowledge from large language models. Large language models know that the chairs should face the table and that you should put them around it. We can distill this knowledge to help design the layout of the room. That's a big challenge we’ve overcome.
Another challenge was the diversity of objects. If you want a very creative room, it'll require some specific 3D objects that we may not have in our database. But in our latest system, we can use text-to-3D models to generate 3D objects.
Callison-Burch: To elaborate, a huge challenge for AI is that a lot of human knowledge is, to us, common sense. What we need to do is figure out how to transfer the common-sense knowledge that we all have as human beings into a form that a computer program could execute. Take the idea of which direction a chair should face with respect to a table: any person could do that automatically; it's obvious. But to come up with an explicit program that does that for all objects that you would find in an entire house is really difficult. So, using the knowledge that gets encoded into these large language models, we're able to convey that without having to explicitly program it.
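To make the chair-and-table example concrete, here is a minimal Python sketch of what one hand-written placement rule might look like. It is an illustration only, not Holodeck's code: the function name `place_chairs_around_table`, the circular table, and the `clearance` parameter are all assumptions made for this example. It also shows why the explicit approach scales poorly, since a rule like this covers exactly one pair of object types.

```python
import math
from typing import List, Tuple

# Hypothetical helper (not part of Holodeck): a hand-coded rule for one
# object pair. Every other pair of objects would need its own rule.
def place_chairs_around_table(
    center: Tuple[float, float],
    table_radius: float,
    n_chairs: int,
    clearance: float = 0.4,
) -> List[Tuple[float, float, float]]:
    """Return (x, y, yaw_degrees) for chairs spaced evenly around a round table.

    yaw_degrees is the heading that points from the chair to the table center,
    measured counterclockwise from the +x axis, in the range (-180, 180].
    """
    cx, cy = center
    ring = table_radius + clearance  # distance from table center to each chair
    placements = []
    for i in range(n_chairs):
        angle = 2 * math.pi * i / n_chairs
        x = cx + ring * math.cos(angle)
        y = cy + ring * math.sin(angle)
        # Face the chair toward the table center.
        yaw = math.degrees(math.atan2(cy - y, cx - x))
        placements.append((x, y, yaw))
    return placements


if __name__ == "__main__":
    for x, y, yaw in place_chairs_around_table((0.0, 0.0), 1.0, 4, clearance=0.5):
        print(f"chair at ({x:.2f}, {y:.2f}) facing {yaw:.0f} degrees")
```

As described in the interview, Holodeck avoids writing rules like this by hand and instead draws such spatial relationships from a large language model.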
Tech Briefs: Can you explain in simple terms how it works, please?
Yang: The user input is just a text prompt; the output is a fully interactive 3D environment. We decompose the generation process into several steps. We have a floor module, which designs a floor plan and chooses materials for the floors and walls. We have a doorway and window module: how many windows do I need, and how should I connect two rooms? You then select objects from a database, and the layout module is about how you should arrange those objects. How should you position them in a reasonable way? It’s a step-by-step pipeline. And, at each step, we prompt large language models to get the output. It's like a conversation.
We have a conversation with the large language model to get the information we want, and we convert that information into the specifications for the 3D environments.
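The step-by-step conversation Yang describes can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the actual Holodeck implementation: the stage names, the prompt wording, the JSON format, and the `ask_llm` callable are all hypothetical, and `fake_llm` is a canned stand-in so the example runs without any model or network access.

```python
import json
from typing import Callable, Dict

# Hypothetical stage names, loosely following the modules named in the interview.
STAGES = ["floor_plan", "doors_and_windows", "object_selection", "layout"]


def build_scene_spec(request: str, ask_llm: Callable[[str], str]) -> Dict[str, object]:
    """Query the model once per stage and collect the answers into one spec."""
    spec: Dict[str, object] = {"request": request}
    for stage in STAGES:
        # Each prompt carries the decisions made so far, like turns in a conversation.
        prompt = (
            f"User request: {request}\n"
            f"Decisions so far: {json.dumps(spec, sort_keys=True)}\n"
            f"Return JSON for the '{stage}' step."
        )
        spec[stage] = json.loads(ask_llm(prompt))
    return spec


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a real language model (assumption for this sketch)."""
    canned = {
        "floor_plan": {"rooms": ["game room"], "floor_material": "wood"},
        "doors_and_windows": {"doors": 1, "windows": 2},
        "object_selection": {"objects": ["table", "chair", "arcade cabinet"]},
        "layout": {"chair": {"near": "table", "faces": "table"}},
    }
    for stage, answer in canned.items():
        if f"'{stage}' step" in prompt:
            return json.dumps(answer)
    raise ValueError("unknown stage in prompt")


if __name__ == "__main__":
    scene = build_scene_spec("a game room", fake_llm)
    print(json.dumps(scene, indent=2))
```

In a real system, the resulting specification would then be handed to a 3D engine to build the interactive environment; that conversion step is outside the scope of this sketch.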
Callison-Burch: The first goal of this was indoor household scenes, because that was very useful for our robot training.
Tech Briefs: What are your next steps? Where do you go from here? Do you have any plans for further research or work?
Yang: We do have some plans for future work. Once we have those environments, the next question is: what can we get robots to do in them? In previous research, the robots could only do some very simple tasks. One task is called object navigation: the robot just navigates to an object.
But with the help of Holodeck, you can generate diverse rooms, like a game room. So, I think a natural next step is how should we design diverse tasks and train robots to do more complicated things with Holodeck environments.
Callison-Burch: Following along that idea, robots aren't only for navigation. One of the really interesting things about this method that we developed is that we're distilling knowledge about the world from large language models and putting it into embodied agents. There are lots of other kinds of knowledge that could be added. Things like how do you operate a stove or how do you interact with it to turn it on? Or what are the steps you need to take to plan to make a meal? All those kinds of things are also present in a large language model, and I think they could be transferred to these embodied agent tasks.
Another thing that I think is quite an interesting direction is: I've been teaching a course here on applying AI to games. We have a course on interactive fiction and text generation, so thinking about how these sorts of methodologies could be useful for game design and other types of interactive fiction is also an exciting future direction. Right now, we're generating things that are common household scenes, but you can imagine extending this to create fantastical environments as well.