You CAN Teach a Robot New Tricks

Watch this video to see researchers at MIT's CSAIL use vision models to teach robots their limits, enabling them to complete more intricate and diverse tasks safely and correctly.

“LLMs and classical robotics systems like task and motion planners can’t execute these kinds of tasks on their own, but together, their synergy makes open-ended problem-solving possible,” says PhD student Nishanth Kumar SM ’24, co-lead author of a new paper about PRoC3S. “We’re creating a simulation on-the-fly of what’s around the robot and trying out many possible action plans. Vision models help us create a very realistic digital world that enables the robot to reason about feasible actions for each step of a long-horizon plan.”

Transcript

00:00:00 (logo whooshing) (gentle music) - Recently, there's been a lot of excitement about using large language models and vision-language models, these large-scale machine learning models that have been trained on the entire internet, for robotics tasks. So there are all these constraints, from kinematics to reachability to collision.
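
To make these constraints concrete, here is a minimal Python sketch of how kinematic limits, reachability, and collision checks might be expressed as predicates over a simulated robot. The `Simulator` interface and its methods are hypothetical placeholders for illustration, not an API from the paper.

```python
# Illustrative sketch only: the Simulator methods below are hypothetical
# placeholders, not a real PRoC3S or simulator API.
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Pose:
    """A target end-effector position in the robot's workspace."""
    x: float
    y: float
    z: float


class Simulator:
    """Stand-in for a physics/kinematics simulation of the robot's surroundings."""

    def inverse_kinematics(self, pose: Pose) -> Optional[Sequence[float]]:
        """Return a joint configuration reaching `pose`, or None if none exists."""
        raise NotImplementedError

    def within_joint_limits(self, joint_config: Sequence[float]) -> bool:
        """True if every joint value respects the robot's kinematic limits."""
        raise NotImplementedError

    def in_collision(self, joint_config: Sequence[float]) -> bool:
        """True if the robot at `joint_config` collides with itself or the scene."""
        raise NotImplementedError


def is_reachable(sim: Simulator, pose: Pose) -> bool:
    # Reachability: some joint configuration places the end effector at the pose.
    return sim.inverse_kinematics(pose) is not None


def is_feasible(sim: Simulator, pose: Pose) -> bool:
    # A target pose is usable only if it is reachable, respects joint limits,
    # and does not put the robot in collision with the scene.
    config = sim.inverse_kinematics(pose)
    return (
        config is not None
        and sim.within_joint_limits(config)
        and not sim.in_collision(config)
    )
```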

00:00:32 If we don't properly consider these constraints, the robot is gonna fail at its task. The research challenge we're interested in is doing decision-making with these models, but still considering these important constraints that occur on the robot. - In this work, we developed a system called PRoC3S. It uses a language model to establish a block of code. In this block of code, it tells the robot what to do in a procedural manner.
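
The video doesn't show the generated code itself, but a “block of code” of this kind can be read as a parameterized plan: a procedure whose continuous arguments are left free, so that every choice of arguments yields one concrete action sequence. Below is a hypothetical sketch of what such a block might look like for a goal like “draw a star”; the `MoveTo`/`DrawLine` action types and the function name are assumptions for illustration, not actual PRoC3S output.

```python
# Hypothetical example of an LLM-proposed parameterized plan ("block of code").
# The action types and the function signature are illustrative assumptions.
import math
from dataclasses import dataclass
from typing import List, Union


@dataclass
class MoveTo:
    """Move the pen/end effector to (x, y) without drawing."""
    x: float
    y: float


@dataclass
class DrawLine:
    """Draw a straight stroke from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


Action = Union[MoveTo, DrawLine]


def draw_star_plan(center_x: float, center_y: float, radius: float) -> List[Action]:
    """One member of an infinite family of plans: a five-pointed star whose
    center and size are free continuous parameters chosen later by the solver."""
    # Stepping the angle by 144 degrees visits the pentagon's vertices in the
    # order that traces a five-pointed star.
    points = []
    for i in range(5):
        angle = math.pi / 2 + i * (4 * math.pi / 5)
        points.append((center_x + radius * math.cos(angle),
                       center_y + radius * math.sin(angle)))
    actions: List[Action] = [MoveTo(*points[0])]
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        actions.append(DrawLine(x0, y0, x1, y1))
    return actions
```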

00:00:56 So once we have a goal that we want the robot to satisfy, we query the language model for a block of code that defines an infinite family of solutions. This infinite family of solutions is then searched through using a constraint satisfier. And what this constraint satisfier does is basically test out different possible sequences of actions in simulation and validate their correctness under the set of constraints.
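
Putting the pieces together, the search described here might look roughly like the loop below: sample values for the plan's free parameters, instantiate a concrete action sequence, and keep the first candidate that passes every constraint check in simulation. The uniform sampling, the bounds, and the helper types (echoing the sketches above) are assumptions, not the exact procedure from the paper.

```python
# Illustrative sketch of a sampling-based constraint satisfier that searches
# the infinite family of plans defined by a parameterized plan generator.
# The Simulator/Action names echo the hypothetical sketches above.
import random
from typing import Callable, List, Optional, Sequence, Tuple

Plan = List["Action"]                       # one concrete sequence of actions
PlanGenerator = Callable[..., Plan]         # the LLM's parameterized "block of code"
Constraint = Callable[["Simulator", Plan], bool]  # e.g. reachability, collision, goal test


def satisfy(
    plan_generator: PlanGenerator,
    parameter_bounds: Sequence[Tuple[float, float]],
    constraints: Sequence[Constraint],
    sim: "Simulator",
    max_samples: int = 1000,
) -> Optional[Plan]:
    """Sample continuous parameters, instantiate concrete plans, and return the
    first plan that every constraint validates in simulation (or None)."""
    for _ in range(max_samples):
        # Draw one candidate setting of the plan's free continuous parameters.
        params = [random.uniform(lo, hi) for lo, hi in parameter_bounds]
        candidate = plan_generator(*params)
        # Each constraint is expected to roll the candidate out in simulation
        # and report whether it stays reachable, collision-free, and on-goal.
        if all(constraint(sim, candidate) for constraint in constraints):
            return candidate
    # Sample budget exhausted; a full system might re-query the language model
    # with feedback about why the sampled plans failed.
    return None
```

With the earlier drawing sketch, a call like `satisfy(draw_star_plan, [(0.2, 0.6), (-0.2, 0.2), (0.05, 0.15)], constraints, sim)` would keep proposing star centers and sizes until it finds one the arm can trace without violating any constraint; the bounds here are made-up workspace limits, not values from the paper.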

00:01:17 - So for experiments, our goal was to test the range of tasks that we could get our method to solve on a real robot. Importantly, we were interested in drawing and general manipulation tasks that the robot should be able to solve in a zero-shot fashion. So it hasn't seen these tasks ahead of time. And the kinds of tasks we tested were things like draw a star or some other shape and take a bunch of objects like fruits, blocks,

00:01:43 bowls, everyday objects, and pack them into configurations like lines, et cetera. We noticed that a lot of the baselines would violate constraints: they would knock into other objects, they wouldn't understand the goal and wouldn't actually accomplish it, or they would fail to reach objects. Our approach, by contrast, explicitly reasons about these constraints, so

00:02:02 we're able to solve these tasks at a much higher success rate. So here, we're gonna see the robot try to solve a task where we've told it to “stack all the blocks into a bowl of the same color.” - One implication of this work is that we can now trust PRoC3S, so we can put robots in an arbitrary environment and give them an arbitrary goal, and have them operate in that environment in a safe manner,

00:02:26 satisfying all of the constraints that the human set forth beforehand. So in future work, we hope to extend our method to settings that are partially observable and involve mobile robots. - This recipe of combining large language models with classical techniques from robotics, like constraint satisfaction, in one integrated system is really powerful. It lets us solve tasks that are open-ended,

00:02:48 but also maintain trust and safety. In the long term, this could enable us to solve tasks in household environments, like cooking and cleaning, and assist users in their everyday lives.