A software system was developed that helps robots act more effectively on spoken instructions, no matter how abstract or specific those instructions may be. People naturally give commands that range from simple and straightforward to complex ones that imply many subtasks.
The issue addressed in this work is language grounding, in which a robot takes natural language commands and generates behaviors that successfully complete a task. Commands, however, come at different levels of abstraction, which can cause a robot to plan its actions inefficiently or fail to complete the task at all.
For example, imagine someone in a warehouse working side-by-side with a robotic forklift. The person might say “grab that pallet” to the robotic partner. That highly abstract command implies a number of smaller sub-steps: lining up the lift, putting the forks underneath the pallet, and hoisting it up. Other common commands might be more fine-grained, involving only a single action: “Tilt the forks back a little,” for example.
Those different levels of abstraction can cause problems for current robot language models. Most models try to identify cues from the words in the command as well as the sentence structure, and then infer a desired action from that language. The inference results then trigger a planning algorithm that attempts to solve the task. But without taking into account the specificity of the instructions, the robot might over-plan for simple instructions, or under-plan for more abstract instructions that involve more sub-steps. That can result in incorrect actions or an overly long planning lag before the robot takes action.
The new system adds a level of sophistication to existing models. Besides inferring a desired task from language, it also analyzes the language to infer the command's level of abstraction. That results in faster task execution compared to existing methods.
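To make the idea of inferring an abstraction level concrete, here is a minimal sketch in Python. It is not the researchers' model: it swaps in a tiny naive Bayes word-count classifier as a stand-in, and the training sentences, the labels `"high"` and `"low"`, and the function names `train` and `infer_level` are all assumptions made for illustration.

```python
import math
from collections import Counter

# Hypothetical training pairs (command, abstraction level).
# The phrasings echo this article; the labels are our own invention.
TRAINING = [
    ("take the chair to the blue room", "high"),
    ("move the chair to the red room", "high"),
    ("go to the blue room", "high"),
    ("take five steps north", "low"),
    ("turn right and take two more steps", "low"),
    ("turn left then take five steps south", "low"),
]

def train(examples):
    """Count how often each word appears at each abstraction level."""
    word_counts = {}
    level_counts = Counter()
    for text, level in examples:
        level_counts[level] += 1
        word_counts.setdefault(level, Counter()).update(text.lower().split())
    return word_counts, level_counts

def infer_level(command, model):
    """Return the most likely level under naive Bayes with add-one smoothing."""
    word_counts, level_counts = model
    vocab = {word for counts in word_counts.values() for word in counts}
    total = sum(level_counts.values())
    best_level, best_score = None, -math.inf
    for level, counts in word_counts.items():
        score = math.log(level_counts[level] / total)
        denom = sum(counts.values()) + len(vocab)
        for word in command.lower().split():
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_level, best_score = level, score
    return best_level

model = train(TRAINING)
print(infer_level("take the chair to the red room", model))
print(infer_level("take two steps north then turn left", model))
```

In this toy setup, words such as "room" and "chair" pull a command toward the high level, while "steps" and "turn" pull it toward the low level. A real system would need far more data and would infer the task as well as the level.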
To develop the new model, the researchers used Mechanical Turk, Amazon’s crowdsourcing marketplace, and a virtual task domain called Cleanup World. The online domain consists of a few color-coded rooms, a robotic agent, and an object that can be manipulated; in this case, a chair that can be moved from room to room.
Mechanical Turk volunteers watched the robot agent perform a task in the Cleanup World domain; for example, moving the chair from a red room to an adjacent blue room. Then the volunteers were asked to say what instructions they would have given the robot to get it to perform the task they just watched. The volunteers were given guidance as to the level of specificity their directions should have. The instructions ranged from the high-level (“take the chair to the blue room”) to the stepwise-level (“take five steps north, turn right, take two more steps, get the chair, turn left, turn left, take five steps south”). A third level of abstraction used terminology somewhere in between those two.
The volunteers’ spoken instructions were used to train the system to recognize what kinds of words are used at each level of abstraction. From there, the system learned to infer not only a desired action, but also the abstraction level of the command. Knowing both of those things, the system could then trigger its hierarchical planning algorithm to solve the task from the appropriate level. The system was then tested both in the virtual Cleanup World and with an actual Roomba®-like robot operating in a physical space similar to the Cleanup World domain. When the robot was able to infer both the task and the specificity of the instructions, it responded to commands in one second 90 percent of the time. In comparison, when no level of specificity was inferred, half of all tasks required 20 or more seconds of planning time.
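The benefit of planning from the appropriate level can be illustrated with a small sketch. The Python below is not the researchers' hierarchical planner; it is a simplified stand-in that runs the same breadth-first search over either a room-level map or a cell-level grid, depending on the inferred level. The room layout, the grid size, and the names `bfs`, `plan`, and `PLANNERS` are assumptions made for illustration.

```python
from collections import deque

def bfs(start, goal, neighbors):
    """Breadth-first search; returns (path, number of states expanded)."""
    frontier = deque([start])
    parent = {start: None}
    expanded = 0
    while frontier:
        state = frontier.popleft()
        expanded += 1
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1], expanded
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                frontier.append(nxt)
    return None, expanded

# Hypothetical room-level map: red, blue, and green rooms in a row.
ROOM_GRAPH = {"red": ["blue"], "blue": ["red", "green"], "green": ["blue"]}

def room_neighbors(room):
    return ROOM_GRAPH[room]

# Hypothetical cell-level grid with no interior walls.
WIDTH, HEIGHT = 15, 5

def cell_neighbors(cell):
    x, y = cell
    steps = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in steps if 0 <= a < WIDTH and 0 <= b < HEIGHT]

# Each abstraction level gets its own state space.
PLANNERS = {"high": room_neighbors, "low": cell_neighbors}

def plan(level, start, goal):
    """Plan in the state space that matches the inferred abstraction level."""
    return bfs(start, goal, PLANNERS[level])

high_path, high_expanded = plan("high", "red", "green")
low_path, low_expanded = plan("low", (0, 0), (14, 4))
print(high_path, high_expanded)
print(len(low_path), low_expanded)
```

The room-level search touches only a handful of states, while the cell-level search over the same space expands many more. That gap is the intuition behind starting the planner at the level the command implies rather than always planning from the finest-grained one.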