Let's say two vehicles are heading right at each other down a one-way street.
If you're behind the wheel in this kind of tight, challenging driving scenario, you can negotiate with the parties nearby. You can pull over to the side of the road and then motion for the driver ahead to pull through the narrow lane. Through interaction, you can work out maneuvers that keep everybody safe and on their way to their destinations.
A self-driving car faces a tougher challenge: it must somehow understand nearby drivers and their willingness to play nice.
A new algorithm under development can guide an autonomous vehicle through tough traffic on a crowded, narrow street.
The algorithm, built by researchers at the Carnegie Mellon University Argo AI Center for Autonomous Vehicle Research, makes its decisions by modeling different levels of driver cooperativeness — how likely a driver is to pull over to let another driver pass.
With "Multi-Agent Reinforcement Learning," or MARL, the team, led by researcher Christoph Killing, got autonomous vehicles to exhibit human-like behaviors, including defensive driving and interpreting the behavior of other agents — in simulation, so far.
The algorithm has not been used on a vehicle in the real world, but the results are promising, thanks to the model's reward-based system.
"We incentivize interactions with safety in mind," said Killing, a former visiting research scholar in the School of Computer Science's Robotics Institute and now part of the Autonomous Aerial Systems Lab at the Technical University of Munich.
{youtube} https://www.youtube.com/watch?v=5njRSHcHMBk {/youtube}
In a short Q&A with Tech Briefs below, Christoph explains more about how his team's incentive-based model navigates tough traffic situations, where there are no official rules of the road.
Tech Briefs: Would you characterize your model as more cooperative or aggressive, when navigating a challenge that requires a little bit of both?
Christoph Killing: As in any driving scenario, autonomous vehicles should put safety first and follow all traffic rules. However — and this is the beauty and challenge of the scenario considered — there exist no coordinating traffic rules in this kind of scenario (in contrast to 4-way stop intersections, for example). Two vehicles of equal right of way have to negotiate, essentially, who goes first and who waits.
If both vehicles are purely focused on safety, they will both pull over. The key challenge we were faced with in our research was: How do we make one vehicle pull over and the other go — not both pull over, not both go — when each makes its own decisions without any coordinating authority?
We incentivize interactions with safety in mind; crashing at speed is worse than timing out — but time-outs also result in a small penalty to incentivize agents to learn to interact and pass by each other.
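To make that incentive structure concrete, the minimal sketch below shows one way such a cooperativeness-parameterized reward could look. The function name, the constants, and the exact blending are illustrative assumptions, not the paper's values; only the shape follows the description in this article and the project video (cooperativeness is drawn between 0 and 0.5 per episode, an agent with c = 0 is rewarded only for its own progress, a highly cooperative agent mostly cares that somebody goes, and crashing is penalized far more heavily than timing out).

```python
# Minimal, illustrative sketch of a cooperativeness-parameterized reward.
# Constants and the exact blending are assumptions, not the paper's values.

def step_reward(own_speed: float, other_speed: float, cooperativeness: float,
                crashed: bool, timed_out: bool) -> float:
    """Reward for one agent at one time step.

    cooperativeness = 0.0 -> agent is rewarded only for its own progress
    cooperativeness = 0.5 -> agent is indifferent to *who* makes progress
    (cooperativeness is drawn uniformly from [0, 0.5] once per episode)
    """
    if crashed:
        # Crashing is penalized far more heavily than timing out; scaling with
        # speed reflects "crashing at speed is worse" (illustrative choice).
        return -10.0 * (1.0 + own_speed)
    if timed_out:
        # Small penalty so that waiting forever is not an attractive strategy.
        return -1.0

    own_progress = own_speed
    joint_progress = max(own_speed, other_speed)   # "somebody goes"
    w = 2.0 * cooperativeness                      # maps [0, 0.5] onto [0, 1]
    return (1.0 - w) * own_progress + w * joint_progress
```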
Tech Briefs: What are the main parameters that your model uses to execute the drive? What criteria does the algorithm base its decisions on?
Christoph Killing: Our algorithm perceives what would be available on an actual car. We have distance and relative velocity measurements around the front of the car (see Fig. 2 in the report here). Notably, compared to related work, we do not use a bird's-eye view of the scenario but an egocentric perspective. This makes it a little bit trickier, since we now have blind spots. This observation is augmented by further parameters, such as the cooperativeness mentioned above, to tell the agent how aggressively to behave, but also the current steering angle and throttle position (which you would also know when driving this scenario yourself).
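As a rough illustration of what such an egocentric observation vector might look like, the sketch below concatenates range and relative-velocity readings covering the front of the car with the agent's own cooperativeness, steering angle, and throttle position. The layout, array sizes, and names are assumptions made for illustration; the actual observation space is specified in Fig. 2 of the paper.

```python
import numpy as np

def build_observation(ranges, relative_velocities, cooperativeness,
                      steering_angle, throttle):
    """Assemble one egocentric observation vector (illustrative layout only).

    ranges:              distances measured along rays covering the front of the car
    relative_velocities: relative speed of whatever each ray hits
    cooperativeness:     the agent's *own* cooperativeness (the opponent's is unknown)
    steering_angle, throttle: the agent's current actuation state
    """
    return np.concatenate([
        np.asarray(ranges, dtype=np.float32),
        np.asarray(relative_velocities, dtype=np.float32),
        np.asarray([cooperativeness, steering_angle, throttle], dtype=np.float32),
    ])
```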
Tech Briefs: What is still challenging for the algorithm to get right?
Christoph Killing: There are two main challenges: overly aggressive pairings and overly passive pairings. (Compare the visualizations here.) Notably, our policies are able to negotiate the scenario most of the time. Yet, human passengers might be quite unhappy with their cars doing some of the maneuvers shown here.
Tech Briefs: What does the algorithm do when it's clear that an opposing driver is being an aggressive, "bad" driver? Or an overly "cooperative" driver?
Christoph Killing: We test our driving policies by assigning a cooperativeness value to each vehicle, telling it how aggressively to behave. Each only knows its own cooperativeness, not that of the opposing car. These cooperativeness values translate to driving behaviors in a quite straightforward manner: An uncooperative driver is only interested in its own progress. A highly cooperative driver doesn't mind which vehicle makes progress first, as long as somebody goes. These values are fixed throughout the interaction.
(We do not consider "losing your temper." I am not going to deep-dive here, but let's just leave it at "for mathematical reasons.")
Tech Briefs: Does part of the model require a kind of “read” of the opposing driver?
Christoph Killing: A word about the “read”: In robotics, we distinguish between the state of the world (i.e., the planet Earth as it is right now) and an observation. Our vehicles do not have a memory module. So, how do we deal with things we do not see at the moment?
Let's say, for instance, that you are on a Zoom call with somebody. You perceive a partial observation of the planet Earth, so to speak. The other party takes a coffee mug from outside the field of view of their camera, takes a sip, and puts it back down outside their camera's field of view. If you only take into consideration the very last observation you made after the mug was put down and are asked what they are drinking, you simply do not know (because there is no memory). Yet, if you stack together (we call it "concatenate") several observations from the past few seconds, you can infer something about the state of the world, as you then see the mug being moved over several frames. Based on how rapidly they move it, you might even be able to tell something about their mood.
Equally, in our scenario, each car only knows the other agent based on what it can observe from the observation space (shown in Fig. 2 in the paper). Internal states (the cooperativeness value of the other car, for example) are unknown. We concatenate several of those partial observations of each vehicle to allow them to implicitly form a belief about how cooperative the other vehicle might be. We don't do this manually but have the deep neural network, the artificial intelligence, absorb the task. This neural net also has to learn the answer to your question, namely what to do after it notices a certain aggressiveness or overly cooperative behavior.
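The "stacking together" of observations that Killing describes is essentially frame stacking. Below is a minimal sketch of that idea; the class name, the stack depth, and the padding behavior are assumptions, not details taken from the paper.

```python
from collections import deque
import numpy as np

class ObservationStack:
    """Keep the last k egocentric observations and feed them jointly to the policy,
    so the network can implicitly infer unobserved quantities (e.g. how cooperative
    the other vehicle seems) from how the scene changes over time."""

    def __init__(self, k: int = 4):
        self.k = k
        self.frames = deque(maxlen=k)

    def reset(self, first_obs: np.ndarray) -> np.ndarray:
        self.frames.clear()
        for _ in range(self.k):                    # pad with the first frame at episode start
            self.frames.append(first_obs)
        return np.concatenate(list(self.frames))

    def step(self, obs: np.ndarray) -> np.ndarray:
        self.frames.append(obs)                    # oldest frame drops out automatically
        return np.concatenate(list(self.frames))   # network input: k stacked observations
```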
Tech Briefs: How does the model note an "aggressive" or "cooperative" behavior, and respond accordingly?
Christoph Killing: An overly aggressive agent might, for instance, just proceed right into this bottleneck of the scenario, essentially forcing the other agent to wait. An overly cooperative agent would — as soon as the full extent of the bottleneck is perceivable by its sensors — slow down and wait. Here our policy is trained to immediately select the complementary action: detect a slow-down and go, or vice versa.
Tech Briefs: What’s next for this research?
Christoph Killing: Plenty of things — three major points. Firstly, the current work considers an autonomous vehicle confronted only with another autonomous vehicle. We will need to extend this to an autonomous vehicle confronted with a human driver and see how well we do cooperating with humans. Secondly, in our work vehicles can only move forward; we do not allow reversing. However, reversing could help recover from situations where we are stuck. Thirdly, our work is currently simulation only. Transferring it to a real-world system is a major step we need to take at some point.
Transcript
00:00:03 Welcome to our presentation on learning to robustly negotiate bi-directional lane usage in high-conflict driving scenarios, and thank you for your interest in our work. The scenario we address today is depicted on the right. Consider an autonomous agent perceiving such a scenario: it intends to drive down the road shown, but another vehicle is coming towards it. Parked cars locally reduce the width of the available space. Human drivers would usually manage to negotiate the scenario successfully.
00:00:50 When formalizing our scenario, the intent and degree of collaboration of the other road users remain unknown. There also exist no coordinating traffic rules, and there are no trivial or automatic solutions. Our goal, therefore, is to enable agents to robustly negotiate with opponents of unknown cooperativeness. Key design choices include the pure usage of local, egocentric observations, the absence of communication, and no centralized control structure. We will approach the problem using multi-agent maximum entropy reinforcement learning.
00:01:12 But let us first look at the model of our problem. We consider the resulting interactions of two agents to be a behavior-level Markov game. To model this, we differentiate decision making into three levels, each with a different time horizon, following Michon. At the strategic level, the long-term goals of each vehicle are formulated, such as traversing the scenario with a certain cooperativeness c. At the core of our approach is the behavior level, where controlled high-level action patterns are generated. On the control level, these behaviors are translated into steering and throttle commands using conventional methods. These can follow a selected behavior for several consecutive time steps, which ensures continuous interaction of our agents.
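One way to picture the three levels in code is sketched below: a strategic cooperativeness value fixed per episode, a behavior-level policy that picks a high-level maneuver, and a control level that turns the chosen maneuver into steering and throttle commands. The behavior names match those introduced later in the talk; the class structure and interfaces are illustrative assumptions rather than the authors' implementation.

```python
from enum import Enum

class Behavior(Enum):
    # High-level behaviors introduced later in the talk.
    FOLLOW_SHARED_LANE = 0
    PULL_OVER = 1
    STOP_IN_SHARED_LANE = 2

class HierarchicalAgent:
    """Illustrative sketch of the three decision levels (strategic / behavior / control)."""

    def __init__(self, cooperativeness, behavior_policy, controller):
        self.cooperativeness = cooperativeness  # strategic level: long-term goal, fixed per episode
        self.behavior_policy = behavior_policy  # behavior level: learned with multi-agent RL
        self.controller = controller            # control level: conventional steering/throttle control

    def act(self, observation):
        behavior = self.behavior_policy(observation, self.cooperativeness)
        # The controller can follow the selected behavior for several consecutive
        # control steps, which keeps the interaction continuous.
        return self.controller.track(behavior)  # -> (steering, throttle)
```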
00:02:20 In our scenario, we have multiple agents interacting with the same environment. Considering the scheduling of actions, one usually models either a simultaneous game, where all agents select actions at the same time, or a turn-taking game. To resemble the unknown timing of decisions and to remove any prior knowledge about the action schedule, we instead randomize the behavior generation frequency to an expected value of 4 Hz, while low-level controls run at 20 Hz.
00:02:41 To translate varying degrees of cooperativeness into varying policy behaviors, we use a parameterized reward function r(c). At the beginning of each episode, the cooperativeness of each vehicle is randomly drawn from a uniform distribution between 0 and 0.5. Uncooperative agents (c = 0) are consequently only rewarded for their own velocity, while highly cooperative agents are indifferent to which vehicle makes progress first. Each agent only knows its own cooperativeness, not that of the opposing vehicle; this formulates a general-sum game.
00:03:29 Finally, let me introduce the high-level behaviors used in our work. We allow agents to pick an action from: follow the shared lane, pull over, or stop in the shared lane. This also introduces some notion of explainability into our approach.
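A minimal sketch of this setup, purely for illustration: cooperativeness is drawn once per vehicle per episode from U(0, 0.5), low-level controls tick at 20 Hz, and the number of control steps until the next behavior decision is randomized so that decisions occur at an expected 4 Hz. The geometric sampling below is an assumption; the talk only fixes the expected value.

```python
import random

CONTROL_HZ = 20           # low-level steering/throttle commands
EXPECTED_BEHAVIOR_HZ = 4  # high-level behavior decisions, on average

def sample_cooperativeness() -> float:
    # Drawn once per vehicle at the start of each episode.
    return random.uniform(0.0, 0.5)

def sample_steps_to_next_decision() -> int:
    """Control steps until the next behavior decision.

    A geometric distribution with p = 4/20 has mean 5 steps, i.e. an expected
    decision rate of 4 Hz on top of the 20 Hz control loop (illustrative choice).
    """
    p = EXPECTED_BEHAVIOR_HZ / CONTROL_HZ
    steps = 1
    while random.random() > p:
        steps += 1
    return steps
```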
00:03:53 The corresponding observation space is based on a high-dimensional, noise-free environment, represented here on the right. Notably, it scales the impact an opposing agent has on the ego agent by distance.
00:04:20 Our approach to solving the presented problem is Discrete Asymmetric Soft Actor-Critic. Why do we need it? In reinforcement learning, we usually have one agent interacting with its environment, and the environment always returns the reward according to the same mechanisms; in other words, it is doing the same thing.
00:04:42 In multi-agent reinforcement learning this is no longer true. Other agents become part of an agent's environment, but their behavior changes throughout the learning process; therefore, environments are no longer stationary. Related work in discrete action spaces addresses this issue using decentralized training or by exploiting the symmetry of the reward function in fully competitive or fully cooperative games.
00:05:04 Other solutions in continuous action spaces use centralized training with decentralized execution. We bring this approach to discrete action spaces using soft policy iteration. The centralized training increases the stability of our learning process, while maximum entropy reinforcement learning allows our agents to learn multiple solution modes. Sample efficiency is achieved through off-policy learning.
00:05:46 We train two neural networks: a Q-function based on the state, and a policy based on the observation. The Boltzmann policy is proportional to the exponential of the Q-function. We then minimize the KL divergence between the state-based Boltzmann policy and the local policy, which can also be thought of as behavior cloning on local observations, and we iterate over this process several times.
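In symbols, the step just described might be written as follows (a paraphrase of the narration, with α denoting the entropy temperature; the notation is ours, not taken from the paper): the centralized, state-based critic induces a Boltzmann policy, and the decentralized policy, which only sees the local observation o, is trained to match it.

```latex
% Boltzmann policy induced by the centralized, state-based Q-function
\pi_B(a \mid s) \;=\; \frac{\exp\!\big(Q(s,a)/\alpha\big)}{\sum_{a'} \exp\!\big(Q(s,a')/\alpha\big)}

% Distillation: the observation-based policy \pi_\theta is trained to match \pi_B,
% which can be read as behavior cloning on local observations
\min_{\theta}\; \mathbb{E}\Big[\, D_{\mathrm{KL}}\!\big(\pi_B(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid o)\big) \Big]
```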
00:06:10 Now let us briefly look at a video of three resulting interactions. Our agents are in self-play, both controlled by the same policy and in the same environment; the only variable parameter is the cooperativeness of the blue agent. At the very top, a highly cooperative blue agent is faced with an uncooperative red agent. In the middle, interactions with an averagely cooperative blue agent are shown. The interactions of two rather uncooperative agents are shown at the bottom. While the videos play, please pay attention to how much the overall interactions change by varying just one parameter, namely the cooperativeness of the blue vehicle. Quite clearly, our approach of modeling varying driver behaviors through a parameterized reward function works in an intuitively explainable way.
00:07:43 We are now going to compare various approaches and conduct a large-scale analysis of the effects just observed. As this scenario has not previously been addressed, we devised two rule-based decision makers as baselines; you can find more details in the paper. We compare our DASAC agent against stabilized DQN and SQL (soft Q-learning) policies to evaluate the effect of centralized training and maximum entropy reinforcement learning. We train all policies in self-play and use prioritized experience replay, dueling network architectures, and target networks for all approaches. Quite clearly, DASAC performs best, with success rates above 99 percent.
00:08:34 Details on unencountered policies and the curriculum learning approach can be found in the paper, too. So why do we report performance and spread? We are interested in finding policies that are highly robust towards any cooperativeness of the opposing vehicle. Consequently, the performance of a policy is reported as the rate of successful traversals at the best-performing cooperativeness of the ego vehicle.
00:08:58 Consider the graphic shown for a DQN policy. Vertically, we depict the rate of successful traversals; horizontally, we show the cooperativeness of the ego vehicle, increasing from the left. Since we are purely interested in robustness, which vehicle is the ego vehicle, red or blue, and which is the opposing one is irrelevant for this evaluation. Encoded in colour are the cooperativeness values of the opposing vehicle. We find the highest performance at a medium cooperativeness of the ego vehicle. The spread then quantifies the drop in performance towards less favorable pairings.
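To make these two numbers concrete, here is a small sketch of how performance and spread could be computed from a table of success rates indexed by the ego vehicle's cooperativeness and the opponent's cooperativeness. The aggregation details (averaging over opponents, measuring the drop within the best column) are illustrative assumptions; the precise definitions are given in the paper.

```python
import numpy as np

def performance_and_spread(success_rate: np.ndarray):
    """success_rate[i, j]: rate of successful traversals with the i-th ego
    cooperativeness value against the j-th opponent cooperativeness value.

    performance: success rate at the best-performing ego cooperativeness
    spread:      drop from that performance towards the least favorable
                 pairing at the same ego cooperativeness (illustrative definition)
    """
    mean_per_ego = success_rate.mean(axis=1)    # average over opponent pairings
    best = int(np.argmax(mean_per_ego))
    performance = float(mean_per_ego[best])
    spread = performance - float(success_rate[best].min())
    return performance, spread
```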
00:09:45 For our DASAC, one can observe results that are a lot more stable when it comes to extreme pairings. We attribute this to the centralized training, for which policy behaviors in the extreme pairings of driver cooperativeness are critical. And where does DASAC improve most upon our learned baselines? In the following, we show two corner cases. First, consider two uncooperative agents: each is interested in maximizing its own progress, and some human passengers might be rather unhappy with the resulting interactions.
00:10:33 Second, consider two highly cooperative agents: these do not have a preference on which agent makes progress first, and this extremely cautious and indecisive approach is equally inefficient.
00:11:35 In conclusion, we introduced a new scenario challenging the social component of automated vehicles. Our policies are highly robust towards unobservable degrees of cooperativeness of the opposing vehicle, towards previously unencountered policies, and also towards the unknown timing of opponent decisions. Our success rates outperform related work on multi-agent reinforcement learning for automated driving. Should you be interested in finding out more, you can find further information on our project page. Thank you for your attention.

