(Image: Robin Pierre)

A team of researchers at North Carolina State University has developed a technique that allows artificial intelligence (AI) programs to better map three-dimensional (3D) spaces using two-dimensional (2D) images captured by multiple cameras. Because the technique works effectively with limited computational resources, it holds promise for improving the navigation of autonomous vehicles.

“Most autonomous vehicles use powerful AI programs called vision transformers to take 2D images from multiple cameras and create a representation of the 3D space around the vehicle,” said Corresponding Author Tianfu Wu, Ph.D., Associate Professor of Electrical and Computer Engineering at North Carolina State University. “However, while each of these AI programs takes a different approach, there is still substantial room for improvement.

“Our technique, called Multi-View Attentive Contextualization (MvACon), is a plug-and-play supplement that can be used in conjunction with these existing vision transformer AIs to improve their ability to map 3D spaces. The vision transformers aren’t getting any additional data from their cameras; they’re just able to make better use of the data,” said Wu.

MvACon works by modifying an approach called Patch-to-Cluster attention (PaCa), which Wu and his collaborators released last year. PaCa allows transformer AIs to identify objects in an image more efficiently and effectively.
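For readers who want a concrete picture, here is a minimal PyTorch sketch of the patch-to-cluster idea; the class name, the simple linear soft assignment, and all dimensions are illustrative assumptions rather than the published PaCa implementation.

```python
import torch
import torch.nn as nn

class PatchToClusterAttention(nn.Module):
    """Sketch of patch-to-cluster attention (PaCa-style), assuming a simple
    learned soft assignment. Instead of every patch attending to every other
    patch (quadratic in the number of patches N), patches attend to a small
    set of M cluster tokens pooled from the patches, giving O(N * M) cost."""

    def __init__(self, dim, num_clusters=49, num_heads=8):
        super().__init__()
        self.to_assign = nn.Linear(dim, num_clusters)  # soft patch-to-cluster scores
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patches):  # patches: (batch, N, dim)
        # Soft-assign patches to clusters, then pool them into M cluster tokens.
        assign = self.to_assign(patches).softmax(dim=1)            # (batch, N, M)
        clusters = torch.einsum("bnm,bnd->bmd", assign, patches)   # (batch, M, dim)
        # Queries are the N patches; keys/values are the far fewer cluster tokens.
        out, _ = self.attn(patches, clusters, clusters)
        return out

# Example: features from one camera view, 1,600 patches of dimension 256.
feats = torch.randn(1, 1600, 256)
print(PatchToClusterAttention(dim=256)(feats).shape)  # torch.Size([1, 1600, 256])
```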

“The key advance here is applying what we demonstrated with PaCa to the challenge of mapping 3D space using multiple cameras,” Wu said.

To test the performance of MvACon, the researchers used it in conjunction with three leading vision transformers: BEVFormer, the BEVFormer DFA3D variant, and PETR. In each case, the vision transformers were collecting 2D images from six different cameras. In all three instances, MvACon significantly improved the performance of each vision transformer.

“Performance was particularly improved when it came to locating objects, as well as the speed and orientation of those objects,” said Wu. “And the increase in computational demand of adding MvACon to the vision transformers was almost negligible.

“Our next steps include testing MvACon against additional benchmark datasets, as well as testing it against actual video input from autonomous vehicles. If MvACon continues to outperform the existing vision transformers, we’re optimistic that it will be adopted for widespread use.”

Here is an exclusive Tech Briefs interview with Wu, edited for length and clarity.

Tech Briefs: What was the biggest technical challenge you faced while developing this technology?

Wu: There are several challenges, but one of the most challenging parts is that we only have multiple views, and each one of them is just 2D. We want to do 3D object detection, so one of the most challenging technical aspects is how we can lift the 2D inputs into 3D in an effective and efficient way. That’s the challenge we tried to address, building on top of some of the existing pipelines people have developed for 3D object detection.

Tech Briefs: Can you explain in simple terms how it works?

Wu: For example, if we have multiple 2D views and we want to lift 2D to 3D, that means we want to come up with a unified representation, such that every position in the 3D representation can talk to every position in the 2D views. That’s why we developed the technique; we used one of our previous methods, called PaCa, to fuse the 2D visual features into the 3D space, such that the 2D-to-3D lifting can be much more robust.
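As a rough illustration of what Wu describes, the sketch below (again PyTorch, with hypothetical names and shapes, not the released MvACon code) compresses each camera view into a handful of cluster tokens and lets every 3D query attend to the cluster tokens from all views before the lifting step.

```python
import torch
import torch.nn as nn

class MultiViewContextualizer(nn.Module):
    """Illustrative sketch (not the published MvACon code): each camera view is
    compressed into a few cluster tokens, and every 3D/BEV object query then
    attends to the cluster tokens from all views, so each 3D position can
    "talk to" every 2D view before the 2D-to-3D lifting step."""

    def __init__(self, dim=256, num_clusters=32, num_heads=8):
        super().__init__()
        self.to_assign = nn.Linear(dim, num_clusters)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, view_feats, queries_3d):
        # view_feats: (batch, views, N, dim) 2D features from the cameras
        # queries_3d: (batch, Q, dim) 3D/BEV queries from the base detector
        assign = self.to_assign(view_feats).softmax(dim=2)               # (B, V, N, M)
        clusters = torch.einsum("bvnm,bvnd->bvmd", assign, view_feats)   # (B, V, M, dim)
        clusters = clusters.flatten(1, 2)                                # (B, V*M, dim)
        # Each 3D query gathers context from every view's cluster tokens.
        out, _ = self.cross_attn(queries_3d, clusters, clusters)
        return queries_3d + out  # residual update of the 3D queries

# Example: 6 cameras with 1,600 patches each, and 900 3D object queries.
feats = torch.randn(1, 6, 1600, 256)
queries = torch.randn(1, 900, 256)
print(MultiViewContextualizer()(feats, queries).shape)  # torch.Size([1, 900, 256])
```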

Tech Briefs: How did this and your prior work come about? What was the catalyst for these projects?

Wu: PaCa is a new architecture for the transformer model. When we worked on PaCa, we focused only on improving the efficiency of the transformer model, which was originally developed for language and often suffers from so-called quadratic complexity. So, we addressed that challenge in PaCa. That work was done by another student.

For this new work, the student tried to address the 2D-to-3D lifting challenge. Then we realized our previous PaCa could be adapted for this 2D-to-3D lifting purpose.

Tech Briefs: You’re quoted in the article I read as saying, ‘Our next steps include testing MvACon against additional benchmark datasets, as well as testing it against actual video input from autonomous vehicles.’ Do you have plans for this further testing?

Wu: We want to scale it up in two ways. On one of the benchmarks we can already run on the full data set, but there’s another benchmark, Waymo. In our paper, we were only able to test it on Waymo-Mini; it’s kind of like a subset of Waymo created for academic exploration. So, one, we’ll try to test on the full Waymo data set to see whether our method is applicable to different benchmarks. The other aspect we want to scale up is whether we can do this cross-benchmark transfer. For example, can we make the model we learned work better on Waymo, such that we need less data to train on Waymo and still improve the testing performance?

That's one aspect. Another aspect is that, for this 3D object detection, we have two previous works, and we want to combine them all together.

Right now, we are still in the testing phase. The student has graduated, so we have a new student coming in. She will work on this, but we have a plan. We will try to do step-by-step experiments and verification as needed.

Tech Briefs: You’re also quoted in the article I read as saying, ‘If MvACon continues to outperform the existing vision transformers, we’re optimistic that it will be adopted for widespread use.’ How soon do you think we could see it adopted?

Wu: To be honest, we don’t have the answer, but we will try our best to push this unique component, which is a component of almost all autonomous driving pipelines, especially camera-based ones. That particular part is essential to almost all methods. And that’s why we believe that once we’re able to show it at larger scale across benchmarks, and if we can consistently show improved performance, not only numerically but also qualitatively, we can also improve the interpretability, because now we know how 2D gets lifted to 3D thanks to PaCa. So, when all of that is accomplished, we will try our best to sell it and see whether it can be adopted by an existing company.