Google's driverless car, like many autonomous vehicles in development, is packed with sensors, radars, lasers, and cameras. To lighten the load of sensing devices in self-driving autos, researchers at Cambridge University have hit the streets with a system that uses only computer vision.
The imaging technologies — one that identifies objects and another that orients the objects in space — were tested throughout central Cambridge, UK. The Cambridge systems rely on supervised learning techniques to “see” and “understand” a location and surroundings.
The architecture that supports the localization and object-recognition capabilities, a deep convolutional encoder-decoder neural network, provides two separate outputs critical to autonomous robotics: where you are and what is around you.
SegNet: The ‘What’
A neural network uses training examples to infer rules. The deep convolutional neural network from Cambridge University takes an image and, from its pixels, extracts hierarchical characteristics. The network initially picks up low-level cues, like edges, corners, or shapes. At a “deeper” level, it learns the relationships between these traits to construct representations of objects: a tree, a bike tire, or a side mirror, for example.
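The low-level stage can be sketched in a few lines of NumPy: a convolution kernel slides over the image and responds where a cue such as a vertical edge is present. The Sobel kernel and the synthetic image here are illustrative, not the network’s learned filters.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a grayscale image with a kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A Sobel kernel responds strongly to vertical edges -- one of the
# low-level cues a first convolutional layer typically learns.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic image: dark left half, bright right half (a vertical edge).
img = np.zeros((5, 6))
img[:, 3:] = 1.0

response = conv2d(img, sobel_x)
print(response)  # nonzero only around the edge columns
```

In a trained network the kernels are learned from data rather than hand-chosen, and deeper layers combine many such responses into object-level representations.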
For the neural network to recognize objects, a process called semantic segmentation must assign pre-defined class labels to each of an image’s pixels. The network’s segmentation system, SegNet, classifies an image into one of 12 categories, including roads, street signs, and pedestrians — on a per-pixel basis.
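Conceptually, per-pixel classification reduces to choosing, at every pixel, the class with the highest score. A toy sketch, using random scores and a shortened, hypothetical class list in place of SegNet’s twelve categories:

```python
import numpy as np

# Hypothetical class list -- SegNet's actual 12 categories include
# roads, street signs, and pedestrians; shortened here for illustration.
CLASSES = ["road", "sign", "pedestrian"]

rng = np.random.default_rng(0)
# Stand-in for the per-pixel class scores a segmentation network outputs:
# one score per class for every pixel of a 4x4 image.
scores = rng.random((4, 4, len(CLASSES)))

# Semantic segmentation = argmax over the class axis, per pixel,
# yielding one label map the same size as the image.
label_map = scores.argmax(axis=-1)
print(label_map)
```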
To “train” the software, Alex Kendall, a PhD student in the university’s Department of Engineering, and other Cambridge scholars provided examples, manually labeling each pixel in an estimated 5,000 images. The trained software then detects, in real time, the traits that mark out recognizable objects.
Although the idea of labeling millions of pixels is as tedious as it sounds, the students employed clever tricks to simplify the task. Kendall and the team, led by professor Roberto Cipolla, turned the millions of pixels into “super-pixels,” effectively grouping similar areas of an image together. The researchers then only needed to label the much smaller set of larger-scale pixels.
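A toy illustration of why superpixels cut the labeling effort: annotating one superpixel labels every pixel inside it at once. This sketch groups pixels into simple grid blocks, where a real pipeline would use a similarity-based method such as SLIC; the block layout and class names are invented for the example.

```python
import numpy as np

def grid_superpixels(h, w, block):
    """Assign each pixel a superpixel id by grouping into block x block
    tiles (real pipelines group by appearance similarity instead)."""
    rows = np.arange(h) // block
    cols = np.arange(w) // block
    n_cols = -(-w // block)  # superpixels per row, rounded up
    return rows[:, None] * n_cols + cols[None, :]

sp = grid_superpixels(6, 6, 3)  # 4 superpixels cover a 6x6 image

# Four annotations label all 36 pixels.
superpixel_label = {0: "road", 1: "sky", 2: "road", 3: "pavement"}
names = sorted(set(superpixel_label.values()))

pixel_labels = np.full((6, 6), -1)
for sp_id, name in superpixel_label.items():
    pixel_labels[sp == sp_id] = names.index(name)

print(pixel_labels)
```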
Structure from Motion: The ‘Where’
SegNet decides “what” the image is; the neural network’s localization system answers the question of “where.”
The localization technology, which outputs a camera’s 3D position in space, also requires training data. In April 2015, Kendall and his team collected example images along a two-kilometer span of King’s Parade in Cambridge, recording smartphone video while walking through the urban environment.
Using a computer vision technique called “Structure from Motion,” the researchers instructed the system by providing the images with their corresponding location. The “motion” part of the process refers to the movement of the camera.
As the camera moves through a scene, the Structure from Motion technique reconstructs the 3D geometry by computing corresponding points, like landmarks, between successive video frames. From these correspondences, it triangulates estimates of the camera’s position and the landmarks’ positions simultaneously.
Using Structure from Motion, each frame is labeled with a six-degree-of-freedom camera pose. The neural network then uses the dataset as training information to output camera pose when presented with an image.
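A common way to encode such a six-degree-of-freedom label is a 3D position plus a unit quaternion for orientation, with a training loss that weights the two error terms. The sketch below assumes that representation; the β value and poses are illustrative, not the team’s actual settings.

```python
import numpy as np

def pose_loss(pred_pos, pred_quat, true_pos, true_quat, beta=250.0):
    """Weighted position + orientation error of the kind used when
    regressing a 6-DoF camera pose from an image. beta balances the
    very different scales of metres and quaternion distance."""
    pred_quat = pred_quat / np.linalg.norm(pred_quat)  # keep unit length
    pos_err = np.linalg.norm(pred_pos - true_pos)
    quat_err = np.linalg.norm(pred_quat - true_quat)
    return pos_err + beta * quat_err

true_pos = np.array([10.0, 2.0, 0.5])       # metres along the street
true_quat = np.array([1.0, 0.0, 0.0, 0.0])  # identity rotation

# A perfect prediction gives zero loss; errors in either position or
# orientation increase it.
print(pose_loss(true_pos, true_quat, true_pos, true_quat))
```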
The orientation technology operates where GPS is unavailable, such as in museums, tunnels, or parts of the city without a reliable signal.
“It doesn't explicitly require, say, a database of landmarks like previous systems do,” Kendall said. “It implicitly learns within the system.”
The large amount of training data also helps to address the challenge of a variance in an image’s light, appearance, and geometry. The abundance of examples, according to Kendall, improves the robustness of the system and helps the technology more easily determine locations in real time.
Taking Computer Vision for a Spin
Along King’s Parade in central Cambridge, the university researchers demonstrated the effectiveness of the localization technology. By analyzing a color photo, the deep-learning software detected location and orientation within a few meters.
Trained on data from around the globe, the system now comprises some ten million parameters. The localization system has also been tested in museum environments, a residential house, and a factory; SegNet was trained primarily in highway and urban environments.
Both systems — SegNet and the orientation technology — could someday be employed to address challenges associated with self-driving cars, such as navigation and object detection. SegNet, for example, could be used to identify road obstacles and provide a warning system for drivers, Kendall said.
The technologies are still in their early stages, however, and may find their way into the living room before the driver’s seat. The systems could support the guidance of domestic robots, like autonomous vacuum cleaners, or assist the blind, for example. Other applications beyond autonomous robots include augmented reality, which requires knowledge of a camera’s location before overlaying scenes for the viewer.
Kendall, who is also working to run all of the processing on a smartphone as an app, ultimately sees computer vision’s role as being on the road.
“At the moment, the solutions you see from the big tech companies overload the car with sensors like laser and radar,” Kendall said. “We’re stepping towards a model that really mimics the human visual system, and tries to demonstrate that maybe we can do this with just vision alone in a way that’s robust enough to be safe on our everyday roads.”