Autonomous vehicles (AVs), and even vehicles with advanced driver assistance systems (ADAS), rely on data from an array of sensors: multiple cameras, lidar, radar, and sometimes sonar. Handling the streams of data coming from this array of sensors is an enormous and critical task. All of that data must be turned into information in real time so it can be used to drive the car at least as safely as a perfect human driver would. “On the road, human drivers need to be wary of their current surroundings, interact with other drivers, and make decisions. Like human drivers, AVs should perceive, interact, and make decisions as well. Further, AVs should build a good relationship with their passengers.”1
These functions rely on artificial intelligence (AI) to assimilate the data from the different sensors and combine it into an instantaneous picture of the vehicle and its dynamic environment, a process called sensor fusion. For automotive use, the AI typically employs deep neural networks (DNNs). Modelled on the way the human brain processes information, DNNs learn to navigate the real world of driving from experience, rather than by being told what to do by a programmer. A DNN accepts multiple inputs, assigns different weights to them, and draws inferences. Doing all of that in “real time” requires an extremely high-performing yet energy-efficient computing platform, which can be sped up with a processing accelerator.
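The idea of "assigning weights to inputs and drawing inferences" can be sketched with a single artificial neuron. All numbers below are illustrative; real DNNs stack millions of such units, and their weights come from training rather than hand-coding.

```python
# A minimal sketch of weighted-input inference: one neuron computes a
# weighted sum of its inputs and passes it through a nonlinearity.

def relu(x):
    # A common activation function: pass positives, clamp negatives to zero.
    return max(0.0, x)

def neuron(inputs, weights, bias):
    # Weighted sum of the inputs, then the nonlinear activation.
    total = sum(w * x for w, x in zip(weights, inputs))
    return relu(total + bias)

# Two hypothetical sensor-derived features: obstacle proximity, closing speed.
features = [0.8, 0.3]
weights = [0.9, -0.4]  # in a real DNN, learned from experience
score = neuron(features, weights, 0.1)  # a single inference value
```

In a trained network, layers of these weighted sums are what turn raw sensor data into inferences such as "obstacle ahead."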
Putting AI into Action
To learn what goes into a high-performing platform for automotive AI, I interviewed Gil Abraham of CEVA, Inc. about their tools for implementing it.
CEVA's Vision and AI business unit rests on what Abraham called three pillars: NeuPro-M, the AI processor; SensPro, a high-performance sensor-hub digital signal processor (DSP); and the CDNN-Invite software. CDNN-Invite allows manufacturers to plug their own proprietary DNN accelerators in alongside CEVA's NeuPro-M AI processor and the SensPro DSP, yielding one unified system that can be centrally managed by the same memory and flow software development kit (SDK).
The SensPro DSP can perform signal processing on inputs from many different sensors, including multiple lidars, radars, and cameras, and combine them through sensor fusion. This is critical for automotive use because each sensor type has its own limitations.
Cameras can have very high resolution, so they can sense fine details. But a vehicle typically needs several of them: cameras covering the blind spots, rear view, and surround view, plus several front-facing cameras with different focal lengths. Each camera provides a separate data stream. Cameras also cannot be relied upon to function well at night or when facing into the sun.
Radar functions well at night and can provide distance measurements. But it lacks the resolution to pinpoint the precise location of an object or to distinguish between multiple objects that are close together. It can fail to detect stationary or slow-moving objects, and it cannot capture the “semantics” of a scene, such as the color and shape of objects; for that you need cameras.
Lidar acts as its own light source, so it performs well in both darkness and daylight. It also provides rapid and accurate measurement data, with resolution high enough for precise real-time free-space detection while tracking multiple objects within a scene. But to have enough lidar data points, you would need five or six of these expensive devices mounted on a vehicle.
The SensPro DSP can take inputs from all of these, as well as from time-of-flight (ToF) sensors and inertial measurement units (IMUs), and efficiently run algorithms for simultaneous localization and mapping (SLAM). Using them all, it can create “contextual awareness”: a full picture of the automobile's situation and environment. A dedicated instruction set can also be added in hardware to further accelerate other specific processing if needed.
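One simple, textbook way to combine readings from complementary sensors is inverse-variance weighting: each sensor's estimate is weighted by how trustworthy it is. This is a generic fusion rule for illustration, not CEVA's actual algorithm, and the sensor numbers are made up.

```python
# Toy sensor fusion: combine two noisy distance estimates (say, radar and
# lidar) by weighting each with the inverse of its variance.

def fuse(est_a, var_a, est_b, var_b):
    # The lower-variance (more trustworthy) sensor contributes more.
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)
    return fused, fused_var

# Radar: 25.0 m, high variance. Lidar: 24.2 m, low variance.
dist, var = fuse(25.0, 4.0, 24.2, 0.25)
# The fused estimate leans toward the lidar reading, and its variance is
# smaller than either sensor's alone.
```

The payoff of fusion is visible even in this toy: the combined estimate is more certain than any single sensor's, which is why each sensor's individual weakness matters less in a fused system.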
Integrating the System
NeuPro-M addresses many of the key challenges of autonomous vehicle functions, including high-speed, low-latency operation; low operating power; high security; and the ability to meet the functional requirements of ISO 26262, the standard for safety-related electrical/electronic systems in production road vehicles. It is also scalable: it can serve single sensors or clusters of sensors in a zone, or it can even be embedded as part of the automobile's engine control unit (ECU).
The Architecture of a High-Performing Automotive AI Platform
The function of the CEVA NeuPro-M AI processor is to make high-level driving decisions. These decisions are based on DNN inputs that provide situational information: Are there oncoming cars? Is the automobile staying in its lane? What is its absolute location (SLAM)?
The NeuPro-M AI processor decides what actions to take, given the information it receives. AI requires a massive number of calculations to make decisions, and they must be done in close to real time, so a powerful processor is needed. For automotive applications, power consumption must also be minimized; as the number of calculations grows and the time allowed to perform them shrinks, the power consumed rises. The processor's efficiency can be summed up in one number, tera operations per second per watt (TOPS/watt): the higher the number, the better. The NeuPro-M delivers 24 TOPS/watt, significantly higher than most other automotive AI processors.
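The TOPS/watt figure lends itself to back-of-the-envelope arithmetic. The 24 TOPS/watt value is quoted in the text; the workload number below is purely illustrative.

```python
# What an efficiency figure of merit buys you: power needed to sustain a
# given throughput is throughput divided by efficiency.

EFFICIENCY_TOPS_PER_WATT = 24.0  # quoted for NeuPro-M

def power_for_throughput(tops_required):
    # Watts needed to sustain `tops_required` tera-operations per second.
    return tops_required / EFFICIENCY_TOPS_PER_WATT

# A hypothetical workload needing 48 tera-operations per second:
watts = power_for_throughput(48.0)  # -> 2.0 watts
```

The same arithmetic run the other way shows why the metric matters: within a fixed automotive power budget, higher TOPS/watt directly translates into more inferencing throughput.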
Security is of utmost importance; indeed, security and safety are the most important requirements for automotive applications. The AI processor in an autonomous vehicle is the driver-in-charge. The neural networks that perform the AI processing assign weights to each of their inputs, and those weights are vulnerable to malicious tampering. Security systems are therefore built into the processor to guard against such attacks.
To optimize performance, it is important to realize that no matter how fast the processor is, the system can hit a bottleneck in moving information to it, primarily because of memory-interface limitations: the computation units work much faster than the huge amounts of data can be stored to and fetched from memory. One way to address this is to direct a continuous flow of data to the processor rather than waiting for enough information to accumulate before performing a particular calculation.
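The "continuous flow" idea can be sketched as a producer/consumer pipeline: compute starts on the first chunk of data while later chunks are still being fetched, instead of waiting for the whole dataset. This is a generic streaming pattern, not the NPM's actual data path, and the chunk contents are illustrative.

```python
# Streaming sketch: a fetch thread feeds chunks through a small bounded
# buffer while the compute loop consumes them, overlapping the two stages.
import threading
import queue

def fetch(chunks, q):
    # Producer: streams chunks toward the compute stage as they arrive.
    for chunk in chunks:
        q.put(chunk)
    q.put(None)  # end-of-stream marker

def compute(q, results):
    # Consumer: begins working as soon as the first chunk lands.
    while (chunk := q.get()) is not None:
        results.append(sum(chunk))  # stand-in for real processing

chunks = [[1, 2], [3, 4], [5, 6]]
q = queue.Queue(maxsize=2)  # small buffer keeps data flowing, bounds memory
results = []
t = threading.Thread(target=fetch, args=(chunks, q))
t.start()
compute(q, results)
t.join()
# results now holds one partial result per chunk: [3, 7, 11]
```

The bounded queue is the key design point: it lets fetch and compute overlap without ever requiring the full dataset to sit in memory at once.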
Another way to optimize performance is to be adaptive, using a modular, adaptive topology. Processor topologies can be optimized for processing different types of sensors and performing different types of operations. One function, for example, might be optimizing powertrain efficiency, which requires heavy mathematics; another might be processing just a single sensor.
If something must be calculated very accurately, floating-point arithmetic can be used instead of fixed-point inside a vector processor. Bandwidth demand can also be reduced by compressing the data, so that not all of it has to move through the system. These are software fixes, but the DSPs and the AI processor must be optimized as well, by taking a deep dive into the bottlenecks in each of these areas.
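The fixed-point versus floating-point trade-off can be made concrete with a simple quantization sketch: storing a value as an 8-bit integer plus a scale uses a quarter of float32's space, at the cost of bounded rounding error. The scale value here is arbitrary, chosen only for illustration.

```python
# Quantize a real value to 8-bit signed fixed point and recover it.

def quantize(x, scale):
    # Map a real value onto the signed 8-bit range [-128, 127].
    q = round(x / scale)
    return max(-128, min(127, q))

def dequantize(q, scale):
    return q * scale

scale = 0.05
x = 1.234
q = quantize(x, scale)        # stored and moved as one byte, not four
x_hat = dequantize(q, scale)  # recovered value, within +/- scale/2 of x
```

For workloads like SLAM that need more precision than this error bound allows, the floating-point path is used instead; for many DNN layers, 8-bit fixed point is accurate enough and moves a quarter of the data.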
The NeuPro-M (NPM) processor has three parts (see Figure 2): the master controller, the NPM common subsystem, and the NPM engines. A processor can include anywhere from one to eight engines, chosen to meet the needs of a particular application; scaling the number of engines scales the processor's performance. “That's how you get more and more horsepower,” said Abraham.
The NPM common subsystem is in constant communication with the NPM engines, and that channel is monitored to make sure it does not become a bottleneck, so that data keeps flowing into the system. Inferencing runs on two datasets: the data itself, perhaps an image, and the weights applied to that data to perform the inferencing. The common subsystem keeps the channel open by compressing both the data and the weights.
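One reason weight compression pays off is that trained DNN weights are often sparse, with long runs of zeros. A toy zero-run-length encoding shows the effect; real accelerators use far more elaborate schemes, and this encoding is only illustrative.

```python
# Toy lossless compression for sparse weights: collapse runs of zeros
# into single ("Z", run_length) tokens; keep nonzeros as ("V", value).

def compress(weights):
    out, zeros = [], 0
    for w in weights:
        if w == 0:
            zeros += 1
        else:
            if zeros:
                out.append(("Z", zeros))  # one token for a whole zero run
                zeros = 0
            out.append(("V", w))
    if zeros:
        out.append(("Z", zeros))
    return out

def decompress(tokens):
    out = []
    for kind, val in tokens:
        out.extend([0] * val if kind == "Z" else [val])
    return out

weights = [0, 0, 0, 0.5, 0, 0, 0, 0, -0.25, 0]
packed = compress(weights)  # 5 tokens stand in for 10 values
```

Fewer tokens crossing the memory channel means the compute units wait less, which is exactly the bottleneck the common subsystem is guarding against.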
Parallel processing can be implemented both across multiple engines and within each engine, which contains five coprocessors and a shared internal memory.
Example — Controlling a Vehicle with a Four-Engine NPM
Figure 3 illustrates a simple automotive application of parallel processing. The left side of the figure shows an image of the road captured by a front-facing camera. A processor within the vehicle blocks out the opposing lane, to simplify the computations needed to keep the vehicle centered on its side of the road, and stores the image in memory. The stored image is fed from the vehicle's memory into the NPM common subsystem, which in this example serves four engines. The software then decides what the use case requires and how to divide the image to attain maximum performance with minimum power (high utilization) for the desired function. In this case, the NPM divides the image into four parts, with some overlap, and sends each part to a different engine. AI inferencing is then run on each of the four segments of the road. The four segments are stitched back together in the subsystem memory, from which the result is output to the perception layer elsewhere in the SoC to perform the desired tasks.
This example illustrates the two levels of parallel processing: the four engines working on different segments of the image, and, within each engine, the computations being shared among the five internal coprocessors.
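The split-with-overlap-and-stitch flow described above can be sketched in a few lines. The overlap width and the stand-in "inference" (which just increments each pixel) are illustrative; only the tiling pattern mirrors the Figure 3 example.

```python
# Split an image row-wise into overlapping tiles, process each tile
# independently (as a separate engine would), then stitch the results.

def split_with_overlap(rows, parts, overlap):
    # Each tile covers its core strip plus `overlap` rows on each side.
    h = len(rows)
    step = h // parts
    tiles = []
    for i in range(parts):
        core_start = i * step
        core_end = (i + 1) * step if i < parts - 1 else h
        start = max(0, core_start - overlap)
        end = min(h, core_end + overlap)
        tiles.append((core_start, core_end, rows[start:end], core_start - start))
    return tiles

def infer(tile_rows):
    # Stand-in for per-engine AI inferencing.
    return [[v + 1 for v in row] for row in tile_rows]

def stitch(tiles_out, h):
    # Keep only each tile's core rows; the overlaps are discarded.
    result = [None] * h
    for core_start, core_end, out_rows, offset in tiles_out:
        for r in range(core_start, core_end):
            result[r] = out_rows[offset + (r - core_start)]
    return result

image = [[r] * 4 for r in range(16)]  # 16 rows of dummy pixels
tiles = split_with_overlap(image, parts=4, overlap=1)
outputs = [(cs, ce, infer(rows), off) for cs, ce, rows, off in tiles]
stitched = stitch(outputs, len(image))
```

The overlap rows exist so that each tile's inference sees a little context beyond its own strip; the stitch step then discards them, keeping only each tile's core output.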
Optimization Via Software
AI functions chiefly through convolution, a mathematical operation on two functions that produces a third function expressing how the shape of one is modified by the other. The mathematician Shmuel Winograd devised a method of performing convolution in half the usual number of steps. CEVA implemented this theoretical idea in its processors, achieving the same precision as normal convolution with a nearly 2x acceleration: a gain in performance together with a reduction in power. This can be done in each of the five coprocessors within an engine.
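The smallest instance of Winograd's minimal filtering algorithms, known as F(2,3), shows the saving concretely: two outputs of a 3-tap convolution are computed with 4 multiplications instead of the naive 6, which is roughly the 2x reduction described above. (DNN convolution layers apply the same idea in 2D.) This is the standard published algorithm, not CEVA's specific implementation.

```python
# Winograd F(2,3): two outputs of a 3-tap 1D convolution in 4 multiplies.

def conv_naive(d, g):
    # Direct sliding-window convolution: 6 multiplications for 2 outputs.
    return [d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2]]

def conv_winograd(d, g):
    # The filter-side factors like (g0+g1+g2)/2 can be precomputed once
    # per filter, so only the four m* products count at run time.
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]   # four input samples
g = [0.5, 1.0, -0.25]      # a 3-tap filter
assert conv_winograd(d, g) == conv_naive(d, g)  # same result, fewer multiplies
```

Tiling a long signal (or an image) into such small blocks and applying the transform to each block is how the per-block saving turns into a throughput and power win across a whole convolution layer.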
Another trick is to operate differently on different data types, depending on which is optimal for a particular application. Simultaneous localization and mapping (SLAM), for example, requires very high accuracy, so floating-point arithmetic must be used; for other applications, a fixed number of bits is perfectly adequate. In this way, the automobile manufacturer can choose the computational method that works best for each function within the vehicle.
By using both software manipulation and hardware optimization, you can gain significant acceleration: up to 16x with NeuPro-M, according to Abraham.
Summing it Up
This has been an overview of the internal functioning of a particular AI processor as it processes data from a variety of sensors — radar, lidar, sonar, cameras — and makes decisions. The NPM is a heterogeneous processor: it can operate on different data types and optimize its operation, as measured by TOPS/watt, through two levels of parallel processing and targeted software design.
1. Fang Chen, PhD, SAE Edge Research Report — Unsettled Issues in Vehicle Autonomy, Artificial Intelligence, and Human-Machine Interaction.