The presence of more than 1 billion sensor-rich smartphones and the intense interest surrounding the Internet of Things have drawn wide attention to the potential of sensor fusion engines. The availability of context data, and of real-world data in general, in digital format opens up many opportunities.

But what exactly is sensor fusion? Edge devices capture the analog world through temperature, motion, moisture, or other data. The mystery, however, is the “fusion,” where all the software innovation is taking place.

Measuring the speed of an athlete, for example, requires more than just an accelerometer. Among other inputs, one needs to determine the athlete’s direction of travel. Accelerometer readings are sensitive to many disturbances over short periods. To compensate for that shortcoming, gyroscope data provides reliable short-term angle estimates, and a magnetometer’s readings correct for any accumulated “sensor drift” or other inaccuracies.

Performing a fusion of the data from the three sources gives a more precise estimate of the speed of the athlete. Data from each sensor corrects for the shortcomings of the others. The theory behind the operation and accuracy of each sensor type is complex, and the fusion algorithms are seen as valuable intellectual property.
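One common way to let two sensors correct each other is a complementary filter. The sketch below is only an illustration under stated assumptions, not the proprietary algorithms the article refers to: the function names, the blend factor `alpha`, the simulated gyroscope bias, and the choice to ignore wrap-around at 360 degrees are all hypothetical.

```python
def fuse_heading(prev_heading, gyro_rate_dps, mag_heading, dt, alpha=0.98):
    """Blend an integrated gyroscope rate with a magnetometer heading.

    alpha close to 1 trusts the gyroscope over short intervals, while the
    remaining (1 - alpha) slowly pulls the estimate toward the magnetometer.
    Heading wrap-around at 360 degrees is ignored in this sketch.
    """
    gyro_heading = prev_heading + gyro_rate_dps * dt  # short-term estimate
    return alpha * gyro_heading + (1.0 - alpha) * mag_heading


def run_demo(steps=1000, dt=0.01, gyro_bias_dps=0.5, true_heading=30.0):
    """Simulate a stationary device whose gyroscope has a constant bias."""
    gyro_only = true_heading
    fused = true_heading
    for _ in range(steps):
        gyro_rate = 0.0 + gyro_bias_dps       # true rate is zero, bias is not
        gyro_only += gyro_rate * dt           # integration alone drifts away
        fused = fuse_heading(fused, gyro_rate, true_heading, dt)
    return gyro_only, fused
```

In this simulation the gyroscope-only estimate drifts by the integrated bias, while the fused estimate stays close to the magnetometer reference.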

The larger the number of data sources, the more complex the fusion algorithms and the closer we get to the real-life analog context. If sensor fusion is the process by which data points from multiple sensors are combined to extract the best estimate of a system context being observed, then how does one go about building such a thing?

#### Step 1: Get Your Analog Sensor on the Bus

You do not need to be an expert in the physics of the sensor to integrate it in a System on Chip (SoC). To become part of a fusion engine, the sensor needs only a register set and an Advanced Peripheral Bus (APB) interface.

The Advanced Peripheral Bus is designed for low-bandwidth control accesses — for example, register interfaces on system peripherals such as sensors. The APB includes an address and data phase that have a reduced, low-complexity signal list. If the data throughput from the sensor is high, then one can consider the Advanced High-performance Bus (AHB), an industry-standard bus protocol (see Figure 1).

When a sensor cannot be physically integrated into the SoC, one can instead integrate the peripheral interface — I2C or SPI, for example — required to communicate with the sensor. Industry-standard interface blocks are available from many sources.
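Whatever the interface, the host side ends up turning raw register bytes into engineering units. The following is a minimal sketch of that decoding step, assuming a hypothetical sensor that returns each axis as a two-byte, big-endian, two's-complement value with 16384 counts per g. The real byte order and scale factor come from the sensor's datasheet.

```python
def decode_axis(raw: bytes, lsb_per_g: float = 16384.0) -> float:
    """Convert one two-byte sample read over I2C or SPI into units of g.

    The big-endian layout and the default scale factor are assumptions
    made for illustration only.
    """
    if len(raw) != 2:
        raise ValueError("expected exactly two bytes")
    counts = int.from_bytes(raw, byteorder="big", signed=True)
    return counts / lsb_per_g
```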

#### Step 2: Design Tradeoffs

Once all the sensors are on the bus, the rest of the system must be built with an eye on keeping the balance between cost, power, and productivity.

To meet cost constraints, it is important to reduce the number of components in the device. Having only one processing unit is valuable, as opposed to multiple dedicated ones, such as a microcontroller and a DSP (digital signal processing) block. (Keep in mind that some microcontrollers also have DSP capabilities.)

A DMA (direct memory access) block helps keep power consumption down. The processor cannot be kept busy servicing every incoming data point from every sensor, whatever the data rate. Most of the system has to sleep for as much of the time as possible; otherwise the power consumption will be too high. A smart DMA block handles the incoming data points and only “wakes up” the processor when enough data has accumulated to merit processing.
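The wake-on-threshold behavior can be modeled in a few lines. This is a software simulation only, with a hypothetical `WatermarkBuffer` class and watermark value; a real DMA controller does this in hardware and signals the processor with an interrupt.

```python
class WatermarkBuffer:
    """Collect samples and hand them over in batches of `watermark` samples."""

    def __init__(self, watermark):
        self.watermark = watermark
        self.samples = []
        self.wakeups = 0

    def push(self, sample, on_wake):
        """Store one sample; call on_wake(batch) only when the batch is full."""
        self.samples.append(sample)
        if len(self.samples) >= self.watermark:
            batch, self.samples = self.samples, []
            self.wakeups += 1          # the processor leaves sleep only here
            on_wake(batch)


def process_stream(stream, watermark):
    """Feed a stream through the buffer and report how often the CPU woke up."""
    batches = []
    buf = WatermarkBuffer(watermark)
    for sample in stream:
        buf.push(sample, batches.append)
    return buf.wakeups, batches, buf.samples
```

With 100 samples and a watermark of 32, the processor is woken three times instead of 100, and four samples remain buffered for the next batch.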

One can build a highly complex design — with distributed processing for extreme low power — but then spend years trying to build the software. Productivity is about getting the solution out the door in a reasonable amount of time. The best setup for software development is a sufficient choice of tools and a simple debug process (see Figure 2).

#### Step 3: The Fusion Algorithms

To write the software that will analyze and fuse the data from the various sensors, the right processor must be selected. To size the processing power correctly, designers need to determine the complexity of the algorithms that are needed. Could a small general-purpose processor be used, for example? Or are high-performance, DSP, or even SIMD instructions required?

Setting the data types and structures is typically a large part of the design of any algorithm. For sensor data, there is a choice between three types: fixed-point, single-precision floating point, and double-precision floating point.

Integer is a fixed-point data type; a variable typically uses 32 bits to hold positive and negative whole numbers. The data type has less dynamic range than the floating-point options.

Storing real numbers as integers may also result in conversion and arithmetic round-off errors. Typical integer data includes output from sensors measuring real-world values, such as audio signals using 12-bit A/D output, or image files using 8 or 24 bits per pixel. In integer math, rules must be established for what happens when the result of an arithmetic operation is greater than the upper bound of the integer data type (for example, whether the variable saturates when the result exceeds the container) and when the result is less than the lower bound.
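The difference between the two usual overflow rules can be sketched as follows. Python integers never overflow, so the 32-bit limits are imposed explicitly here; the function names are hypothetical.

```python
INT32_MAX = 2**31 - 1
INT32_MIN = -(2**31)


def sat_add32(a, b):
    """Add two values and clamp the result to the signed 32-bit range."""
    return max(INT32_MIN, min(INT32_MAX, a + b))


def wrap_add32(a, b):
    """Add two values with two's-complement wrap-around at 32 bits."""
    return (a + b + 2**31) % 2**32 - 2**31
```

Saturation keeps an overflowing result pinned at the nearest bound, whereas wrap-around turns a large positive result into a large negative one, which is usually far more damaging in a filter or control loop.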

Single-precision, a floating-point data type, has a reasonable dynamic range; variables use 32 bits to represent values accurately to approximately seven significant decimal digits. Single-precision is ideal for storing and processing real number values where a lower level of accuracy than double precision is acceptable.

Double-precision is a floating-point data type with a larger dynamic range; variables use 64 bits to represent values accurately to approximately 15 significant decimal digits.
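The precision difference is easy to observe by rounding a double to single precision and back. The helper name `to_float32` is hypothetical; it relies only on the standard `struct` module, and Python's own `float` is a double.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# An eight-digit whole number no longer fits in single precision:
# 16777217 is 2**24 + 1, one more than the largest run of consecutive
# integers that a 24-bit significand can represent.
big = 16777217.0
rounded = to_float32(big)   # the final digit is lost
```

A double holds `16777217.0` exactly, while the single-precision copy lands on the neighboring value `16777216.0`.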

With fixed-point variables, the gap between adjacent numbers is constant (exactly one for integers). In floating-point notation, the gap between adjacent numbers varies over the represented number range. If math can be done on integer and single-precision data, then one can store such data more efficiently and avoid having to convert it to doubles before processing. Fixed-point has saturation issues, while floating point has round-off errors that cause the value of the final result, after many operations, to gradually drift away from its expected value.
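Both effects, the varying gap and the gradual drift, can be reproduced with a short experiment. This is a sketch with hypothetical helper names; single precision is emulated by rounding through the standard `struct` module after every operation, and `float32_gap` assumes a positive, finite input.

```python
import struct


def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def float32_gap(x):
    """Distance from x (as single precision) to the next larger value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here


def accumulate(step, n, use_float32):
    """Add `step` to a running total n times, optionally in single precision."""
    inc = to_float32(step) if use_float32 else step
    total = 0.0
    for _ in range(n):
        total += inc
        if use_float32:
            total = to_float32(total)   # round after every addition
    return total
```

The gap grows with the magnitude of the number: it is `2**-23` near 1.0 but a full 2.0 near 16777216. Adding 0.1 one hundred thousand times in single precision therefore misses 10000 by a clearly visible margin, while the double-precision total stays within a tiny fraction of that error.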

Variable types must be selected to match the data being processed. A processing unit needs to enable operations with maximum precision, minimum errors, and reasonable memory and power usage. An easy choice would be to use double precision for all variables, but that doubles the amount of memory required compared with single precision. Additionally, power consumption increases significantly since all operations take place on 64 bits.
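The memory cost is simple to quantify. The sketch below uses the standard `array` module, whose typecodes `'h'`, `'f'`, and `'d'` map to a C short, float, and double (2, 4, and 8 bytes on common platforms); the buffer length is an arbitrary example.

```python
from array import array


def buffer_bytes(typecode, n_samples):
    """Return the storage needed for n_samples values of the given type."""
    buf = array(typecode, [0] * n_samples)
    return buf.itemsize * len(buf)


sizes = {
    "int16": buffer_bytes("h", 1024),
    "single": buffer_bytes("f", 1024),
    "double": buffer_bytes("d", 1024),
}
```

For a 1024-sample buffer this comes to 2 KB, 4 KB, and 8 KB respectively, so the choice of type directly sets how much on-chip RAM each sensor stream consumes.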

#### Step 4: Finding the Right Instructions

There are many processors and DSP engines on the market that support these data types. In the end, there is no automatic decision process that will output the right processor choice. The decision is tightly coupled to the set of algorithms that will be used and the precision required for each. Keep in mind that, given the right software libraries, any processor can run any of this code. There are many library implementations of single- and double-precision floating-point operations, for example.

Power and performance must be balanced. A floating-point operation performed by a library function may take up to ten times as long as the same operation on a processor that has a floating-point unit. If performance were the only criterion, one would simply take the most advanced high-performance processor, or an array of them, to build the fusion engine.

In selecting a processor, it is important to confirm that the instruction set fits the operations to be performed on the sensor data. For example, a single-cycle multiply with a 32-bit result is a critical instruction to have for sensor processing algorithms that perform many multiplications.

There are a number of other examples. The long multiply instruction multiplies two 32-bit integers and saves the result in a 64-bit integer. The instruction is essential for multiplying 32-bit fixed-point numbers, because the full product needs up to 64 bits before it is scaled back to the working format.
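A fixed-point multiply shows why the wide intermediate result matters. The sketch assumes a Q16.16 format (16 integer bits, 16 fractional bits), which the article does not specify, and the helper names are hypothetical. Python integers are unbounded, so the 64-bit intermediate that a long multiply would hold is implicit here.

```python
Q = 16  # number of fractional bits in the assumed Q16.16 format


def to_q16(x):
    """Convert a real number to Q16.16."""
    return int(round(x * (1 << Q)))


def from_q16(q):
    """Convert a Q16.16 value back to a real number."""
    return q / (1 << Q)


def q16_mul(a, b):
    """Multiply two Q16.16 values.

    The raw product carries 32 fractional bits and can need up to 64 bits,
    which is the result a long multiply instruction provides in hardware.
    Shifting right by Q restores the Q16.16 scaling.
    """
    product = a * b
    return product >> Q
```

Multiplying 1.5 by 2.25 this way yields exactly 3.375, but the intermediate product is 34 bits wide and would already overflow a 32-bit register.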

The flexible second operand (Operand 2) allows an input to be shifted by a constant or by a register value. Fixed-point algorithms need many shifts, for instance when operating on or converting between fixed-point variables that have different formats.
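Converting between fixed-point formats is one such shift. In this sketch the formats are described only by their number of fractional bits, and `convert_q` is a hypothetical helper; range checking of the destination format is left out.

```python
def convert_q(value, from_frac_bits, to_frac_bits):
    """Rescale a fixed-point value to a different number of fractional bits.

    A left shift adds fractional bits (more resolution, less headroom);
    a right shift drops them (less resolution, more headroom).
    """
    shift = to_frac_bits - from_frac_bits
    if shift >= 0:
        return value << shift
    return value >> -shift


one_and_a_half_q16 = 98304                           # 1.5 with 16 fractional bits
as_q24 = convert_q(one_and_a_half_q16, 16, 24)       # 1.5 with 24 fractional bits
as_q12 = convert_q(one_and_a_half_q16, 16, 12)       # 1.5 with 12 fractional bits
```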

Additionally, a hardware divide provides a measurable performance benefit over performing the same divide in software. The integer division instruction offers a 32-bit divide. For fixed-point math, a common case is to divide a 64-bit number by a 32-bit number, yielding a 32-bit result.
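The 64-by-32 case arises because the dividend must be pre-scaled before dividing. A minimal sketch, again assuming a Q16.16 format and hypothetical helper names, looks like this. Note that Python's `//` rounds toward negative infinity, whereas a C implementation would truncate toward zero for negative operands.

```python
Q = 16  # number of fractional bits in the assumed Q16.16 format


def to_q16(x):
    """Convert a real number to Q16.16."""
    return int(round(x * (1 << Q)))


def q16_div(a, b):
    """Divide two Q16.16 values.

    Shifting the dividend left by Q first keeps the quotient in Q16.16,
    but it also widens a 32-bit dividend to as many as 48 bits, hence the
    need for a 64-bit by 32-bit divide.
    """
    if b == 0:
        raise ZeroDivisionError("fixed-point division by zero")
    return (a << Q) // b
```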

A floating point unit dramatically speeds up algorithms that cannot be easily adapted to fixed-point computation, like matrix decomposition, which otherwise would rely on a soft-float implementation. In addition, the fixed-point math can be further improved by leveraging many of the DSP extensions to the instruction set, including SIMD and instructions with saturating arithmetic.

A double-precision floating point unit, or compute block, is critical if the choice is made to use the double-precision data type; otherwise the processing time may increase by a factor of 10.

#### Step 5: Output the Context

The SoC design is done, the algorithms are written and optimized, the fusion operation is tuned for the application in mind, and all that remains now is to integrate with a host. Usually, for cost reasons, the fusion engine is a slave to a host. At system power-up, the host downloads the software to the fusion engine's internal memory. The download removes the need to embed flash memory in the fusion engine, reducing power consumption and cost.

During the operation of the whole system, the host expects to receive indications of major context changes via the host interface, which can be as simple as SPI. There is no industry-standard application program interface (API); however, major mobile operating systems have defined their own APIs for the exchange.
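Reporting only major context changes keeps traffic on the host interface low. The sketch below illustrates the idea with an entirely hypothetical message layout (a one-byte context code followed by a four-byte little-endian timestamp) and hypothetical context names; it does not correspond to any operating system's API.

```python
import struct

# Hypothetical context codes and frame layout, for illustration only.
CONTEXT_CODES = {"still": 0, "walking": 1, "running": 2}
FRAME = struct.Struct("<BI")  # context code (1 byte), timestamp in ms (4 bytes)


def context_frames(samples):
    """Build one frame per context change.

    `samples` is an iterable of (timestamp_ms, context_name) pairs.
    Repeated reports of the same context produce no additional frame.
    """
    frames = []
    last = None
    for timestamp_ms, context in samples:
        if context != last:
            frames.append(FRAME.pack(CONTEXT_CODES[context], timestamp_ms))
            last = context
    return frames


frames = context_frames([
    (0, "still"), (100, "still"), (200, "walking"),
    (300, "walking"), (400, "running"),
])
```

Five classifier outputs result in only three five-byte frames for the host to read.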

#### Conclusion

Building a sensor fusion engine is a straightforward exercise given the amount of technology available off the shelf in the market today. A good sequence to follow is to focus first on the desired overall context output. Then, select the algorithms that are required to extract that context. For each of those algorithms, examine the operations required for the computations used, and map those to processor instructions for optimal performance. If many iterations are expected while number crunching, then make sure to select the appropriate data types to avoid round-off and saturation errors.

Finally, keep an eye out for unexpected optimization opportunities. For example, some processors have what is referred to as a Multiply Accumulate Unit (MAC). Instead of using a multiply instruction followed by an addition instruction, one could use the MAC to do the multiply-add operation in a single step. This dramatically speeds up all sorts of filtering operations. Selecting the most suitable processing unit will deliver the desired output with a delicate balance of cost, power, and productivity.
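A finite impulse response (FIR) filter shows why the multiply-accumulate matters: its inner loop is nothing else. This is a plain Python sketch with a hypothetical function name and example coefficients; on a processor with a MAC unit, each pass through the inner loop would map to a single instruction.

```python
def fir_filter(samples, coeffs):
    """Apply an FIR filter, one multiply-accumulate per tap per sample."""
    history = [0.0] * len(coeffs)       # most recent sample first
    output = []
    for x in samples:
        history = [x] + history[:-1]    # shift the new sample in
        acc = 0.0
        for h, c in zip(history, coeffs):
            acc += h * c                # the multiply-accumulate step
        output.append(acc)
    return output


# Four-tap moving average applied to a constant input of 4.0.
smoothed = fir_filter([4.0] * 5, [0.25] * 4)
```

The output ramps up over the first four samples as the filter history fills, then settles at the input value. A filter with N taps costs N multiply-accumulates per sample, so halving the cost of that one step nearly halves the cost of the whole filter.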

This article was written by Diya Soubra, senior product marketing manager for ARM Cortex-M Processors, ARM, Inc. (San Jose, CA).