Every day we use complex mathematics, hidden inside products. When we listen to a song from an MP3 player, we are using the Fast Fourier Transform (FFT), a “complex domain” algorithm. The best car and home entertainment system amplifiers have at least one Digital Signal Processor (DSP) performing complex math. Hands-free telephones have DSPs to perform acoustic echo canceling and background noise reduction using complex math.
The applications using complex math are countless, especially in the audio and communications realm. It is not surprising that the first chapters of so many DSP handbooks present complex arithmetic as a fundamental tool.
In mathematics, the complex numbers extend the real numbers so that every polynomial equation has a solution, including equations such as x·x = -1 that cannot be solved with real numbers alone. The ability to solve polynomial equations is closely tied to the ability to describe mathematically the behavior of any “linear system”, including analog electronic circuits and DSP algorithms. A complex number can be formally described as a pair of real numbers (a, b) or, more conventionally, by combining its real part (a) and imaginary part (b) using the imaginary unit “i”: x=a+i · b, where “i” is the “impossible to compute” square root of -1. Fortunately, describing complex numbers as pairs of real numbers lets us implement their arithmetic as a combination of real-number arithmetic, without dealing with “impossible” operations, as we will see later.
In spite of its importance for engineering and signal processing, complex mathematics is only rarely mapped into the processing structures of DSPs. This has two main implications. The first is that the programmer is forced to “think real” while the data to be processed are complex, and must translate every complex calculation into real math. This substantially increases programming complexity and slows down software development.
For example the Fast Fourier Transform (FFT) inner compute kernel, known as “FFT butterfly”, using complex numbers is simply expressed as follows:
X’= X + W · Y;
Y’ = X - W · Y;
With no mechanism in the DSP processor to process the complex numbers, the FFT butterfly must be expressed as follows:
TR = WR · YR - WI · YI;
TI = WR · YI + WI · YR;
X’R = XR + TR;
X’I = XI + TI;
Y’R = XR - TR;
Y’I = XI - TI;
The resulting number of instructions is 3 times higher!
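To make the comparison concrete, the two forms of the butterfly above can be sketched in C99. This is only an illustrative sketch for a generic C compiler: the function names butterfly_complex and butterfly_real are hypothetical, and nothing here represents the instruction set of any particular DSP.

```c
#include <assert.h>
#include <complex.h>

/* Complex form: mirrors X' = X + W*Y and Y' = X - W*Y directly. */
void butterfly_complex(float complex *x, float complex *y, float complex w)
{
    assert(x && y);
    float complex t = w * (*y);   /* the shared product W*Y */
    float complex x0 = *x;        /* keep the old X for the second output */
    *x = x0 + t;
    *y = x0 - t;
}

/* Real form: the same butterfly spelled out on real and imaginary parts. */
void butterfly_real(float *xr, float *xi, float *yr, float *yi,
                    float wr, float wi)
{
    assert(xr && xi && yr && yi);
    float tr = wr * (*yr) - wi * (*yi);   /* TR */
    float ti = wr * (*yi) + wi * (*yr);   /* TI */
    float x0r = *xr, x0i = *xi;
    *xr = x0r + tr;
    *xi = x0i + ti;
    *yr = x0r - tr;
    *yi = x0i - ti;
}
```

Both functions update X and Y in place and should produce the same values for the same inputs. The complex version reads like the two-line textbook expression, while the real version needs the six statements listed above.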
The second implication is that the parallelism and data reuse inherent in the complex arithmetic is lost if the complex arithmetic itself is not directly mapped in hardware. Performance suffers.
Hardware vs. Software
Implementing complex operations directly in the DSP hardware leads to significant architectural savings over simply adding more “real” operators. These savings include a smaller instruction size, in terms of the number of bits required, and lower power consumption. As a result, the limited additional hardware is rewarded by a factor of 4 to 5 in performance improvement over traditional DSP architectures based on the real “Multiply and Accumulate” (MAC) core. Overall, native support for complex mathematics makes code easier to write, simpler, and faster, while also lowering cost and making the hardware architecture more efficient.
The basic idea is to directly map in hardware the fundamental complex algebra operations: multiplication and addition. Given two complex numbers x=a+i·b and y=c+i·d, the complex multiplication operation is:
x·y=(a+i·b)·(c+i·d)= (a·c-b·d) + i(a·d+b·c)
To execute the operation we need 4 ordinary (real) multiplications, one real addition and one real subtraction. To implement complex math natively, the real operations that make up the complex operation must be executed in parallel. The basic complex multiplier therefore consists of a compute structure with 6 real operators, i.e. with high inherent parallelism. Moreover, data already fetched from memory can be used twice to compute the results (see the previous equation), reducing the number of accesses to the registers and to the memory, and thus the power consumption.
Given the same two complex numbers x=a+i·b and y=c+i·d the complex addition operation is:
x+y=(a+c) + i·(b+d)
and in this case we need only two real addition operators. One of the fundamental operations of the DSP is the MAC, which adds the product of two operands to an accumulated sum.
To execute the complex MAC in a single clock cycle, the multiplication and the addition must be performed in parallel. The number of operators available in parallel must therefore be 6+2=8. With a simple extension to the standard adders it is possible to generate the addition and the subtraction of complex numbers simultaneously, providing optimal support for the execution of the FFT. This yields a total execution parallelism of 10 floating point operations per clock cycle. The resulting core structure natively supporting complex signal processing is shown in Figure 2.
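The operator counts above can be checked against a small C sketch that writes the complex MAC out in real arithmetic. The type cplx and the function names are assumptions made for illustration. A generic compiler will execute these operations one after another, whereas the point of the architecture described here is to execute them in parallel.

```c
#include <assert.h>

/* Hypothetical complex type: a pair of real numbers (re, im). */
typedef struct { float re, im; } cplx;

/* Complex MAC, acc + x*y.
 * Product: 4 multiplications, 1 subtraction, 1 addition (6 operators).
 * Accumulation: 2 additions. Total: 8 real operations. */
cplx cmac(cplx acc, cplx x, cplx y)
{
    float pr = x.re * y.re - x.im * y.im;   /* real part of x*y */
    float pi = x.re * y.im + x.im * y.re;   /* imaginary part of x*y */
    cplx r = { acc.re + pr, acc.im + pi };
    return r;
}

/* Sum and difference from one shared product, as needed by the FFT
 * butterfly: 4 multiplications, 2 add/sub for the product, 2 additions
 * and 2 subtractions for the outputs. Total: 10 real operations. */
void cmac_sum_diff(cplx acc, cplx x, cplx y, cplx *sum, cplx *diff)
{
    assert(sum && diff);
    float pr = x.re * y.re - x.im * y.im;
    float pi = x.re * y.im + x.im * y.re;
    sum->re  = acc.re + pr;
    sum->im  = acc.im + pi;
    diff->re = acc.re - pr;
    diff->im = acc.im - pi;
}
```

Note that each of the four inputs a, b, c and d of the product appears twice in the expressions for pr and pi, which is the data reuse mentioned earlier.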
When processing real signals, such as audio signals, an important property of the spectrum is that its samples are conjugate symmetric, that is: X(k)=X*(-k), or explicitly, if X(k)=A+iB then X(-k)=A-iB.
This property is often used while processing audio signals and it requires the ability to perform operations involving the conjugate to compute X*·Y or even the double conjugate of the samples of the Fourier transform to compute X*·Y*. Another example of the usage of these properties is the observation that the inverse Fourier transform of a signal is (apart from a constant factor) the direct Fourier transform, computed using a complex conjugate set of coefficients.
The natural extension of the instruction set of a processor natively supporting complex domain operation is thus the ability to perform conjugate and double conjugate operations offering the most natural way of expressing the processing algorithms.
To minimize application development time, an efficient C compiler is necessary. Complex numbers are so important that the C language has supported complex arithmetic since the C99 standard. In addition, so-called “intrinsics” make it possible to define operations that are not part of the standard, such as operations on conjugate numbers. A C expression that multiplies two complex numbers is written exactly like the multiplication of two real numbers (x*y), so a complex domain algorithm is composed of expressions that look just like those in the textbook, greatly simplifying code writing, debugging and maintenance.
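As a small illustration of this textbook-like syntax, the product X*·Y can be written in standard C99 as follows. The function name cross_spectrum is hypothetical, and the sketch uses only the standard conjf function from complex.h. Vendor-specific intrinsics are not shown, since their names depend on the toolchain.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Computes out[k] = conj(x[k]) * y[k] for every k, i.e. X* . Y. */
void cross_spectrum(const float complex *x, const float complex *y,
                    float complex *out, size_t n)
{
    assert(x && y && out);
    for (size_t k = 0; k < n; k++)
        out[k] = conjf(x[k]) * y[k];   /* reads like the formula */
}
```

With x = 1+2i and y = 3-i, for example, the result is (1-2i)·(3-i) = 1-7i. Whether the compiler maps the expression to a single conjugate-multiply instruction depends on the target and is not guaranteed by the language.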
Moreover, the 10 floating point operators can also be exploited to execute “vectorial” and “scalar” operations. As an example, parallel real operations can be performed on “side 0” and on “side 1” of the operator block (see Figure 1), providing high parallelism for conventional real algebra.
Beyond support for complex computation, computation precision is fundamental in high quality audio applications. The standard 32-bit float is quite often insufficient, while full 64-bit double precision normally exceeds the requirement, implying a loss of performance and a waste of silicon area and power. The best trade-off is an “extended precision” floating point format. The 40-bit format, with a 32-bit significand (or mantissa) and an 8-bit exponent, offers the best compromise between the required precision and the implementation cost in terms of operating frequency, silicon area, and power. The block diagram of the complex domain 40-bit DSP processor is shown in Figure 3.
The advantage of an architecture that combines extended precision floating point with native support for complex math over traditional fixed point architectures is impressive for high quality audio applications. As an example, suppose we want to perform complex multiplication with 32-bit precision on a 16-bit fixed point processor, making the simplifying hypothesis that 32 bits are always wide enough to represent the result. We then need 4 multiplications and 4 additions in 16-bit to replace each floating point multiplication, and 2 additions in 16-bit to replace each floating point addition. The complex MAC, which the complex float architecture performs as a single operation, needs 36 fixed point operations in 16-bit. The advantage of the complex float architecture is even more evident when you take into account that in many real world cases additional shifts are needed. Consider that when computing a 1024-point FFT, the result must be represented with 10 additional bits [log2(1024)=10] if the intermediate results are not shifted. So starting from 24-bit audio from a Blu-ray disc, you need 34 bits to represent the output of the FFT, and thus intermediate shifts must be implemented. The performance improvement of the complex float architecture over fixed point architectures for audio processing is thus on the order of a factor of 40.
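Two of the figures above can be illustrated with a short C sketch. The first function shows why one 32-bit multiplication costs four 16-bit multiplications, by assembling the product from 16-bit halves. The second reproduces the bit growth of an unscaled FFT. Both function names are hypothetical. The first function keeps the full 64-bit unsigned product and performs the final additions in 64-bit for brevity, whereas a real 16-bit processor would add 16-bit words with carry propagation and handle signed or fractional formats.

```c
#include <stdint.h>

/* 32x32 -> 64-bit unsigned product using only 16x16 -> 32-bit multiplies. */
uint64_t mul32_from_16(uint32_t a, uint32_t b)
{
    uint32_t al = a & 0xFFFFu, ah = a >> 16;   /* low and high halves of a */
    uint32_t bl = b & 0xFFFFu, bh = b >> 16;   /* low and high halves of b */

    uint32_t ll = al * bl;   /* four partial products, each fits in 32 bits */
    uint32_t lh = al * bh;
    uint32_t hl = ah * bl;
    uint32_t hh = ah * bh;

    /* Align the partial products and add them up. */
    return ((uint64_t)hh << 32) + ((uint64_t)lh << 16)
         + ((uint64_t)hl << 16) + (uint64_t)ll;
}

/* Bits needed for the output of an unscaled FFT whose size is a power of
 * two: input width plus log2(n_points). */
unsigned fft_output_bits(unsigned input_bits, unsigned n_points)
{
    unsigned growth = 0;
    while (n_points > 1) {   /* integer log2 */
        n_points >>= 1;
        growth++;
    }
    return input_bits + growth;
}
```

For instance, fft_output_bits(24, 1024) returns 34, matching the figure quoted above for 24-bit audio and a 1024-point FFT.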
The advantage of 40-bit extended precision floating point over standard 32-bit floating point can be understood by iterating a simple algorithm that generates a sinusoidal signal. The sinusoid is generated by rotating a complex vector. With 32-bit arithmetic, the “coherency time” (the time taken to reduce the amplitude of the generated sinusoid by a factor of 0.9) is about 2 minutes. With 40-bit arithmetic, the coherency time is around 2 hours. This better performance in generating sinusoids is strictly related to the ability of the processing core to filter low frequency signals with sharp filters, which is a crucial benchmark for high quality audio applications.
Nowadays, however, audio applications require more than raw computing capability. The ability to interface audio devices with other equipment is vital, and the “standard” PC interfaces are increasingly a must-have for high end embedded audio applications. To mention some cases: music synthesizers need a file system to easily store and retrieve the database of files containing the MIDI songs, while a car audio amplifier is connected to the vehicle network and needs software drivers and protocol stacks that are readily available under a standard operating system and difficult to implement as standalone applications. Ethernet connectivity is becoming ubiquitous, as is USB.
Operating systems (OSs), such as embedded Linux or Windows CE, simplify the implementation of user interfaces. Running such an OS requires at least an ARM9-class processor. Thus, the optimal solution for high quality embedded audio applications is a system-on-chip (SoC) that integrates an ARM9 processor and a powerful complex-domain floating point DSP processor, equipped with a generous set of input/output peripherals (Figure 4).
Audio systems supporting the 7.1 standard are increasingly common, so the multichannel audio input and output interfaces must support at least 16 audio channels. This can be achieved by including on the chip a set of configurable Synchronous Serial Controllers (SSCs) that can also work as Time Division Multiplexing (TDM) I2S ports to interface high quality off-chip AD/DA converters (e.g. 24-bit resolution). Configuration ports like the Two Wire Interface (TWI) or the Serial Peripheral Interface (SPI) are required to program the behavior of the audio converters or to interface slower or lower resolution converters (16-bit or less). Adding general purpose interfaces like USB or Ethernet on-chip is also a good idea, given their wide usage in all applications, including audio. It is interesting to note that BMW’s audio engineers have considered replacing specialized buses, like the Controller Area Network (CAN), with Ethernet because of significant cost reductions and performance improvements. Nevertheless, CAN is still considered a must-have for in-car applications. Timers, Parallel Input and Output (PIO) ports and the Universal Synchronous/Asynchronous Receiver-Transmitter (USART) complete the system with support for the most common and well known interfaces used in embedded systems.