According to a report by the World Health Organization (WHO), each year about 1.35 million people die in traffic accidents and another 20 to 50 million are injured. One of the main causes is driver inattention. Consequently, many automotive manufacturers offer driver assistance systems that detect tiredness. But it’s not just microsleep at the wheel that causes accidents. Talking or texting on a smartphone and eating or drinking while driving also pose a high risk. Until now, driver assistance systems have been unable to identify these activities. ARRK Engineering (P+Z Engineering GmbH, Munich, Germany) has therefore conducted a series of tests aimed at automatically recognizing and categorizing mobile phone use and eating or drinking at the wheel. Images were captured with infrared cameras and used to train several Convolutional Neural Network (CNN) models. This created the basis for a driver assistance system that can reliably detect various scenarios at the wheel and warn the driver of hazardous behavior.

For years, the automotive industry has installed systems that warn of driver fatigue. These driver assistants analyze, for example, the driver’s viewing direction and automatically detect deviations from normal driving behavior. “Existing warning systems can only correctly identify specific hazard situations,” according to Benjamin Wagner, Senior Consultant for Driver Assistance Systems at ARRK Engineering. “But during some activities, like eating, drinking, and phoning, the driver’s viewing direction remains aligned with the road ahead.” For that reason, ARRK Engineering ran a series of tests to identify a range of driver postures so that systems can automatically detect the use of mobile phones and eating or drinking. To enable the system to correctly identify all types of visual, manual, and cognitive distraction, ARRK tested various CNN models with deep learning and trained them with the collected data.

Creation of the Image Dataset for Teaching the Systems

In the test setup, two cameras with active infrared lighting were positioned to the left and right of the driver on the A-pillars of a test vehicle. Both cameras ran at 30 Hz and delivered 8-bit grayscale images at 1280 × 1024-pixel resolution. “The cameras were also equipped with an IR long-pass filter to block out most visible-spectrum light at wavelengths under 780 nm,” said Wagner. “In this way, we made sure that the captured light came primarily from the IR LEDs and that full functionality was assured both during the day and at night.” In addition, blocking visible daylight prevented shadow effects in the driver area that might otherwise have led to errors in facial recognition. A Raspberry Pi 3 Model B+ sent a trigger signal to both cameras to synchronize the moment of image capture.
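The article does not describe how the Raspberry Pi trigger was implemented; the following is only a minimal sketch of how such a synchronization pulse could be generated in Python, assuming both cameras expose a hardware trigger input wired to a single GPIO line. The pin number, pulse width, and wiring are illustrative assumptions, not ARRK's setup.

```python
# Minimal trigger sketch: pulse one GPIO line that is wired to the
# external-trigger input of both cameras, so each pulse starts a
# simultaneous exposure on the left and right camera.
# Pin number and pulse width are illustrative assumptions.
import time
import RPi.GPIO as GPIO

TRIGGER_PIN = 18          # BCM pin wired to both cameras' trigger inputs (assumed)
FRAME_RATE_HZ = 30        # matches the 30 Hz capture rate in the test setup
PULSE_WIDTH_S = 0.001     # 1 ms trigger pulse (camera-dependent)

GPIO.setmode(GPIO.BCM)
GPIO.setup(TRIGGER_PIN, GPIO.OUT, initial=GPIO.LOW)

try:
    period = 1.0 / FRAME_RATE_HZ
    while True:
        start = time.monotonic()
        GPIO.output(TRIGGER_PIN, GPIO.HIGH)   # rising edge: both cameras expose
        time.sleep(PULSE_WIDTH_S)
        GPIO.output(TRIGGER_PIN, GPIO.LOW)
        # wait out the remainder of the ~33.3 ms frame period
        time.sleep(max(0.0, period - (time.monotonic() - start)))
except KeyboardInterrupt:
    GPIO.cleanup()
```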

Figures 1a and 1b. In the test setup, two cameras with active infrared lighting were positioned to the left and right of the driver on the A-pillars of a test vehicle. Both cameras ran at 30 Hz and delivered 8-bit grayscale images at 1280 × 1024-pixel resolution. (Image courtesy of ARRK Engineering)

With this setup, images were captured of the postures of 16 test persons in a stationary vehicle. To generate a wide range of data, the test persons differed in gender, age, and headgear, and they used different mobile phone models and consumed different foods and beverages. “We set up five distraction categories to which driver postures could later be assigned: ‘no visible distraction,’ ‘talking on smartphone,’ ‘manual smartphone use,’ ‘eating or drinking,’ and ‘holding food or beverage,’” explained Wagner. “For the tests, we instructed the test persons to switch between these activities during simulated driving.” After capture, the images from the two cameras were categorized and used for model training.
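To make the five-category labeling concrete, here is a brief sketch of how such a per-camera dataset could be organized and loaded for training. The directory layout, the use of torchvision's ImageFolder, and the input resizing are assumptions for illustration and do not describe ARRK's actual pipeline.

```python
# Illustrative layout and loader for the five distraction categories
# (one folder tree per camera; folder names are assumed):
#
#   dataset/left_camera/no_visible_distraction/*.png
#   dataset/left_camera/talking_on_smartphone/*.png
#   dataset/left_camera/manual_smartphone_use/*.png
#   dataset/left_camera/eating_or_drinking/*.png
#   dataset/left_camera/holding_food_or_beverage/*.png
#   dataset/right_camera/...   (same five folders)
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# The cameras deliver 8-bit grayscale 1280 x 1024 frames; here they are
# replicated to 3 channels and resized to a typical CNN input size (assumption).
transform = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

left_train = datasets.ImageFolder("dataset/left_camera", transform=transform)
left_loader = DataLoader(left_train, batch_size=32, shuffle=True, num_workers=4)

print(left_train.classes)   # the five posture categories, sorted alphabetically
```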

Training and Testing the Image Classification Systems

Figures 2a (Left Camera) and 2b (Right Camera). For the experiment, the test persons were instructed to switch among different activities during simulated driving. After capture, the images from the two cameras were categorized and used for model training. (Image courtesy of ARRK Engineering)

Four modified CNN models were used to classify driver postures: ResNeXt-34, ResNeXt-50, VGG-16, and VGG-19. The last two models are widely used in practice, while ResNeXt-34 and ResNeXt-50 contain a dedicated structure for processing parallel paths. To train the system, ARRK ran 50 epochs using the Adam optimizer, an adaptive-learning-rate optimization algorithm. In each epoch, the CNN model had to assign the test persons’ postures to the defined categories. With each step, this categorization was adjusted by gradient descent so that the error rate could be continuously lowered. After model training, a dedicated test dataset was used to calculate the error matrix, which allowed an analysis of the error rate per driver posture for each CNN model. “The use of two cameras, each with a separately trained CNN model, enables ideal differentiation of cases for the left and right sides of the face,” explained Wagner. “Thanks to this process, we were able to identify the system with the best performance in recognizing the use of mobile phones and the consumption of food and beverages.” Evaluation of the results showed that the ResNeXt-34 and ResNeXt-50 models achieved the highest classification accuracy: 92.88 percent for the left camera and 90.36 percent for the right camera. This is competitive with existing solutions for the detection of driver fatigue.
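The following condensed PyTorch sketch illustrates the training and evaluation flow described above for a single camera. Only the 50 epochs, the Adam optimizer, and the per-class error matrix come from the article; the specific torchvision model (resnext50_32x4d stands in, since torchvision does not ship a ResNeXt-34), the learning rate, the loss function, and the data loaders are assumptions.

```python
# Condensed per-camera training/evaluation sketch (assumed PyTorch implementation).
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5
device = "cuda" if torch.cuda.is_available() else "cpu"

# torchvision provides resnext50_32x4d and vgg16; a ResNeXt-34 variant would be custom.
model = models.resnext50_32x4d(weights=None)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)   # 5 posture categories
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate assumed

for epoch in range(50):                      # 50 training epochs, as in the article
    model.train()
    for images, labels in left_loader:       # loader from the dataset sketch above
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()                      # compute gradients
        optimizer.step()                     # Adam update (gradient descent step)

# Error (confusion) matrix on a held-out test set: rows = true class, cols = predicted.
confusion = torch.zeros(NUM_CLASSES, NUM_CLASSES, dtype=torch.long)
model.eval()
with torch.no_grad():
    for images, labels in left_test_loader:  # assumed test DataLoader
        preds = model(images.to(device)).argmax(dim=1).cpu()
        for t, p in zip(labels, preds):
            confusion[t, p] += 1

accuracy = confusion.diag().sum().item() / confusion.sum().item()
print(confusion)
print(f"overall accuracy: {accuracy:.4f}")
```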

Using this information, ARRK has extended its training database, which now contains around 20,000 labeled eye data records. On this basis, it is possible to develop an automated vision-based system for validating driver monitoring systems. ARRK Engineering’s experts are planning a further step to reduce the error rate. “In our next project, besides evaluating different classification models, we will analyze whether integrating the associated object positions from the camera image can achieve further improvements,” said Wagner. Approaches based on bounding box detection and semantic segmentation will be considered, as sketched below. In addition to classification, this will provide different levels of detail about the localization of objects.
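As a purely hypothetical illustration of that direction, one way to combine a posture classifier with object-position information is to concatenate the CNN's image embedding with a normalized bounding box (for example, of a detected smartphone) before the final classification layer. The model structure, box format, and fusion strategy below are assumptions, not ARRK's planned design.

```python
# Hypothetical fusion of image features with a detected object's bounding box.
import torch
import torch.nn as nn
from torchvision import models

class PostureWithObjectPosition(nn.Module):
    def __init__(self, num_classes=5, box_features=4):
        super().__init__()
        backbone = models.resnext50_32x4d(weights=None)
        backbone.fc = nn.Identity()              # keep the 2048-dim image embedding
        self.backbone = backbone
        # classify from image features plus a normalized [x, y, w, h] box (assumed format)
        self.head = nn.Linear(2048 + box_features, num_classes)

    def forward(self, image, box):
        feats = self.backbone(image)
        return self.head(torch.cat([feats, box], dim=1))
```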

This article was written by Benjamin Wagner, Senior Consultant for Driver Assistance Systems at ARRK Engineering (P+Z Engineering GmbH, Munich, Germany). For more information, contact ARRK Engineering.