Growing regulatory focus on protecting vulnerable road users (VRUs) from vehicle collisions at night has prompted new evaluations of proven imaging modalities that might quickly, effectively, and economically identify VRUs and measure their positions relative to moving vehicles.
Using just one thermal infrared camera and carefully trained convolutional neural networks (CNNs), the Thermal Ranger® system from Owl Autonomous Imaging (Fairport, NY) can locate and classify VRUs in the dark from their thermal signatures. Thermal imaging is well-suited for imaging after dark because it sees objects using their own emitted infrared energy rather than relying on illumination provided by streetlights or headlights. Further, thermal infrared passes through rain, fog, and other obscurants relatively unimpeded, providing useful images even in degraded conditions.
Imaging at Night
However, detection alone is not enough to support decision-making: each VRU must be classified by type and tagged with its distance from the vehicle. To accomplish both functions, the Owl system applies a set of CNNs to a single thermal image, extracting all the information required for automatic emergency braking decisions.
Detection, Recognition, and Identification (DRI)
The images supplied to the CNN must contain sufficient detail to allow extraction of the desired information. The level of information available from an image depends on how many sample points (pixels) span the important objects. Figure 1 shows the three levels that have been typically assigned — detection, recognition, and identification — based on conclusions drawn by a human observer using a video display. When the observer is a CNN, a fourth category becomes most important — classification — which is between recognition and identification, corresponding to about 10 pixels per meter. This resolution is sufficient to allow the CNN to discriminate between adults and children and to determine the pose of the pedestrian.
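The relationship between range and pixel coverage can be estimated with simple geometry. The sketch below assumes an ideal pinhole camera; the 640-pixel sensor width and 30-degree horizontal field of view are hypothetical illustration values, not published Thermal Ranger specifications, and the function names are invented for this example.

```python
import math

def pixels_per_meter(h_pixels: int, hfov_deg: float, range_m: float) -> float:
    """Horizontal sampling density (pixels per meter) on a target at range_m."""
    # Width of the scene covered by the full sensor row at this range.
    scene_width_m = 2.0 * range_m * math.tan(math.radians(hfov_deg) / 2.0)
    return h_pixels / scene_width_m

def max_range_for_density(h_pixels: int, hfov_deg: float, required_ppm: float) -> float:
    """Largest range at which the target still receives required_ppm pixels per meter."""
    return h_pixels / (2.0 * required_ppm * math.tan(math.radians(hfov_deg) / 2.0))

if __name__ == "__main__":
    # Hypothetical camera: 640 pixels across a 30-degree horizontal field of view.
    print(pixels_per_meter(640, 30.0, 50.0))        # about 24 px/m at 50 m
    print(max_range_for_density(640, 30.0, 10.0))   # about 119 m for 10 px/m
```

Under these assumed parameters, the roughly 10 pixels per meter cited for classification would be available out to about 119 meters; a wider field of view or a lower-resolution sensor shortens that range proportionally.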
CNN Fundamentals
Convolutional neural networks are computer simulations of groups of neurons that can be trained to indicate when they see objects that correlate strongly with marked objects in a set of training images. Figure 2 shows typical training images with the objects to be detected indicated in color. The coloring is done manually by experienced technicians to assure that reliable image data is presented during training. Providing the CNN with labels to attach to training objects is called supervision.
When training starts, the CNN knows nothing, so its accuracy is very low, and its failure rate (loss) is very high. With additional training, the CNN improves in accuracy and the loss drops. Typically, the training data set is presented to the CNN many times, in sessions known as epochs, to reinforce the selectivity of the CNN.
However, it is possible that continued repetition of the training data can cause the CNN to recognize only those new images that very closely resemble the training images. This phenomenon, called overfitting, results from the CNN being shown the training images too many times. Attempting to make the CNN perform perfectly during training almost always leads to additional loss with new images. Care must be taken to stop training when the CNN operation is near its optimum.
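One common way to decide when to stop is early stopping: hold back a set of validation images that are never used for training, track the loss on them after each epoch, and stop once that loss has failed to improve for a fixed number of epochs. The article does not say which stopping rule Owl uses, so the sketch below is only an illustration of the general technique; the function name, the `patience` parameter, and the loss values are all hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops at the first epoch that is `patience` epochs past the best
    one without an improvement larger than `min_delta`.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Validation loss improved: remember this epoch's weights.
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: likely overfitting.
            return best_epoch, epoch
    # Ran out of epochs without triggering the stop condition.
    return best_epoch, len(val_losses) - 1

if __name__ == "__main__":
    # Hypothetical validation losses: falling at first, then rising again.
    losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.44, 0.47, 0.50]
    print(early_stopping_epoch(losses))  # (3, 6)
```

In this made-up run the training loss might still be falling after epoch 3, but the rising validation loss shows that the network is starting to memorize the training images, so the weights saved at epoch 3 would be kept.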
Training the Thermal Ranger System
The Thermal Ranger system needs two types of scene information to formulate its report on the scene contents: the classification of each VRU together with its location in the image, and the distance from the camera to each classified object.
To begin the process, a CNN generates a map of the entire image, extracting at each pixel a value called “inverse depth,” which is inversely proportional to the distance of the object imaged in that pixel. To assure that the distances will be accurately reported in the final result, the CNN is trained on images having significant content within the range of useful distances where the true distance to this content is known.
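Because inverse depth is inversely proportional to distance, converting it to meters requires only a scale factor, which can be estimated from pixels whose true distance is known. The sketch below assumes the simplest possible model, distance = scale / inverse depth, fitted by least squares; the function names and sample values are hypothetical, and the actual Owl calibration procedure is not described in the article.

```python
def fit_scale(inv_depths, true_distances_m):
    """Least-squares estimate of `scale` in the model distance = scale / inv_depth."""
    xs = [1.0 / v for v in inv_depths]
    numerator = sum(d * x for d, x in zip(true_distances_m, xs))
    denominator = sum(x * x for x in xs)
    return numerator / denominator

def inverse_depth_to_meters(inv_depth_map, scale, eps=1e-6):
    """Convert a 2D inverse-depth map (list of rows) into distances in meters."""
    # Values at or near zero correspond to content effectively at infinity.
    return [
        [scale / v if v > eps else float("inf") for v in row]
        for row in inv_depth_map
    ]

if __name__ == "__main__":
    # Hypothetical calibration pixels with surveyed distances.
    scale = fit_scale([0.5, 0.25, 0.1], [4.0, 8.0, 20.0])
    print(scale)
    print(inverse_depth_to_meters([[0.5, 0.0], [0.1, 0.04]], scale))
```

A real system may need a more elaborate mapping (for example, an offset term or per-camera calibration); the single scale factor here is only the direct reading of "inversely proportional."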
Another CNN is trained on images of the objects to be classified so that bounding boxes can be placed around them to assign positions in the picture. Then the content of the bounding boxes can be segmented into pixels that cover the object and pixels that show the background.
The outputs of the two CNNs are then combined to assign distances to the pixels representing just the object. The result is a report on each object that contains its class, its position in the image, and its distance from the camera.
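The combination step can be sketched as follows: take the depth values only at pixels that the segmentation mask marks as belonging to the object, then reduce them to a single distance. The use of the median as the reducing statistic is an assumption made here for robustness against stray pixels, not a detail from the article, and the `object_report` function and its dictionary keys are hypothetical.

```python
from statistics import median

def object_report(label, box, mask, depth_map_m):
    """Build a per-object report from a class label, a box, a mask, and a depth map.

    box is (x0, y0, x1, y1) in pixels, with x1 and y1 exclusive.
    mask and depth_map_m are 2D lists indexed [row][column]; mask entries are
    truthy where the pixel belongs to the object.
    """
    x0, y0, x1, y1 = box
    # Keep only depth samples that lie on the object, not on the background.
    samples = [
        depth_map_m[y][x]
        for y in range(y0, y1)
        for x in range(x0, x1)
        if mask[y][x]
    ]
    distance_m = median(samples) if samples else None
    return {"class": label, "box": box, "distance_m": distance_m}

if __name__ == "__main__":
    # Tiny hypothetical scene: a pedestrian near 12 m against a 40 m background.
    depth = [
        [40.0, 40.0, 40.0, 40.0],
        [40.0, 12.0, 13.0, 40.0],
        [40.0, 12.0, 40.0, 40.0],
    ]
    mask = [
        [0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
    ]
    print(object_report("ped", (1, 1, 3, 3), mask, depth))
```

The example shows why segmentation matters: the bounding box contains one background pixel at 40 meters, which would bias a plain average over the box but is excluded by the mask.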
What the Thermal Ranger System Needs to See
It is desirable to have a system capable of recognizing and locating VRUs in clear as well as degraded visual environments, day and night. Recognizing and locating are two quite different functions having a certain hierarchy. The order of events goes something like this:
1. Accept raw data from a camera and perform any required normalization so that all objects fall within the analyzer dynamic range at all image locations.
2. Apply a CNN to recognize all objects of interest and identify them according to type; this process is commonly referred to as classification.
3. Using the classification data, assign the locations for the sides of a two-dimensional bounding box and a class to each object.
4. Simultaneously with step 2, apply another CNN to produce a range map for the entire image, assigning to every pixel in the image a value representing the distance from the camera to the object imaged by that pixel.
5. Convert the values determined by the range CNN into distances in real units, typically meters.
6. Combine the information from the two CNNs to add a third dimension (depth) to the bounding boxes and assign a distance to the nearest face of the bounding box.
7. Use color to code the depth map for the entire image to produce an informative picture for display.
8. Assemble the type, location, size, and range of all objects of interest in the image for reporting to the equipment that will take appropriate action.
Figure 3 shows a block diagram of a sample implementation of this sequence.
This entire set of processes combining CNN and conventional computation is called an inference pipeline (IP). Partitioning the functions between the CNN and conventional sections is directed toward maximizing accuracy and reliability of the results. Since the Thermal Ranger system is intended for deployment on real vehicles in real situations, the entire IP is capable of field updating whenever better training data and conversion algorithms become available.
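As an example of the conventional (non-CNN) part of such a pipeline, the first step in the sequence above, normalization, might look like the sketch below. It assumes a simple percentile-clipping scheme that maps raw sensor counts into the range 0 to 1; the article does not specify the normalization Owl actually uses, and the function name, percentile defaults, and sample counts are hypothetical.

```python
def normalize_frame(raw, low_pct=1.0, high_pct=99.0):
    """Map raw sensor counts (2D list of numbers) into [0.0, 1.0].

    Counts below the low percentile clip to 0.0 and counts above the high
    percentile clip to 1.0, so a few very hot or very cold pixels do not
    compress the useful part of the dynamic range.
    """
    flat = sorted(v for row in raw for v in row)

    def percentile(p):
        # Nearest-rank percentile on the sorted pixel values.
        return flat[round((p / 100.0) * (len(flat) - 1))]

    lo, hi = percentile(low_pct), percentile(high_pct)
    span = max(hi - lo, 1)  # avoid division by zero on a uniform frame
    return [
        [min(max((v - lo) / span, 0.0), 1.0) for v in row]
        for row in raw
    ]

if __name__ == "__main__":
    # Hypothetical 2 x 2 frame of raw thermal counts.
    print(normalize_frame([[1000, 2000], [3000, 4000]]))
```

Keeping a step like this outside the CNNs illustrates the partitioning described above: it is deterministic, inexpensive, and can be updated in the field without retraining.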
The Owl Thermal Ranger CNN in Action
Three steps in the classification and ranging process are illustrated in Figure 4. The original monocular thermal image provided to the IP is shown at the lower right. In the center is a bird’s eye view of the point clouds representing target objects extracted from the image. This data can be supplied to an automatic braking system when a hazard is detected. At the upper left is the driver view representation of the output image showing the recognized objects with bounding boxes and range labels set in their natural surroundings.
Figure 5 is an expansion of the driver view image from the upper left of Figure 4. In this view, the pedestrian thermal images can be seen with their associated 2D bounding boxes and range labels. Notice, on the left, the distinction in range between two pedestrians walking together. In this particular example, the software labels pedestrian ranges up to 50 meters with specific values while pedestrians outside that range are labeled as pedestrians (ped) but do not have range labels. Note that the automobile headlights in the distance are not found to be pedestrians and that other automobiles are detected and marked but not ranged.
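The labeling behavior described for this example can be expressed as a small rule. The sketch below follows the text (pedestrians within 50 meters receive a range value, more distant pedestrians and other classes receive only a class tag); the label format, the `overlay_label` function, and the `"car"` class name are hypothetical and chosen only for illustration.

```python
MAX_LABEL_RANGE_M = 50.0  # range limit quoted for this particular example

def overlay_label(obj_class, distance_m, max_range_m=MAX_LABEL_RANGE_M):
    """Return the text drawn next to a bounding box in the driver view."""
    is_ranged_ped = (
        obj_class == "ped"
        and distance_m is not None
        and distance_m <= max_range_m
    )
    if is_ranged_ped:
        # Pedestrian inside the labeled range: class tag plus distance.
        return f"ped {distance_m:.0f} m"
    # Distant pedestrians and non-pedestrian classes: class tag only.
    return obj_class

if __name__ == "__main__":
    print(overlay_label("ped", 23.4))  # ped 23 m
    print(overlay_label("ped", 72.0))  # ped
    print(overlay_label("car", 18.0))  # car
```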
Using thermal images, CNNs can provide critical information to automatic braking systems and to drivers both day and night to help reduce pedestrian accidents.
This article was written by Wade Appelman, Chief Business Officer at Owl Autonomous Imaging.