Whether it is the industrial smart robot in the age of the Industrial Internet of Things (IIoT), using three-dimensional data to orient itself in its working space, the reverse vending machine counting empty bottles in a case, or the surface inspection system alerting personnel to the smallest material defect, three-dimensional information acquired by modern 3D sensors from the environment will be central to many industrial applications of the future.
Currently, a variety of technologies are available for collecting three-dimensional information from a scene. The critical distinction among them is between active and passive techniques. Active techniques such as lidar (light detection and ranging) or time-of-flight (ToF) sensors use an active light source in order to provide distance information; passive techniques rely solely upon the camera-acquired image data, similar to depth perception in human visual systems.
Each of the techniques has advantages and disadvantages. While time-of-flight systems as a rule use less computational power and have few limitations in terms of scene structure, the maximum spatial resolution of current ToF systems (800 x 600 pixels) is relatively low, and their outdoor use is very limited due to infrared radiation from the sun. Passive multi-view stereo vision systems, by contrast, achieve very high spatial resolution thanks to newer sensors on the market, but they are processor-intensive and perform poorly when confronted with low-contrast or repetitive textures. Nevertheless, today's computational resources as well as optional pattern projectors make real-time operation of stereo systems at high spatial and depth resolutions possible. Precisely for this reason, passive multi-view stereo systems are among the most popular and flexible systems for the acquisition of 3D information.
Multi-view stereo systems use two or more cameras that simultaneously record data from a scene. When the cameras are calibrated and a real-world point in the scene can be located in the images of at least two cameras, its three-dimensional position can be reconstructed from the corresponding pixels via triangulation. The highest level of precision that can be obtained depends on the distance between the cameras (baseline), the convergence angle between the cameras, the sensor's pixel size, and the focal length. The essential steps of calibration and correspondence matching make great demands on the underlying image processing algorithms.
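How baseline, pixel size, and focal length affect precision can be illustrated with a minimal sketch. It assumes the simplest configuration, two identical cameras with parallel optical axes (a convergence angle of zero), for which depth follows from Z = f * B / d with the focal length f in pixels, the baseline B, and the pixel offset d between the two views. The function names and the example values are hypothetical and chosen only for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_mm, baseline_mm, pixel_size_mm):
    """Depth in mm of a point whose two image positions differ by disparity_px pixels.

    Assumes two identical, parallel cameras (convergence angle of zero).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    focal_length_px = focal_length_mm / pixel_size_mm
    return focal_length_px * baseline_mm / disparity_px


def depth_resolution(depth_mm, focal_length_mm, baseline_mm, pixel_size_mm,
                     disparity_step_px=1.0):
    """Approximate depth change in mm caused by a disparity change of disparity_step_px.

    First-order approximation: dZ = Z^2 / (f * B) * dd, with f in pixels.
    """
    focal_length_px = focal_length_mm / pixel_size_mm
    return depth_mm ** 2 / (focal_length_px * baseline_mm) * disparity_step_px


if __name__ == "__main__":
    # Hypothetical setup: 8 mm lens, 5 micrometer pixels, 100 mm baseline.
    z = depth_from_disparity(160.0, 8.0, 100.0, 0.005)
    print("depth [mm]:", z)
    print("depth step per pixel of disparity [mm]:", depth_resolution(z, 8.0, 100.0, 0.005))
```

Under these assumptions the depth step grows with the square of the distance and shrinks with a longer baseline, a longer focal length, or smaller pixels, which is why these quantities bound the achievable precision.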
Stereo Vision Systems in Real-time Use
Camera calibration determines the position and orientation of the individual cameras (the external, or extrinsic, parameters) as well as the focal length, principal point, and distortion parameters (the internal, or intrinsic, parameters), which are significantly influenced by the selected lenses.
Camera calibration is usually performed with a two-dimensional calibration pattern, such as a checkerboard or a dot grid, in which control points can be detected easily and unambiguously. The dimensions of the calibration pattern, such as the distances between control points, are precisely known. A sequence of images of the calibration pattern is then recorded, with the position and orientation of the pattern varying from image to image. Image processing algorithms detect the control points of the calibration pattern in the individual images: edge and corner detection algorithms serve as the basis for a checkerboard pattern, for example, and blob detection algorithms for a dot pattern. This yields a multitude of 3D-2D correspondences between the calibration object and the individual images. Based on these correspondences, an optimization process delivers the camera parameters.
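The 3D-2D correspondences and the quantity such an optimization commonly minimizes, the reprojection error, can be sketched as follows. This is an illustration under simplifying assumptions, not a calibration routine: the pattern is only translated relative to the camera (no rotation), lens distortion is reduced to a single radial coefficient k1, and all function and parameter names are hypothetical.

```python
import math


def board_points(rows, cols, spacing_mm):
    """Control points of a planar pattern in its own frame (Z = 0), in mm."""
    return [(c * spacing_mm, r * spacing_mm, 0.0)
            for r in range(rows) for c in range(cols)]


def project(point, offset, fx, fy, cx, cy, k1=0.0):
    """Project one pattern point to pixel coordinates with a pinhole model.

    offset: pattern origin in the camera frame (translation only, an assumption).
    k1: single radial distortion coefficient (simplified distortion model).
    """
    x, y, z = (p + t for p, t in zip(point, offset))
    xn, yn = x / z, y / z                     # normalized image coordinates
    scale = 1.0 + k1 * (xn * xn + yn * yn)    # radial distortion factor
    return fx * xn * scale + cx, fy * yn * scale + cy


def rms_reprojection_error(points, observed, offset, fx, fy, cx, cy, k1=0.0):
    """Root-mean-square distance in pixels between predicted and detected points."""
    total = 0.0
    for point, (u_obs, v_obs) in zip(points, observed):
        u, v = project(point, offset, fx, fy, cx, cy, k1)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(total / len(points))


if __name__ == "__main__":
    pts = board_points(3, 4, 25.0)
    pose = (-40.0, -30.0, 500.0)
    # Synthetic "detections" generated from known parameters.
    detected = [project(p, pose, 1600.0, 1600.0, 640.0, 480.0, k1=-0.1) for p in pts]
    print(rms_reprojection_error(pts, detected, pose, 1600.0, 1600.0, 640.0, 480.0, k1=-0.1))
    print(rms_reprojection_error(pts, detected, pose, 1500.0, 1500.0, 640.0, 480.0, k1=-0.1))
```

A real calibration would search over the camera parameters and the full pose of every pattern view to drive this error toward its minimum; here the error is only evaluated for a correct and an incorrect focal length.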
While the calibration is run only once (assuming the camera parameters do not change during system operation), the significantly more processor-intensive task of finding correspondences between the views must be carried out for every set of images in order to deliver the scene's 3D information. In the case of a stereo system, correspondences between two views are identified. In preprocessing, the images are usually undistorted according to the distortion parameters. For each pixel in the reference image, the corresponding point in the target image, that is, the point representing the same 3D coordinate in the observed scene, is then searched for. Assuming Lambertian reflectance (i.e. a perfectly diffuse surface), the local regions around corresponding points in the reference and target images should be very similar. A similarity measure is therefore computed between the reference region and each candidate target region. Several such measures exist; normalized cross-correlation is a well-established one.
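As an illustration of such a similarity measure, the following sketch computes the zero-mean variant of normalized cross-correlation for two image regions given as flat lists of gray values. The function name and the handling of textureless regions (returning 0.0) are assumptions made for this example.

```python
import math


def ncc(region_a, region_b):
    """Zero-mean normalized cross-correlation of two equally sized regions.

    Returns a value in [-1, 1]; 1 means identical up to brightness gain and offset.
    """
    if len(region_a) != len(region_b) or not region_a:
        raise ValueError("regions must be non-empty and of equal size")
    mean_a = sum(region_a) / len(region_a)
    mean_b = sum(region_b) / len(region_b)
    num = sum((a - mean_a) * (b - mean_b) for a, b in zip(region_a, region_b))
    var_a = sum((a - mean_a) ** 2 for a in region_a)
    var_b = sum((b - mean_b) ** 2 for b in region_b)
    den = math.sqrt(var_a * var_b)
    if den == 0.0:
        return 0.0  # assumption: a region without texture carries no evidence
    return num / den


if __name__ == "__main__":
    reference = [10, 20, 30, 40]
    brighter = [30, 50, 70, 90]   # same structure, different gain and offset
    print(ncc(reference, brighter))
```

Because the mean is subtracted and the result is normalized, the score is insensitive to differences in gain and offset between the two cameras, which is one reason this measure is popular.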
Not every pixel in the target image needs to be examined: geometrically, all potential corresponding points lie on a single line in the undistorted target view, the so-called epipolar line. Correspondences need only be searched for along these epipolar lines. To accelerate the search further, the undistorted input images are often rectified, that is, transformed so that all corresponding epipolar lines share the same vertical image coordinate. Accordingly, for any given point in the reference image, one need only search along the row with the same vertical coordinate when looking for correspondences in the target image. While the algorithmic complexity of the search remains the same, the prior rectification allows for a more efficient implementation. Furthermore, if the minimum and maximum working distances of the scene are known, the search along the epipolar lines can be restricted to the corresponding range of pixel offsets, which accelerates the process further.
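How known working distances narrow the search can be sketched for a rectified pair of identical, parallel cameras, for which the pixel offset along the row is d = f * B / Z (focal length f in pixels, baseline B, distance Z). The function name and the numbers in the usage example are hypothetical.

```python
import math


def offset_search_range(focal_length_px, baseline_mm, z_min_mm, z_max_mm):
    """Smallest and largest pixel offset that need to be tested along a row.

    Assumes a rectified pair of identical, parallel cameras.
    The nearest point produces the largest offset, the farthest the smallest.
    """
    if not 0 < z_min_mm <= z_max_mm:
        raise ValueError("expected 0 < z_min_mm <= z_max_mm")
    d_max = focal_length_px * baseline_mm / z_min_mm
    d_min = focal_length_px * baseline_mm / z_max_mm
    # Round outward so that the true offset is never excluded.
    return math.floor(d_min), math.ceil(d_max)


if __name__ == "__main__":
    # Hypothetical setup: 1600 px focal length, 100 mm baseline, scene at 0.5 m to 2 m.
    print(offset_search_range(1600.0, 100.0, 500.0, 2000.0))  # (80, 320)
```

In this example only offsets from 80 to 320 pixels have to be evaluated per reference pixel instead of the full image width.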
Once all candidate target regions along the epipolar line have been compared with the reference region, the target region with the greatest similarity is, as a rule (in the case of local stereo algorithms), selected as the final correspondence. When the correspondence search is complete, every pixel of the reference image for which a clear correspondence has been found carries distance information in the form of the disparity, that is, the offset in pixels along the epipolar line in a rectified stereo vision system. The result is the disparity image, or disparity map.
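The selection of the best match per pixel can be sketched for a single rectified image row. For brevity, this example swaps in the sum of absolute differences (SAD) as a simple cost, where the lowest cost plays the role of the greatest similarity. It further assumes that the reference image comes from the left camera, so that matches in the target row lie at x - d. All names are hypothetical.

```python
def match_row(reference_row, target_row, half_window, d_min, d_max):
    """Winner-takes-all disparity for each pixel of one rectified image row.

    Returns a list with one disparity per pixel, or None where the window
    does not fit or no candidate lies inside the target row.
    """
    width = len(reference_row)
    disparities = [None] * width
    for x in range(half_window, width - half_window):
        ref = reference_row[x - half_window:x + half_window + 1]
        best_d, best_cost = None, None
        for d in range(d_min, d_max + 1):
            xt = x - d  # assumption: reference = left view, target = right view
            if xt - half_window < 0 or xt + half_window >= width:
                continue
            tgt = target_row[xt - half_window:xt + half_window + 1]
            cost = sum(abs(a - b) for a, b in zip(ref, tgt))  # SAD cost
            if best_cost is None or cost < best_cost:
                best_d, best_cost = d, cost
        disparities[x] = best_d
    return disparities


if __name__ == "__main__":
    reference = [0, 0, 0, 0, 5, 9, 5, 0, 0, 0]
    target = [0, 0, 5, 9, 5, 0, 0, 0, 0, 0]   # same structure, shifted by 2 pixels
    print(match_row(reference, target, half_window=1, d_min=0, d_max=3))
```

In flat regions of the row several candidates share the same cost and the first one tested wins, which illustrates why local methods struggle with low-contrast textures; a practical system would additionally check whether the best match is clearly better than the alternatives.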