Limited computing power, high prices, and imprecise results put the brakes on early 3D systems in many applications. Thanks to improvements in computer performance and high-resolution sensors, however, the technology is finding its way into more and more applications.

Whether it is an industrial smart robot in the age of the Industrial Internet of Things (IIoT) using three-dimensional data to orient itself in its working space, a reverse vending machine counting empty bottles in a case, or a surface-inspection system alerting personnel to the smallest material defect, three-dimensional information acquired from the environment by modern 3D sensors will be central to many industrial applications of the future.

Currently, a variety of technologies are available for collecting three-dimensional information from a scene. A critical distinction among them, however, is between active and passive techniques. Active techniques such as lidar (light detection and ranging) or time-of-flight (ToF) sensors use an active light source to obtain distance information; passive techniques, in contrast, rely solely upon camera-acquired image data, similar to depth perception in the human visual system.

Each technique has advantages and disadvantages. While time-of-flight systems generally require less computational power and place few restrictions on scene structure, the maximum spatial resolution of current ToF systems (800 x 600 pixels) is relatively low, and infrared radiation from the sun severely limits their outdoor use. Newer sensors on the market enable passive multi-view stereo vision systems with very high spatial resolution, but such systems are processor-intensive and perform poorly when confronted with low-contrast or repetitive textures. Nevertheless, today's computational resources, together with optional pattern projectors, make real-time operation of stereo systems at high spatial and depth resolutions possible. Precisely for this reason, passive multi-view stereo systems are among the most popular and flexible systems for the acquisition of 3D information.

Multi-view stereo systems use two or more cameras that simultaneously record a scene. If the cameras are calibrated and a real-world point in the scene can be located as a pixel in each camera's image, the point's three-dimensional position can be reconstructed from those pixels via triangulation. The achievable precision depends on the distance between the cameras (the baseline), the convergence angle between the cameras, the sensor's pixel size, and the focal length. The essential steps of calibration and correspondence matching place great demands on the underlying image-processing algorithms.
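
For a rectified, parallel stereo pair this relationship can be made concrete: a point at depth Z imaged with disparity d satisfies Z = f · B / d, where f is the focal length in pixels and B is the baseline. The short Python sketch below is an illustration only (not part of any vendor SDK), and all parameter values are assumptions; it shows how depth and the achievable depth resolution follow from these quantities.

```python
# Illustrative sketch: depth and depth resolution for an idealized,
# rectified stereo pair. All parameter values are assumptions.

def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth Z = f * B / d for a rectified, parallel stereo pair."""
    return focal_length_px * baseline_m / disparity_px

def depth_resolution(depth_m, focal_length_px, baseline_m, disparity_step_px=1.0):
    """Approximate depth change caused by a one-step change in disparity:
    dZ ~ Z^2 / (f * B) * dd."""
    return depth_m ** 2 / (focal_length_px * baseline_m) * disparity_step_px

# Example: 8 mm lens with 3.45 um pixels -> f ~ 2319 px; 10 cm baseline
f_px = 0.008 / 3.45e-6
baseline = 0.10
z = depth_from_disparity(100.0, f_px, baseline)   # ~ 2.3 m for 100 px disparity
dz = depth_resolution(z, f_px, baseline)          # depth error per pixel of disparity
print(f"depth: {z:.2f} m, resolution at that depth: {dz*1000:.1f} mm")
```

The quadratic growth of the depth error with distance explains why larger baselines or longer focal lengths are needed for precise measurements at greater working distances.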

Example detection results for a calibration pattern in various positions and orientations. From the detected control points of the calibration pattern, the camera's internal and external parameters can be determined.

Stereo Vision Systems in Real-time Use

Through camera calibration, the position and orientation of the individual cameras can be determined (external parameters) as well as the focal length, principal point, and distortion parameters (internal parameters), which are significantly influenced by the selected lenses.

Camera calibration is usually performed using a two-dimensional calibration pattern, such as a checkerboard or a dot grid, in which control points can be detected easily and unambiguously. The dimensions of the calibration pattern, such as the distances between control points, are precisely known. Image sequences of the calibration pattern are then acquired with the pattern in varying positions and orientations. Image-processing algorithms detect the control points of the calibration pattern in the individual images: edge and corner detection algorithms serve as the basis for a checkerboard pattern, for example, and blob detection algorithms for a dot pattern. This yields a multitude of 3D-2D correspondences between the calibration object and the individual images. Based on these correspondences, an optimization process delivers the camera parameters.
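
As an illustration of this procedure, the following sketch uses OpenCV rather than the SDK discussed later in this article; the checkerboard geometry and file names are assumptions. It detects the control points in each calibration image and lets an optimization routine estimate the internal parameters and the per-view external parameters.

```python
# Minimal single-camera calibration sketch using OpenCV (illustrative only);
# board geometry and file names are assumptions.
import glob
import cv2
import numpy as np

pattern_size = (9, 6)     # inner corners of the checkerboard
square_size = 0.025       # 25 mm squares, known from the printed pattern

# 3D control points of the pattern in its own coordinate system (z = 0 plane)
objp = np.zeros((pattern_size[0] * pattern_size[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern_size[0], 0:pattern_size[1]].T.reshape(-1, 2) * square_size

obj_points, img_points = [], []
for fname in glob.glob("calib_*.png"):
    gray = cv2.imread(fname, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern_size)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)      # 3D-2D correspondences for this view
        img_points.append(corners)

# Optimization over all views yields the internal parameters and,
# per view, the external parameters (rotation and translation).
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("reprojection error (px):", rms)
```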

While calibration is run only once (assuming the camera parameters do not change during system operation), the significantly more processor-intensive task of finding correspondences between the views must be carried out for each image in order to deliver the scene's 3D information. In the case of a stereo system, correspondences between two views are identified. In preprocessing, the images are usually undistorted according to the internal distortion parameters. For a pixel in the reference image, a search is made for the corresponding point in the target image that represents the same 3D point in the observed scene. Assuming Lambertian reflectance (i.e. a perfectly diffuse surface), local regions around corresponding points in the reference and target images should be very similar. A similarity measure is therefore computed between the reference region and each candidate target region; the normalized cross-correlation is one well-established such measure.
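
A minimal sketch of such a similarity measure, assuming two equally sized grayscale patches given as NumPy arrays, might look as follows; the mean subtraction and normalization make the score largely insensitive to brightness and gain differences between the two cameras.

```python
# Illustrative normalized cross-correlation (NCC) between two equally sized
# image patches; values near 1 indicate high similarity.
import numpy as np

def ncc(patch_ref, patch_tgt, eps=1e-9):
    a = patch_ref.astype(np.float64) - patch_ref.mean()
    b = patch_tgt.astype(np.float64) - patch_tgt.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))
```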

Correspondence Points

Not all of the target image needs to be searched: epipolar geometry constrains the potential corresponding points to a line in the target image, the so-called epipolar line. Correspondences therefore need only be searched for along these epipolar lines. To accelerate the search further, the undistorted input images are often rectified: they are transformed so that corresponding epipolar lines share the same vertical image coordinate. For any given point in the reference image, one then need only search along the line with the same vertical coordinate in the target image. While the algorithmic complexity of the search remains the same, rectification allows a more efficient implementation of the correspondence search. Furthermore, if the minimum and maximum working distances of the scene are known, the search range along the epipolar lines can be restricted even further, accelerating the process.
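
With OpenCV, rectification might be sketched as follows; the camera matrices, distortion coefficients, and the relative rotation and translation (K1, D1, K2, D2, R, T) are assumed to come from a prior stereo calibration, and the image size is an assumption.

```python
# Rectification sketch with OpenCV: after stereo calibration, warp both views
# so that corresponding epipolar lines share the same image row.
# K1, D1, K2, D2, R, T are assumed to come from cv2.stereoCalibrate.
import cv2

image_size = (2448, 2048)   # (width, height) of the cameras, an assumption

R1, R2, P1, P2, Q, roi1, roi2 = cv2.stereoRectify(K1, D1, K2, D2, image_size, R, T)

map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, image_size, cv2.CV_32FC1)
map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, image_size, cv2.CV_32FC1)

rect_left = cv2.remap(img_left, map1x, map1y, cv2.INTER_LINEAR)
rect_right = cv2.remap(img_right, map2x, map2y, cv2.INTER_LINEAR)
```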

Once all candidate regions along the epipolar line have been compared with the reference region, the region with the greatest similarity is, as a rule (in the case of local stereo algorithms), selected as the final correspondence. When the correspondence search is complete and a unique correspondence has been found, every pixel of the reference image in a rectified stereo vision system carries distance information in the form of a disparity, i.e. the offset in pixels along the epipolar line. The result is the disparity image, or disparity map.
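
A sketch of this step using OpenCV's semi-global block matcher is shown below; it is not the local NCC matcher described above, but it illustrates how a known working-distance range translates into a restricted disparity search range. The focal length, baseline, and working distances are assumed values, and rect_left/rect_right refer to the rectified pair from the previous sketch.

```python
# Disparity estimation sketch with OpenCV on the rectified image pair.
# The disparity range is derived from the known minimum/maximum working
# distance (values here are assumptions).
import cv2

f_px, baseline = 2319.0, 0.10          # from calibration (assumed values)
z_min, z_max = 0.5, 3.0                # known working distances in meters
d_max = int(f_px * baseline / z_min)   # largest expected disparity
d_min = int(f_px * baseline / z_max)   # smallest expected disparity

matcher = cv2.StereoSGBM_create(
    minDisparity=d_min,
    numDisparities=((d_max - d_min + 15) // 16) * 16,  # must be a multiple of 16
    blockSize=9,
)
# StereoSGBM returns fixed-point disparities scaled by 16
disparity = matcher.compute(rect_left, rect_right).astype("float32") / 16.0
```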

Above: original image pair from The Imaging Source's stereo vision system. Below: rectified image pair. For a point in the reference image (below, left), a corresponding point need only be searched for along the same image line in the target image (shown lower right as a red line for demonstration purposes).

With the help of the previously calibrated internal and external parameters, the disparities can in turn be converted into actual metric distance information. If the distance is calculated for every point where a disparity could be estimated, the result is a three-dimensional model in the form of what is known as a point cloud. In the case of low-contrast or repetitive patterns in a scene, local stereo techniques can deliver less reliable disparity estimates, since many points in the target view will have a low uniqueness value. Global stereo techniques can help in such cases, but they are considerably more processor-intensive because they place additional demands on the final disparity map in the form of a smoothness constraint that penalizes discontinuities. It is often easier to project an artificial structure onto the object to make the correspondences unambiguous (projected-texture stereo). However, the projector does not need to be calibrated with respect to the cameras, since it serves only as a source of artificial texture.
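
Continuing the illustrative OpenCV sketches above, the reprojection matrix Q returned by the rectification step maps disparities directly to metric 3D coordinates; disparity, d_min, and rect_left refer to the earlier snippets.

```python
# Converting the disparity map into a metric point cloud with the Q matrix
# produced by cv2.stereoRectify (continuation of the sketches above).
import cv2

points_3d = cv2.reprojectImageTo3D(disparity, Q)  # H x W x 3 array of 3D coordinates
valid = disparity > d_min                         # keep pixels with a usable disparity
point_cloud = points_3d[valid]                    # N x 3 metric coordinates
colors = rect_left[valid]                         # optional texture for each point
```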

Visualization of the disparity estimate and the final point cloud using an SDK from The Imaging Source. Left: disparity map relative to the reference image. Middle: 3D view of the textured point cloud. Right: color-coded point cloud showing distance from the camera.

Acceleration via Graphics Processing Units (GPUs)

When high frame rates and high spatial resolutions are needed, modern GPUs calculate the 3D information at significantly higher speeds. For the integration of a stereo vision system into an existing environment, The Imaging Source, LLC, relies on modular solutions: the acquisition of 3D data can be achieved either with The Imaging Source's own C++ SDK, with optional GPU acceleration, in combination with cameras from The Imaging Source, or within MVTec's HALCON programming environment. While the SDK allows for easy calibration of stereo vision systems as well as the acquisition and visualization of 3D data, HALCON offers additional modalities such as hand-eye calibration for the integration of robotic systems and algorithms such as the registration of CAD models against acquired 3D data.

This article was written by Dr. Oliver Fleischmann, Project Manager at The Imaging Source, LLC (Charlotte, NC), and translated by Amy Groth. For more information, contact Dr. Fleischmann or visit The Imaging Source's website.




This article first appeared in the January 2018 issue of Photonics & Imaging Technology Magazine (Vol. 42 No. 1).
