Machine vision combines a range of technologies to provide useful outputs from the acquisition and analysis of images. Used primarily for inspection and robot guidance, the process must be performed reliably enough for industrial automation. This article provides an introduction to how today’s machine vision technology guides basic robotic functions.

Figure 1. Gray scale image.

Let’s go through a simple example of what happens during robot guidance. Take, for example, a stationary-mounted camera, a planar work surface, and a screwdriver that must be grasped by the robot. The screwdriver may be lying flat on that surface and mixed amongst, but not covered by, other items. The key steps executed during each cycle include:

  1. Acquire a suitable image.
  2. “Find” the object of interest (the overall screwdriver, or the part of it that must be grasped).
  3. Determine the object’s position and orientation.
  4. Translate this location to the robot’s co-ordinate system.
  5. Send the information to the robot.
  6. Using that information, the robot can then move to the proper position and orientation to grasp the object in a prescribed way.

While the machine vision portion (steps #1 through #5) may appear lengthy when explained, the entire sequence is usually executed within a few hundredths of a second.

#1 Acquire a suitable image: Several machine vision tools are described below. Each of these software components operates on an image and requires differentiation to “see” an object. This differentiation may be light vs. dark, color contrast, height (in 3D imaging), or transitions at edges. Note: It’s important to confirm, or design, the camera and lighting geometry so that the lighting creates reliable differentiation.

Imaging methods vary fundamentally. The most common are the gray scale and color versions of area scan imaging, which simply means a conventional picture taken and processed all at once. Less common options are line scan imaging, where the image is built during motion, one line at a time, and 3D profiling, where the third dimension of the image (“Z”) is coded into the value of each pixel.
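As a concrete illustration (not part of the original example), here is a minimal sketch of grabbing one area scan frame and converting it to gray scale using OpenCV in Python. The camera index and output file name are assumptions; an industrial camera would typically be driven through its vendor SDK or a GigE Vision/GenICam interface instead.

```python
# Minimal sketch: acquire one area scan frame and convert it to gray scale.
# The camera index (0) and output file name are assumptions for illustration.
import cv2

cap = cv2.VideoCapture(0)        # open the first attached camera
ok, frame = cap.read()           # grab a single color (BGR) frame
cap.release()

if not ok:
    raise RuntimeError("image acquisition failed")

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)   # 8-bit gray scale image
cv2.imwrite("acquired.png", gray)                # saved for the later steps
```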

Figure 2. Enhanced image, with found templates marked with yellow rectangles.

Points on a plane of interest vary in their distance from the camera, changing their apparent size; this issue is accentuated when the camera aim is not perpendicular to the surface. Optics may introduce barrel or pincushion distortion. Barrel distortion bulges lines outward in the center, like the lines or staves on a wooden barrel; pincushion does the opposite. A distortion correction tool is often used to remove these flaws. During a “teaching” stage, a known accurate array (such as a rectangular grid of dots) is placed at the plane of interest. The tool views the (distorted) image, and determines the image transformation required to correct it. During the “run” phase, this transformation is executed on each image.
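The following is a minimal sketch of that teach-then-run flow using OpenCV’s calibration routines. The grid dimensions, dot spacing, and file names are assumptions, and a real cell would follow the vision supplier’s own calibration procedure.

```python
# Minimal sketch of the "teach, then run" distortion-correction flow.
# Grid size, dot spacing, and file names are assumptions for illustration.
import cv2
import numpy as np

PATTERN = (7, 5)     # dots per row x dots per column in the known grid (assumed)
SPACING = 10.0       # dot-to-dot spacing in mm (assumed)

# Ideal (undistorted) grid positions, z = 0 on the planar work surface.
ideal = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
ideal[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SPACING

# --- "Teach" phase: view the known grid (several poses give a stable solution) ---
obj_pts, img_pts, size = [], [], None
for name in ["grid_view1.png", "grid_view2.png", "grid_view3.png"]:
    img = cv2.imread(name, cv2.IMREAD_GRAYSCALE)
    size = img.shape[::-1]
    found, centers = cv2.findCirclesGrid(img, PATTERN,
                                         flags=cv2.CALIB_CB_SYMMETRIC_GRID)
    if found:
        obj_pts.append(ideal)
        img_pts.append(centers)

# Solve for the camera matrix and the lens-distortion coefficients.
_, cam_mtx, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, size, None, None)

# --- "Run" phase: apply the learned correction to each new image ---
run_img = cv2.imread("acquired.png", cv2.IMREAD_GRAYSCALE)
corrected = cv2.undistort(run_img, cam_mtx, dist)
cv2.imwrite("corrected.png", corrected)
```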

#2 Find the object of interest: “Finding” the object requires creating a distinction between the object of interest and everything else that is in the field of view, including the background (such as a conveyor) or other objects. Here are some common methods:

  • Template matching: A template-matching tool is shown and trained on one or more images of the item of interest, like the round clips of the assembly in Figures 1 and 2. It may learn the entire image of the part, or certain features such as the geometry of the edges. During operation, the technology searches the field of view for a near-match to what it “learned.” Various image-processing and mathematical methods (such as normalized correlation) can accomplish this. Those based on edge geometries offer advantages for partially occluded objects, or a scale-invariance option when the camera’s distance from the object is variable. When the degree of match exceeds a minimum threshold, the object is “kept.” Figure 2 shows these results, where the software tool has found two clips that met the match criteria and marked them with yellow rectangles. (A code sketch of this search appears after the list below.)
  • Differentiation based on brightness: This method involves determining a brightness “threshold” on a gray scale image such that everything above or below that value is the object of interest (i.e., light objects on a dark background or vice versa). Most commonly this is a value between 0 and 255, corresponding to the 256 levels available in 8-bit coding for each pixel. The threshold value may be fixed, or it may adapt to varying light levels via a simple (average gray level) or complex (histogram-based) algorithm. The threshold is applied to the image, separating the object(s) of interest. (The color-distance sketch later in this section ends with this same thresholding step.)
  • Differentiation based on color: Color is best addressed by transforming each pixel of the image into a “distance” from the trained color sample set in a 3-axis color space. Color representation methods usually characterize a color by 3 coefficients. RGB (Red, Green, Blue) is common and native to most imaging and display processes. Triplets of coefficients require a three-dimensional graph called a “color space”; R, G, and B, for example, each lie on one of three mutually orthogonal axes. The “distance” between the points representing two colors in this space is the three-dimensional Pythagorean distance between them, √[(R1 − R2)² + (G1 − G2)² + (B1 − B2)²]. The “trained color” can be that of either the desired object or the background.
Figure 3. Color image. The desired object is the screwdriver handle.

Figure 3 shows the original image before a color tool is trained on the shades of red present in the screwdriver handle. Running the tool on the color image transforms it into the synthetic image shown in Figure 4, where the shade of each pixel represents the distance in 3D color space (the closeness of the match) to the trained color. Finally, in Figure 5, the handle is uniquely objectified using thresholding, marking the object green on the display. (A code sketch of this color-distance-and-threshold sequence appears below.)

  • Differentiation based on height: This technique is used on images where the third dimension is scanned and coded into the pixel values as previously described. This synthetic image may then be processed in the same manner.
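Here is a minimal sketch of the template-matching search described above, using normalized correlation in OpenCV. The file names and the 0.8 acceptance threshold are assumptions for illustration.

```python
# Minimal sketch of template matching with normalized correlation.
# File names and the 0.8 acceptance threshold are assumptions.
import cv2
import numpy as np

scene = cv2.imread("corrected.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("clip_template.png", cv2.IMREAD_GRAYSCALE)  # trained image of the clip

# Score every position in the field of view against the trained template.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)

# Keep every location whose degree of match exceeds the minimum threshold.
ys, xs = np.where(scores >= 0.8)
h, w = template.shape
marked = cv2.cvtColor(scene, cv2.COLOR_GRAY2BGR)
for x, y in zip(xs, ys):
    # mark each find with a yellow rectangle, as in Figure 2
    cv2.rectangle(marked, (int(x), int(y)), (int(x) + w, int(y) + h), (0, 255, 255), 2)
cv2.imwrite("matches.png", marked)
```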

All methods may retain multiple eligible objects, in which case a choice must be made among them based on some additional criterion, such as “first in line.”
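Below is a minimal sketch of the color-distance transform of Figures 3 through 5: each pixel becomes its Pythagorean distance in color space from a trained color, and a threshold then isolates the handle. The trained color values, file names, and threshold are assumptions, and the final threshold call is the same operation used for brightness-based differentiation.

```python
# Minimal sketch of the color-distance transform and threshold (Figures 3-5).
# Trained color, file names, and the threshold value are assumptions.
import cv2
import numpy as np

image = cv2.imread("screwdriver.png")                  # BGR color image
trained_color = np.array([40, 40, 180], np.float32)    # B, G, R of the trained red (assumed)

# 3D color-space (Pythagorean) distance of every pixel from the trained color.
diff = image.astype(np.float32) - trained_color
distance = np.sqrt((diff ** 2).sum(axis=2))

# Rescale to 0-255 so the distance map can be viewed as a gray scale image (Figure 4).
distance_img = cv2.normalize(distance, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

# Threshold the distance map: pixels close to the trained color become the object
# (Figure 5). Applied directly to a gray scale image, the same call implements
# brightness-based differentiation.
_, mask = cv2.threshold(distance_img, 60, 255, cv2.THRESH_BINARY_INV)
cv2.imwrite("handle_mask.png", mask)
```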

#3 Determine the position and orientation of the object: For our example, the results of this stage are the x and y co-ordinates of the object and the angle of its orientation. Sometimes this function is performed as part of the previous “find” procedure; for example, a template-match tool might supply position and orientation data on the part it has located. The addition of simple software tools that provide feature computation or geometric analysis will generally complete this task.
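As one possible implementation, the following sketch extracts a position and orientation from the thresholded mask produced earlier by fitting a minimum-area rectangle. The file name is an assumption, and image moments would serve equally well for the centroid.

```python
# Minimal sketch: derive position and orientation from the thresholded mask
# by fitting a minimum-area rectangle. The file name is an assumption.
import cv2

mask = cv2.imread("handle_mask.png", cv2.IMREAD_GRAYSCALE)
# OpenCV 4.x returns (contours, hierarchy)
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

largest = max(contours, key=cv2.contourArea)        # assume the biggest blob is the object
(cx, cy), (w, h), angle = cv2.minAreaRect(largest)  # center (pixels) and orientation (degrees)
print(f"object at x={cx:.1f}, y={cy:.1f}, angle={angle:.1f} deg")
```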

Figure 4. Transformation of color image based on 3D color space distance.

#4 Translate the information to the co-ordinate system of the robot: The vision system and the robot each innately have their own co-ordinate system to represent location, an orthogonal “x” and “y.” To communicate with the robot, one system’s values must be translated into the other’s; this is usually handled by the vision system.

Besides permanent innate differences, other small errors may be introduced. A simple addition or subtraction of a correction factor to the x and y values can provide first-order correction and translation for these factors. A tool designed for this purpose operates in two modes: a “learn/calibrate” mode (where the robot may be stopped with a target on it in view of the camera) and a “run” mode, in which the correction or translation is applied. The x and y offsets between the systems are set during the calibration sequence and applied to each measurement during the run.
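A minimal sketch of that first-order correction might look like the following. The class and method names are hypothetical, and real cells often add scale and rotation terms (a full affine transform) on top of simple offsets.

```python
# Minimal sketch of first-order vision-to-robot offset correction.
# The class and method names are hypothetical illustrations.
class VisionToRobot:
    def __init__(self):
        self.dx = 0.0
        self.dy = 0.0

    def calibrate(self, vision_xy, robot_xy):
        """Learn mode: robot stopped with a target on it in view of the camera."""
        self.dx = robot_xy[0] - vision_xy[0]
        self.dy = robot_xy[1] - vision_xy[1]

    def to_robot(self, vision_xy):
        """Run mode: apply the learned x and y offsets to each measurement."""
        return (vision_xy[0] + self.dx, vision_xy[1] + self.dy)

# Example: one calibration point, then translation of a found object.
cal = VisionToRobot()
cal.calibrate(vision_xy=(120.5, 88.0), robot_xy=(310.5, 140.0))
print(cal.to_robot((132.7, 95.4)))   # -> (322.7, 147.4)
```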

Figure 5. Handle objectified using thresholding, marked in green.

#5 This information is sent to the robot controller: Interfaces can be visualized as having “layers,” each of which must be matched between the two systems. The bottom layers are the familiar general types (typically Ethernet or RS232). The top layers are the format and sequence protocol for the data itself and its transfer. It is still common for one side of the link (the robot) to define this as a rigid proprietary protocol. When this is the case, the protocol should be stable and documented; the machine vision supplier then often creates a custom translator to that “language.”
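For illustration, here is a minimal sketch of sending a result over Ethernet with a simple ASCII message. The IP address, port, and message format are hypothetical; a real integration would follow the robot vendor’s documented protocol exactly.

```python
# Minimal sketch of sending x, y, angle to a robot controller over Ethernet.
# The host, port, and ASCII message format are hypothetical.
import socket

def send_to_robot(x, y, angle, host="192.168.0.10", port=30002):
    message = f"{x:.2f},{y:.2f},{angle:.2f}\r\n"        # assumed comma-separated format
    with socket.create_connection((host, port), timeout=2.0) as sock:
        sock.sendall(message.encode("ascii"))
        reply = sock.recv(64)                           # many protocols acknowledge each packet
    return reply

# send_to_robot(322.7, 147.4, 41.5)
```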

#6 The robot uses this information to move to the correct position and orientation to grasp the object: The vision system tells the robot (specifically, the robot controller) where to go, not how to get there. In other configurations (especially with robot-mounted cameras), the vision system may continue to operate during the move to provide feedback for higher accuracy.

Conclusion

Using a simplified example application, we can see the basic steps of how machine vision guides robots. Most applications have additional complexities in one or more areas. Many of these (such as when the part is moving on a conveyor, or when the camera is mounted on the robot itself) are common and are addressed by additional technologies, tools, and methods that are currently available.

This article was written by Fred D. Turek, COO at FSI Technologies, Inc. (Lombard, IL).