DARPA’s Mind’s Eye Program aims to develop a smart camera surveillance system that can autonomously monitor a scene and report back human-readable text descriptions of activities that occur in the video. An important aspect is whether objects are brought into the scene, exchanged between persons, left behind, picked up, etc. While some objects can be detected with an object-specific recognizer, many others are not well suited to this type of approach. For example, a carried object may be too small relative to the resolution of the camera to be easily identifiable, or an unusual object, such as an improvised explosive device, may be too rare or unique in its appearance to have a dedicated recognizer. Hence, a generic object detection capability is used, which can locate objects without a specific model of what to look for. This approach can detect objects even when they are partially occluded by, or overlapping with, humans in the scene.
The first step in the generic object detection algorithm is to learn a model of the scene background. New video frames are then compared against the background model; regions that deviate from it are identified as foreground regions. Separately, a human detection and pose estimation algorithm is applied to each video frame to directly locate humans based on appearance, yielding a set of 2D skeletons and associated likelihood scores. The 2D skeletons are then lifted to full 3D pose estimates through a nonlinear optimization procedure. The posed 3D humanoid model can then be projected through a camera model to give a predicted silhouette in the image plane for each detected person. Regions of the foreground that are not explained by the humanoid projections are labeled as potential objects. Temporal analysis (tracking) can be used to disambiguate real objects from false alarms resulting from imperfect frame-by-frame pose estimation.
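The core of the pipeline above — background modeling, foreground extraction, and subtraction of the predicted human silhouettes — can be sketched in a few lines. The following is a minimal illustration only, not the actual JPL implementation: it assumes a simple running-average background model and a precomputed boolean silhouette mask standing in for the projected 3D humanoid model; the function names and thresholds are hypothetical.

```python
import numpy as np

def update_background(background, frame, alpha=0.05):
    """Running-average background model with exponential forgetting
    (a stand-in for whatever background model the system learns)."""
    return (1.0 - alpha) * background + alpha * frame

def foreground_mask(background, frame, threshold=25.0):
    """Pixels deviating from the background model by more than
    `threshold` gray levels are labeled foreground."""
    return np.abs(frame - background) > threshold

def candidate_object_mask(fg_mask, human_silhouette_mask):
    """Foreground pixels not explained by the projected human
    silhouettes are flagged as potential objects."""
    return fg_mask & ~human_silhouette_mask

# Toy 8x8 grayscale scene: flat background at gray level 100.
background = np.full((8, 8), 100.0)
frame = background.copy()
frame[2:4, 2:4] = 200.0   # a "person" enters this region
frame[6:7, 6:8] = 200.0   # a small object left in the scene

# Assume the pose-estimation / projection stage produced this mask.
human = np.zeros((8, 8), dtype=bool)
human[2:4, 2:4] = True

fg = foreground_mask(background, frame)
objects = candidate_object_mask(fg, human)
# `objects` is True only where foreground is unexplained by a human.
```

In the toy frame above, the person region is foreground but is explained by the silhouette mask, so only the two left-behind-object pixels survive into `objects`; a real system would then track such regions over time to reject false alarms.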
This work was done by Michael C. Burl, Russell L. Knight, and Kimberly K. Furuya of Caltech for NASA’s Jet Propulsion Laboratory.