Incremental Focus of Attention
for Robust Visual Tracking


Despite attempts to make visual tracking resistant to mistracking (for example, against background distractions), tracking inevitably fails under large visual perturbations, including rapid unexpected motions, changes in ambient illumination, and occlusions. Many of these failures are unavoidable; even the human visual system cannot track an object moving quickly in the dark behind a wall.

We therefore consider a different problem. If it is not possible to avoid mistracking altogether, a tracking system should at least be able to recover tracking of an object it has lost. The Incremental Focus of Attention (IFA) architecture provides a structure that, when given the entire camera image to search, efficiently focuses the attention of the system onto a narrow set of configurations that includes the target configuration.

IFA offers a means for automatic tracking initialization and reinitialization after environmental conditions momentarily deteriorate and cause the system to lose track of its target. Systems based on the framework degrade gracefully as various assumptions about the environment are violated. In particular, when constructed with multiple tracking algorithms of varying precision, the failure of a single algorithm causes another, less precise algorithm to take over, thereby allowing the system to return approximate information on feature location or configuration.

Experiments show that tracking systems based on the IFA framework are extremely robust to any type of temporary visual disturbance.

IFA Framework and Algorithm

The IFA framework is organized into layers. Each layer takes a set of object states (configurations) as input and returns a set of object states as output. Each layer excises unlikely configurations from its input set and passes a smaller output set on to the next layer. The topmost layer outputs a small set of configurations that pinpoints the target state at the precision desired. Each layer is based on tracking algorithms or search heuristics.

At any instant in time, processing occurs only in a single layer. If that layer's output set is believed to contain the target state, processing proceeds up the framework; otherwise, processing reverts back down the layers. In this fashion, the system remains in the highest layer possible during any given environmental situation.

Layers are classified into selectors, which focus attention on a particular region of the configuration space, and trackers, which serve both to confirm that the object examined is the target object and to actually track an object once found.

The framework is illustrated below. Each trapezoid represents a single layer, with the bases of the trapezoids representing input and output sets. On the right is the state transition graph that simplifies algorithm execution. Note that the states are divided into hunt and track states. In hunt mode, an IFA system concedes that it is not tracking the target object, but is performing a search to find it. In track mode, the system believes that it is tracking the target object.

[Figures: the IFA layer framework (left) and its state transition graph (right).]
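The distinction between hunt and track modes can be made concrete with a small sketch. The rule encoded here is an assumption made for illustration (only a tracker layer that has confirmed the target counts as being in track mode); the actual state transition graph is the one shown in the figure, and the names Kind, Mode, and mode_of are hypothetical.

```python
from enum import Enum

class Kind(Enum):
    SELECTOR = "selector"   # focuses attention on a region of configuration space
    TRACKER = "tracker"     # confirms the target and tracks it once found

class Mode(Enum):
    HUNT = "hunt"           # system concedes it is searching for the target
    TRACK = "track"         # system believes it is tracking the target

def mode_of(kind: Kind, target_confirmed: bool) -> Mode:
    """Assumed rule: only a tracker that has confirmed the target is tracking."""
    if kind is Kind.TRACKER and target_confirmed:
        return Mode.TRACK
    return Mode.HUNT

print(mode_of(Kind.SELECTOR, True).value)
print(mode_of(Kind.TRACKER, False).value)
print(mode_of(Kind.TRACKER, True).value)
```

Under this assumed rule a selector is always in hunt mode, while a tracker starts in hunt mode and moves to track mode only after it has confirmed the target.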

Tracking a Face

We have developed systems for tracking robot tools for visual servoing experiments and for tracking textured planar objects. We have also implemented a system for tracking faces, potentially useful in human-computer interfaces as well as in compression techniques for videoconferencing (where knowledge of the location of the face helps to determine which parts of the image should be updated often).

Our face tracking system has seven layers. Three selectors form the base of the system: Layers 0-2 focus attention on areas of the image that are flesh-colored and close to the last known position of the face. Layers 3-6 are all tracking layers. Layers 3 and 4 track a face based on color information alone, and the top two layers use a template-based algorithm to track a particular face. See our visual tracking system for details on the tracking algorithms.
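The layer arrangement just described can be written down as a small table in code. This is only a sketch of the structure given in the text: the LayerSpec type, the cue labels, and the describe helper are hypothetical, and because the text does not say how the three selectors divide the work, they share a single label here.

```python
from typing import List, NamedTuple

class LayerSpec(NamedTuple):
    index: int
    kind: str   # "selector" or "tracker"
    cue: str    # illustrative label for the information the layer relies on

SELECTOR_CUE = "flesh color and proximity to last known position"

FACE_TRACKER_LAYERS: List[LayerSpec] = [
    LayerSpec(0, "selector", SELECTOR_CUE),
    LayerSpec(1, "selector", SELECTOR_CUE),
    LayerSpec(2, "selector", SELECTOR_CUE),
    LayerSpec(3, "tracker", "color"),
    LayerSpec(4, "tracker", "color"),
    LayerSpec(5, "tracker", "template"),
    LayerSpec(6, "tracker", "template"),
]

def describe(level: int) -> str:
    """Return a one-line summary of the layer at the given index."""
    spec = FACE_TRACKER_LAYERS[level]
    return f"Layer {spec.index}: {spec.kind} ({spec.cue})"

for spec in FACE_TRACKER_LAYERS:
    print(describe(spec.index))
```

Listing the layers from least to most precise mirrors the order in which processing climbs as tracking conditions improve.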

[Figures: face tracking at Layer 2 (color-based) and at the highest layer (correlation-based pattern tracker).]

These figures show tracking at two layers: Layer 2 (where "tracking" is based on the color of the target object, in this case, a face) and the highest layer (a correlation-based pattern tracker). Tracking shifts between these two layers and the intermediate layers as various visual events happen. Whenever the face is completely visible, oriented as shown, and moving at a reasonable speed, high-layer tracking occurs.

The net effect of the IFA face tracker is that it tracks at various degrees of precision depending on the difficulty of the tracking environment, and furthermore, after a tracking failure, it recovers completely when good tracking conditions return. The system has been demonstrated many times both in and out of its normal environment and has survived rigorous testing by skeptical audiences (for example, at the CVPR '96 Demo Session). The subjective experience of watching the system in action is that it is extremely robust.

Although no claims are made about any similarities to biological vision systems, IFA was inspired by mammalian visual strategies, and some observers have noted that the behavior of the tracking system seems almost human, both in method and robustness: While unable to resist temporary failure during occlusions, for example, the IFA system notices mistracking, makes saccadic serial searches for the target, and gracefully recovers after the disturbances cease.

Current work includes formalization of the notion of "robustness" in causal domains and incorporation of a Bayesian theory of attention for low-layer selective processes.

To find out more about this work, see Incremental Focus of Attention for Robust Visual Tracking (1.6MB), a longer version of a paper presented at CVPR '96.
