Spatiotemporal Oriented Energy Features for Visual Tracking

Contributors

  • Kevin Cannons
  • Richard P. Wildes

Overview

Target tracking is a critical component of a wide range of computer vision applications, including surveillance, smart rooms, and human-computer interfaces. To facilitate accurate tracking, features must be selected that distinguish targets from the background and from one another while remaining robust to photometric and geometric distortions. This work presents a novel feature set for visual tracking that is derived from "oriented energies". More specifically, energy measures are used to capture a target's multiscale orientation structure across both space and time, yielding a rich description of its spatiotemporal characteristics. To illustrate utility with respect to a particular tracking mechanism, we show how to instantiate oriented energy features efficiently within the mean shift estimator. Empirical evaluations of the resulting algorithm illustrate that it excels in certain important situations, such as tracking in clutter with multiple similarly colored objects and in environments with changing illumination. Many trackers fail when presented with these types of challenging video sequences.

Why Spatiotemporal Oriented Energy Features?

There are a number of advantages of using the proposed oriented energy feature set for visual tracking:

  • Tracking can be completed in the presence of significant clutter, because integrated spatial and temporal, multiscale, orientation analysis reveals distinctive target signatures.
  • Tracking can be performed throughout substantial illumination changes, because oriented energies can be normalized to become invariant to local image contrast.
  • Computation of oriented energies involves only 3D separable convolution and pointwise nonlinear operations, making them amenable to compact, efficient implementation.

Oriented Energy Computation

Events in a video sequence generate diverse structures in the spatiotemporal domain. For instance, a textured, stationary object produces a much different signature in image space-time than the same object does when it is moving. One method of capturing the spatiotemporal characteristics of a video sequence is through the use of oriented energies. These energies are derived from the responses of orientation-selective bandpass filters convolved with the spatiotemporal volume produced by a video stream. Responses of filters that are oriented parallel to the image plane are indicative of the spatial pattern of observed surfaces and objects (e.g., spatial texture), whereas orientations that extend into the temporal dimension capture dynamic aspects (e.g., velocity and flicker).

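For this work, filtering was performed using broadly tuned, steerable, separable filters based on the second derivative of a Gaussian and their corresponding Hilbert transforms. Filtering was executed across ten orientations and three scales using a Gaussian pyramid formulation. A measure of local energy, E, can then be computed according to

\[
E(\mathbf{x}; \theta, \sigma) = \left[ G_2(\theta, \sigma) * I(\mathbf{x}) \right]^2 + \left[ H_2(\theta, \sigma) * I(\mathbf{x}) \right]^2
\]

where * denotes convolution and
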
  • G_2 are Gaussian second derivative filters
  • H_2 are the corresponding Hilbert transform filters
  • θ is the 3D orientation at which filtering is being performed
  • σ is the scale at which filtering is being performed
  • x = (x, y, t) corresponds to spatiotemporal image coordinates
  • I(x) is the image sequence
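
The quadrature-pair idea behind this energy measure can be illustrated with a deliberately simplified sketch. The code below is not the 3D steerable filtering described here: it assumes a one-dimensional, periodic signal, samples a Gaussian second derivative directly, and obtains the Hilbert-transform response with an FFT rather than with a designed H_2 filter. The function names g2_kernel_1d and quadrature_energy_1d are hypothetical. It shows that summing the squared responses of the two filters gives a measure that depends on the strength of the oriented structure but not on its phase.

```python
import numpy as np

def g2_kernel_1d(sigma):
    """Sampled second derivative of a Gaussian (illustrative 1D stand-in for G2)."""
    radius = int(np.ceil(4 * sigma))
    x = np.arange(-radius, radius + 1, dtype=float)
    kernel = (x**2 / sigma**4 - 1.0 / sigma**2) * np.exp(-x**2 / (2 * sigma**2))
    return kernel - kernel.mean()  # remove the small DC residue left by sampling

def quadrature_energy_1d(signal, sigma):
    """E = (G2 * I)^2 + (H2 * I)^2 for a periodic 1D signal.

    Assumes len(signal) is at least the kernel length. The H2 response is
    computed as the Hilbert transform of the G2 response, which is equivalent
    because convolution and the Hilbert transform commute.
    """
    signal = np.asarray(signal, dtype=float)
    n = signal.size
    kernel = g2_kernel_1d(sigma)
    radius = kernel.size // 2

    # Circular convolution via the FFT, with the kernel centred on sample 0.
    padded = np.zeros(n)
    padded[:kernel.size] = kernel
    padded = np.roll(padded, -radius)
    spectrum = np.fft.fft(signal) * np.fft.fft(padded)

    # Analytic signal: keep DC (and Nyquist), double positive frequencies,
    # zero negative frequencies.
    gain = np.zeros(n)
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        gain[1:n // 2] = 2.0
    else:
        gain[1:(n + 1) // 2] = 2.0
    analytic = np.fft.ifft(spectrum * gain)

    # Real part is the G2 response, imaginary part is its Hilbert transform.
    return analytic.real**2 + analytic.imag**2

# Two sinusoids with the same frequency but different phase.
t = np.arange(256)
energy_cos = quadrature_energy_1d(np.cos(2 * np.pi * 8 * t / 256), sigma=4.0)
energy_sin = quadrature_energy_1d(np.sin(2 * np.pi * 8 * t / 256), sigma=4.0)
```

In this toy setting the two energy signals agree and are constant along the signal, whereas either squared response alone would oscillate with the phase of the input.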

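To attain a purer measure of energy that is more robust to illumination changes, normalization is performed, according to

\[
\hat{E}(\mathbf{x}; \theta, \sigma) = \frac{E(\mathbf{x}; \theta, \sigma)}{\sum_{\tilde{\sigma}} \sum_{\tilde{\theta}} E(\mathbf{x}; \tilde{\theta}, \tilde{\sigma}) + \epsilon}
\]

where the sums run over all scales and orientations at which filtering is performed and ε is a bias term that avoids instabilities when the energy content is small.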
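
A minimal NumPy sketch of this normalization follows. It assumes that the energies for all scale/orientation channels have been brought to a common resolution and stacked along the first axis, and that the normalizing sum runs over all of those channels at each point. The function name normalize_energies and the value of epsilon are illustrative choices, not values from this work.

```python
import numpy as np

def normalize_energies(energies, epsilon=1e-3):
    """Divide each channel by the pointwise sum over all channels plus a bias.

    energies: array of shape (channels, ...) with non-negative entries, one
    channel per scale/orientation combination (assumed layout).
    """
    energies = np.asarray(energies, dtype=float)
    total = energies.sum(axis=0, keepdims=True)
    return energies / (total + epsilon)

# Doubling image contrast doubles every linear filter response and therefore
# multiplies every energy by four; the normalized values barely change.
rng = np.random.default_rng(0)
energies = rng.uniform(1.0, 2.0, size=(30, 4, 4))  # e.g. 10 orientations x 3 scales
normalized = normalize_energies(energies)
normalized_bright = normalize_energies(4.0 * energies)

# Where there is no energy at all, the bias term keeps the result finite.
flat = normalize_energies(np.zeros((30, 4, 4)))
```

The bias term also acts as a floor: in regions with almost no structure the normalized energies stay close to zero instead of amplifying noise.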

Pictorial Example of Oriented Energies

The figure below demonstrates the power and effectiveness of an oriented energy representation. The raw intensity image shows a white car moving to the left at a cluttered traffic intersection. When considering the derived energy images, the following observations can be made:

  • The channel tuned for horizontal structure captures the overall orientation structure of the white car.
  • While the channel tuned for vertical textures captures the outline of the crosswalks, it correctly shows little response to the car.
  • The energy channel that is tuned to leftward motion effectively distinguishes the white car from the static background.
  • The energies become more diffuse and capture more gross structure at the coarser scale.

Sample energies for a traffic video.
A single input frame from a traffic video sequence (top row) with corresponding sample energies (bottom two rows) computed at three orientations and two scales.

Histograms for Mean Shift Tracking

The proposed feature set should be widely applicable to visual tracking. In the present work, to illustrate its effectiveness with respect to a particular mechanism, we instantiate the features in the mean shift paradigm. Accordingly, we collapse the spatial information in our initial energy measurements and represent the target as a histogram. Each histogram bin corresponds to the weighted energy content of the target at a particular scale and orientation. The following diagram illustrates the process of generating an oriented energy histogram from a candidate target region.

Energy histogram construction.
Generating an oriented energy histogram for a candidate target region in an image.

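Mathematically, target candidate histograms are defined as

\[
\hat{p}_u(\mathbf{y}) = C_h \sum_{i} k\!\left( \left\| \frac{\mathbf{y} - \mathbf{x}_i^*}{h} \right\|^2 \right) \hat{E}(\mathbf{x}_i^*; \Phi_u)
\]

where
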
  • k is the profile of the tracking kernel
  • y is the center of the target candidate's tracking window
  • h is the bandwidth of the tracking kernel
  • C_h is a normalization constant
  • x_i* is a single target pixel at some temporal instant
  • Φ_u is the scale/orientation combination that corresponds to bin u of the histogram
  • Ê are the normalized energies.
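
The histogram construction can be sketched for a single frame as follows. The sketch assumes an Epanechnikov kernel profile (the kernel is not specified here), a scalar bandwidth, and a normalization constant chosen so that the bins sum to one. The names epanechnikov_profile and energy_histogram are hypothetical.

```python
import numpy as np

def epanechnikov_profile(r2):
    """Kernel profile k(r^2) = 1 - r^2 inside the unit disc, 0 outside (assumed)."""
    return np.where(r2 <= 1.0, 1.0 - r2, 0.0)

def energy_histogram(norm_energies, center, bandwidth):
    """Kernel-weighted sum of normalized energies, one bin per channel.

    norm_energies: array of shape (bins, height, width), one channel per
                   scale/orientation combination (assumed layout).
    center:        (row, col) of the tracking window.
    bandwidth:     kernel bandwidth h in pixels.
    """
    _, height, width = norm_energies.shape
    rows, cols = np.mgrid[0:height, 0:width]
    r2 = ((rows - center[0]) ** 2 + (cols - center[1]) ** 2) / bandwidth**2
    weights = epanechnikov_profile(r2)
    hist = (norm_energies * weights).sum(axis=(1, 2))
    total = hist.sum()
    # Plays the role of C_h: scale the bins so that they sum to one.
    return hist / total if total > 0 else hist

# Toy example: three channels, with channel 1 responding strongly on a
# small patch at the centre of the window.
toy = np.full((3, 21, 21), 0.1)
toy[1, 8:13, 8:13] = 0.8
hist = energy_histogram(toy, center=(10, 10), bandwidth=6.0)
```

Because the kernel down-weights pixels near the edge of the window, energy close to the window center dominates the histogram, which limits the influence of background pixels at the target boundary.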

The original target template histogram is defined in an analogous fashion. Having defined histogram representations for the target template and candidates when using oriented energy features, we simply employ these histograms in the standard mean shift formulation to efficiently estimate the spatial position of the tracking window in each frame.
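
One mean shift update with energy histograms might look like the sketch below. The per-pixel weights follow from linearizing the Bhattacharyya coefficient between template and candidate histograms, as in standard kernel-based mean shift tracking. That specific weight formula, the Epanechnikov profile (for which the update reduces to a weighted centroid over the window), and all function names are assumptions made for illustration rather than details stated here.

```python
import numpy as np

def window_terms(norm_energies, center, bandwidth):
    """Pixel coordinates, kernel profile values, and in-window mask."""
    _, height, width = norm_energies.shape
    rows, cols = np.mgrid[0:height, 0:width]
    r2 = ((rows - center[0]) ** 2 + (cols - center[1]) ** 2) / bandwidth**2
    inside = r2 <= 1.0
    profile = np.where(inside, 1.0 - r2, 0.0)  # Epanechnikov profile (assumed)
    return rows, cols, profile, inside

def energy_histogram(norm_energies, center, bandwidth):
    """Kernel-weighted energy histogram, normalized to sum to one."""
    _, _, profile, _ = window_terms(norm_energies, center, bandwidth)
    hist = (norm_energies * profile).sum(axis=(1, 2))
    return hist / hist.sum()

def mean_shift_step(norm_energies, template_hist, center, bandwidth, tiny=1e-12):
    """Return the updated window centre after one mean shift iteration."""
    rows, cols, _, inside = window_terms(norm_energies, center, bandwidth)
    candidate_hist = energy_histogram(norm_energies, center, bandwidth)
    ratio = np.sqrt(template_hist / (candidate_hist + tiny))
    # w_i = sum_u E_hat(x_i; Phi_u) * sqrt(q_u / p_u(y0)), restricted to the window.
    weights = np.einsum("u,uij->ij", ratio, norm_energies) * inside
    total = weights.sum()
    return np.array([(weights * rows).sum() / total,
                     (weights * cols).sum() / total])

def make_frame(blob_center, size=41, half=4):
    """Toy two-channel energies: background favours channel 0, target channel 1."""
    frame = np.empty((2, size, size))
    frame[0], frame[1] = 0.8, 0.1
    r, c = blob_center
    frame[0, r - half:r + half + 1, c - half:c + half + 1] = 0.1
    frame[1, r - half:r + half + 1, c - half:c + half + 1] = 0.8
    return frame

bandwidth = 8.0
start = np.array([20.0, 20.0])
template = energy_histogram(make_frame((20, 20)), start, bandwidth)

# In the next frame the target has moved to (22, 23).
frame = make_frame((22, 23))
first = mean_shift_step(frame, template, start, bandwidth)
estimate = start.copy()
for _ in range(30):
    estimate = mean_shift_step(frame, template, estimate, bandwidth)
```

In this toy example the estimate moves from the previous position toward the displaced target. A practical tracker would add a convergence test on the step size and handle windows that extend past the image border.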

Sample Results

The performance of the oriented energy-based mean shift tracker has been evaluated on an illustrative set of test sequences. For comparative purposes, a mean shift tracker based on RGB color space was also developed and tested. Other than the use of different histograms, the two trackers were identical. Notice how the proposed tracker outperforms the color-based tracker during illumination changes (first and second video), the tracking of a target similar in color to other scene objects (second video), and the tracking of a target in cluttered, low-quality footage (second video). Additional examples are provided in our paper.

Color tracker results Energy tracker results
This example demonstrates the robustness of the proposed oriented energy feature set to extreme illumination changes. Color and oriented energy-based results shown on left and right, respectively.

Color tracker results Energy tracker results
This example demonstrates the ability of the proposed tracker to follow a target that shares similar color characteristics to other scene objects throughout low-quality, cluttered video with numerous cast shadows. Color and oriented energy-based results shown on left and right, respectively.
