Coherent Image Regions for Matching and Registration

The philosophy behind this work is that coherent image regions provide a concise and stable basis for image representation: concise meaning that the required space for representing the image is small, and stable meaning that the representation is robust to changes in both viewpoint and photometric imaging conditions. On this webpage, we provide a brief overview into the work and refer the reader to the papers for more detailed information. Please contact us with any further questions.

We are interested in the problems of scene retrieval and scene mapping. Our underlying approach to solving these problems is region-based: a coherent region is a connected set of relatively homogeneous pixels in the image. For example, a red ball would project to a red circle in the image, or the stripes on a zebra's back would be coherent vertical stripes.

Our approach is a "middle ground" between the two popular approaches in image description: local region descriptors (e.g. Schmid and Mohr [PAMI, 1997] and Lowe [IJCV, 2004]) and global image segmentation (e.g. Malik et al [ICCV, 1999; PAMI, 2002]). We focus on creating interest operators for coherent regions and robust, but concise descriptors for the regions. To that end, we develop a sparse grouping algorithm that functions in parallel over several scalar image projections (feature spaces). We use kernel-based optimization techniques to create a continuous scale-space of the coherent regions. The optimization evaluates both the size (large regions are expected to be stable over widely disparate views) and the coherency (e.g. similar color, texture, etc). The descriptor for a given region is simply a vector of kernel-weighted means over the feature spaces. The description is concise, it is stable under drastic changes in viewpoint, and it is insensitive to photometric changes (given insensitive feature spaces).

We provide a brief explanation and some examples for the parts of this work: detection, matching, and registration

Detection and Description
We represent by a Gaussian kernel to facilitate continuous optimization techniques in detect (and registration). The kernels are applied to scalar projections of the image; the intuition is that various projection functions will map a region of consistent image content to a homogeneous image patch in the scalar field. Below, we show an image, its projection under neighborhood variance and the extracted regions. (The regions are drawn as ellipses corresponding to 3 standard deviations of the kernel.)

Here, we show the same image and its complete region-representation in color space.

We describe each region by the vector of kernel-weighted means under all projections that were used during detection. This representation is both concise and stable. We show a table comparing its memory footprint versus two other methods (Lowe's SIFT and Carson, Malik et al.'s Blobworld). These results are for our dataset (discussed below). We are grateful to both of the other groups for providing their source code/binaries which facilitated this analysis.
  Average Number of Elements Size per element (in Words) Average size (in Words)
Our Technique 159 5 797
Blobworld 9 239 2151
SIFT 695 32 22260
Next, we show empirical evidence that our approach is stable under gross viewpoint changes. We distort the input image by various affine transformations and detect the regions. From the figure below, we can see that roughly the same regions are detected. Note, that our current kernel is axis-aligned; in order to make the kernel fully affine-invariant we would have to add a 5 parameter for rotation and a 6 (a pre-kernel image warp) for image skew.

Currently, we use a simple nearest neighbor analysis on the regions' feature vector to measure similarity and voting (over the whole database) for image retrieval. Below is an example of both a positive and negative match.



To compare our approach to the other two approaches mentioned earlier, we performed an image retrieval experiment on a set of 48 images taken of the same, indoor scene. All images can be found here. Two images are considered matching if there is an pixel-area overlap. We use the standard precision (fraction of true-positive matches from all retrieved) and recall (fraction of matching images retrieved against the total possible matching images in the database). A sample of the database is below.

We see the SIFT performs the best for the standard retrieval experiment. This results agrees with those found by Mikolajczyk and Schmid (CVPR, 2003). Note that SIFT is storing substantially more information that both our method and Blobworld. Next, we compare out method to SIFT after distorting the query images drastically by halving the aspect ratio. We find that our method is very robust to such a change and surpasses SIFT in the precision-recall plot.



Created and Maintained by Jason Corso. © 2004,2005