Scene understanding is fundamental to many computer vision applications such as autonomous driving, robot navigation and human-machine interaction; visual object counting and localization are important building blocks of scene understanding. In this dissertation, we present 1) a framework that employs doubly stochastic Poisson (Cox) processes to estimate the number of instances of an object in an image and 2) a Bayesian model that localizes multiple instances of an object using counts from image sub-regions.
Poisson processes are well suited to modeling events that occur randomly in space, such as the locations of objects in an image or the enumeration of objects in a scene. Our Cox model takes as input the output of a convolutional neural network (CNN) classifier that has been trained to classify image sub-regions for the presence of the target object. A density function, estimated via inference in the Cox process, is then used to estimate the count. Despite the flexibility and versatility of Poisson processes, their application to large datasets is limited because their computational complexity and storage requirements do not scale well with image size. To mitigate this problem, we employ Kronecker algebra, which exploits the tensor product structure of the covariance matrices. Because the likelihood is non-Gaussian, we use the Laplace approximation for inference, computed with the conjugate gradient method and Newton’s method. The resulting approach achieves close to linear computational complexity. We demonstrate the counting results on both simulated data and real-world datasets, and compare them with state-of-the-art counting methods.
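As an illustration of why Kronecker structure helps, the following minimal sketch multiplies a vector by a covariance matrix of the form K_rows ⊗ K_cols without ever forming the full matrix. It assumes a squared-exponential kernel on a regular grid of sub-regions; the kernel choice, grid size, and all function names are hypothetical and are not taken from the dissertation. Such a product is the basic operation a conjugate gradient solver would call repeatedly.

```python
import numpy as np

def kron_matvec(A, B, x):
    """Return (A kron B) @ x without building the Kronecker product.

    A and B are square; x has length A.shape[0] * B.shape[0] and is
    interpreted as a row-major flattening of an (n_a, n_b) grid.
    """
    n_a, n_b = A.shape[0], B.shape[0]
    X = x.reshape(n_a, n_b)
    # Identity: (A kron B) vec(X) = vec(A X B^T) for row-major vec.
    return (A @ X @ B.T).ravel()

def rbf_kernel(points, length_scale):
    """Squared-exponential kernel on 1-D points (an assumed choice)."""
    d = points[:, None] - points[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

# Hypothetical 6 x 8 grid of image sub-regions.
rows = np.arange(6, dtype=float)
cols = np.arange(8, dtype=float)
K_rows = rbf_kernel(rows, 2.0)
K_cols = rbf_kernel(cols, 2.0)

rng = np.random.default_rng(0)
v = rng.standard_normal(rows.size * cols.size)

fast = kron_matvec(K_rows, K_cols, v)      # uses only the small factors
dense = np.kron(K_rows, K_cols) @ v        # reference: full 48 x 48 matrix
max_abs_error = float(np.max(np.abs(fast - dense)))
```

The structured product stores only the two small factor matrices, whereas the dense reference stores a matrix whose side length is the total number of grid cells.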
In addition, we consider the problem of quickly localizing multiple instances of an object by asking questions of the form "How many instances are there in this set?" and receiving noisy answers. This setting generalizes the game of 20 questions to multiple targets. We evaluate the performance of a policy using the expected entropy of the posterior distribution after a fixed number of questions with noisy answers. We derive a lower bound for the value of this problem and study a specific policy, named the dyadic policy. We show that this policy achieves a value that is no more than twice this lower bound when answers are noise-free, and we show a more general constant-factor approximation guarantee for the noisy setting. We present an empirical evaluation of this policy on simulated data for the problem of detecting multiple instances of the same object in an image. Finally, we present experiments on localizing multiple objects simultaneously in real images.
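The counting-question setting can be sketched on a toy discrete problem: two objects on a grid of eight cells, a joint posterior over their locations, and a Bayesian update after each count answer. The query sets below are built from the binary digits of the cell index, which is one natural reading of "dyadic" and is an assumption here, not the dissertation's definition. The noise matrix, grid size, and all names are likewise hypothetical, and the noise-free case is used for simplicity.

```python
import itertools
import numpy as np

def entropy_bits(p):
    """Shannon entropy in bits of a discrete distribution."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def update(posterior, states, query_set, answer, noise):
    """Bayes update after observing a (possibly noisy) count answer.

    noise[c, a] is the assumed probability of answer a given true count c.
    """
    counts = np.array([sum(loc in query_set for loc in s) for s in states])
    post = posterior * noise[counts, answer]
    return post / post.sum()

n_cells, n_objects, n_bits = 8, 2, 3
states = list(itertools.product(range(n_cells), repeat=n_objects))
posterior = np.full(len(states), 1.0 / len(states))  # uniform joint prior

noise = np.eye(n_objects + 1)  # identity matrix = noise-free answers
truth = (5, 2)                 # hypothetical true object locations

entropies = [entropy_bits(posterior)]
for bit in range(n_bits):
    shift = n_bits - 1 - bit   # most significant digit first
    query_set = {c for c in range(n_cells) if (c >> shift) & 1}
    answer = sum(loc in query_set for loc in truth)  # true count in the set
    posterior = update(posterior, states, query_set, answer, noise)
    entropies.append(entropy_bits(posterior))
```

In this toy run the posterior entropy decreases after every question but does not reach zero, because a count reveals how many objects lie in the query set and not which ones. Replacing the identity matrix with a row-stochastic matrix would model noisy answers under the same update.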
Purnima Rajan is a Ph.D. candidate in the Department of Computer Science at the Johns Hopkins University. She is advised by Prof. Gregory Hager and Prof. Bruno Jedynak and is a member of the Computational Interaction and Robotics Lab. Her current research focuses on statistical modeling to improve object counting and localization in image data. Her research interests include computer vision, machine learning, and statistical methods applied to image analysis.