Applying Vision to Intelligent Human-Computer Interaction

As powerful, affordable computers and sensors become nearly ubiquitous, the prospects for building highly intelligent and convenient computing systems have never been better. Vision holds great promise for advanced human-computer interaction systems. We investigate various techniques to integrate passive vision into different interaction environments.

First, we propose a novel approach to integrating visual tracking into a haptic system. Traditional haptic environments require the user to be attached to the haptic device at all times, even though force feedback is not always being rendered. We design and implement an augmented reality system called VisHap that uses visual tracking to seamlessly integrate force feedback with tactile feedback to generate a "complete" haptic experience. The VisHap framework allows the user to interact with combinations of virtual and real objects naturally, thereby combining active and passive haptics. The flexibility and extensibility of our framework are promising in that they support many interaction modes and allow further integration with other augmented reality systems.

Second, we propose a new methodology for vision-based human-computer interaction called the Visual Interaction Cues (VICs) paradigm. VICs is based on the concept of sharing perceptual space between the user and the computer. Each interaction component is represented as a localized region in the image(s). We propose to model gestures based on the streams of extracted visual cues in the local space, thus avoiding the problem of globally tracking the user(s). Efficient algorithms are proposed to capture hand shape and motion. We investigate different learning and modeling techniques including neural networks, Hidden Markov Models and Bayesian classifiers to recognize postures and dynamic gestures.
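
To make the idea of a localized cue stream concrete, the following is a minimal sketch and not the VICs implementation. It assumes grayscale frames stored as nested lists and uses simple frame differencing as the visual cue; the names `motion_cue`, `cue_stream`, the `(top, left, height, width)` region layout, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Frame = List[List[int]]             # grayscale image as rows of intensities (assumed format)
Region = Tuple[int, int, int, int]  # (top, left, height, width), a hypothetical layout


def motion_cue(prev: Frame, curr: Frame, region: Region, threshold: int = 20) -> float:
    """Fraction of pixels inside `region` whose intensity changed by more than `threshold`."""
    top, left, height, width = region
    changed = 0
    for r in range(top, top + height):
        for c in range(left, left + width):
            if abs(curr[r][c] - prev[r][c]) > threshold:
                changed += 1
    return changed / (height * width)


def cue_stream(frames: List[Frame], region: Region) -> List[float]:
    """One cue value per consecutive frame pair, computed only inside the region."""
    return [motion_cue(a, b, region) for a, b in zip(frames, frames[1:])]


# Tiny example: a 4x4 image with a 2x2 interaction region in the middle.
blank = [[0] * 4 for _ in range(4)]
moved = [row[:] for row in blank]
moved[1][1] = 255  # change inside the region
moved[0][0] = 255  # change outside the region, never examined
widget = (1, 1, 2, 2)
stream = cue_stream([blank, moved, moved], widget)
```

Because only pixels inside the region are examined, activity elsewhere in the image does not affect the stream, which is the sense in which a local approach sidesteps global tracking of the user. A recognizer such as a Hidden Markov Model could then consume a stream like this one.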

Since gestures are in essence a language, with each low-level gesture analogous to a word in a conventional language, a high-level gesture language model is essential for robust and efficient recognition of continuous gestures. To that end, we have constructed a high-level language model that integrates a set of low-level gestures into a single, coherent probabilistic framework. In the language model, every low-level gesture is called a gesture word, and a composite gesture is a sequence of gesture words, which are contextually and temporally constrained. We train the model via supervised or unsupervised learning techniques. A greedy inference algorithm is proposed to allow efficient online processing of continuous gestures. We have designed a large-scale gesture experiment that involves sixteen subjects and fourteen gestures. The experiment shows the robustness and efficacy of our system in modeling a relatively large gesture vocabulary involving many users. Most of the users also consider our gesture system comparable to, or more natural and comfortable than, traditional user interfaces with a mouse.
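
As an illustration of how a greedy pass over gesture words might work, the sketch below assumes the language model takes the form of a bigram table over gesture words and that a low-level recognizer supplies one likelihood per word for each segment. Both assumptions, the function name `greedy_decode`, the gesture words, and all probabilities are hypothetical and not taken from the system described above.

```python
from typing import Dict, List


def greedy_decode(
    likelihoods: List[Dict[str, float]],
    bigram: Dict[str, Dict[str, float]],
    start: str = "<s>",
) -> List[str]:
    """Pick, segment by segment, the word maximizing P(word | previous) * P(observation | word).

    The choice for each segment is committed immediately and never revisited,
    which is what makes the procedure greedy and usable online.
    """
    sequence: List[str] = []
    prev = start
    for scores in likelihoods:
        best = max(scores, key=lambda w: bigram.get(prev, {}).get(w, 0.0) * scores[w])
        sequence.append(best)
        prev = best
    return sequence


# Hypothetical three-word vocabulary and made-up probabilities.
bigram = {
    "<s>":   {"reach": 0.8, "grab": 0.1, "release": 0.1},
    "reach": {"reach": 0.1, "grab": 0.7, "release": 0.2},
    "grab":  {"reach": 0.1, "grab": 0.1, "release": 0.8},
}
observations = [
    {"reach": 0.6, "grab": 0.3, "release": 0.1},
    {"reach": 0.1, "grab": 0.4, "release": 0.5},  # raw likelihood favors "release"
    {"reach": 0.1, "grab": 0.4, "release": 0.5},
]
decoded = greedy_decode(observations, bigram)
```

In the second segment the recognizer alone would prefer "release", but the context from the preceding "reach" tips the decision to "grab". The trade-off of a greedy pass is that an early mistake is never corrected, in exchange for constant work per segment.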