Vision for multimodal conversational interfaces and perceptive devices

Trevor Darrell

A classic goal of human-computer interaction is the conversational computer, which can interact freely with users through natural dialog. Advances in speech and language processing have made single-user conversational systems with a close-talking microphone almost commonplace, but when multiple speakers or noisy conditions are encountered, more information is needed. Visual cues can make conversational interfaces feasible in these environments, providing robust cues for speech and critical information about turn-taking, intent, and physical reference. In this talk I'll review recent research in our lab on computer vision techniques that can provide these cues. I'll describe work in progress on estimating pose-invariant mouth features for visual speechreading; head tracking for inferring turn-taking cues, agreement gestures, and conversational intent; and body pose estimation for resolving object deixis. Finally, I'll describe our research on using vision to create perceptive mobile devices, which can use image appearance to recognize locations or objects in the user's environment and present relevant information resources from the web or from application-specific databases. Together, these systems for visually aware conversation and perceptive devices allow us to interact with complex computational resources using natural and intuitive gestures.