USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-410

“Contextual Modeling of Audio Signals Toward Information Retrieval”

by Samuel Kim

December 2010

Emotion has intrigued researchers for generations. This fascination has permeated the engineering community, motivating the development of affective computational models for classification. However, human emotion remains notoriously difficult to interpret, both because of the mismatch between the emotional cue generation (the speaker) and perception (the observer) processes and because of the presence of complex emotions, expressions that contain shades of multiple affective classes. Proper representations of emotion would ameliorate this problem by introducing multidimensional characterizations of the data that permit the quantification and description of the varied affective components of each utterance. Currently, the mathematical representation of emotion remains underexplored.

Research in emotion expression and perception provides a complex, human-centered platform for integrating machine learning techniques and multimodal signal processing toward the design of interpretable data representations. The focus of this dissertation is to provide a computational description of human emotion perception and to combine this knowledge with the information gleaned from emotion classification experiments, developing a mathematical characterization capable of interpreting naturalistic expressions of emotion using a data representation method called Emotion Profiles.

The analysis of human emotion perception provides an understanding of how humans integrate audio and video information during emotional presentations. The goals of this work are to determine how audio and video information interact during the human emotional evaluation process and to identify a subset of features that contribute to specific types of emotion perception. We identify perceptually relevant feature modulations and multimodal feature integration trends using statistical analyses over the evaluator reports.
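As a rough illustration of this kind of analysis (not the dissertation's actual pipeline), the sketch below tests whether a single audio feature modulation shifts evaluators' reported ratings. The file name, column names, and the choice of a one-way ANOVA are assumptions made for the example.

    # Hypothetical sketch: test whether an audio feature modulation (e.g., raised
    # vs. neutral pitch) changes evaluators' reported valence. Column names and
    # the ANOVA choice are illustrative assumptions, not the study's exact method.
    import pandas as pd
    from scipy.stats import f_oneway

    reports = pd.read_csv("evaluator_reports.csv")  # hypothetical file

    # Group the perceived-valence ratings by the pitch-modulation condition.
    groups = [g["perceived_valence"].values
              for _, g in reports.groupby("pitch_condition")]

    # One-way ANOVA: a small p-value suggests the modulation is perceptually relevant.
    f_stat, p_value = f_oneway(*groups)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")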

The trends in evaluator reports are analyzed using emotion classification. We study evaluator performance using a combination of Hidden Markov Model (HMM) and Naïve Bayes (NB) classification. The HMM classification is used to predict individual evaluators' emotional assessments. The NB classification provides an estimate of the consistency of an evaluator's mental model of emotion. We demonstrate that reports from evaluators with higher levels of estimated consistency are predicted more accurately than reports from less consistent evaluators.
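A minimal sketch of how an evaluator-specific HMM classifier and an NB consistency estimate might be wired together is given below. It assumes the hmmlearn and scikit-learn libraries and hypothetical per-utterance feature sequences; it is not the exact configuration used in the dissertation.

    # Sketch, assuming hmmlearn and scikit-learn: one HMM per emotion class is
    # trained on the utterances a single evaluator labeled with that emotion,
    # and a test utterance is assigned the class whose HMM scores it highest.
    # Cross-validated Naive Bayes accuracy serves as a rough consistency
    # estimate for that evaluator. All data shapes here are hypothetical.
    import numpy as np
    from hmmlearn.hmm import GaussianHMM
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    def train_evaluator_hmms(sequences, labels, emotions, n_states=3):
        """Fit one Gaussian HMM per emotion on that evaluator's labeled utterances.
        sequences: list of (frames x features) arrays; labels: per-utterance labels."""
        models = {}
        for emo in emotions:
            seqs = [s for s, l in zip(sequences, labels) if l == emo]
            X = np.vstack(seqs)
            lengths = [len(s) for s in seqs]
            m = GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=20)
            m.fit(X, lengths)
            models[emo] = m
        return models

    def predict_emotion(models, sequence):
        """Pick the emotion whose HMM gives the test utterance the highest log-likelihood."""
        return max(models, key=lambda emo: models[emo].score(sequence))

    def consistency_estimate(utterance_features, labels):
        """Cross-validated Naive Bayes accuracy on utterance-level features,
        used as a proxy for how consistent the evaluator's labeling is."""
        return cross_val_score(GaussianNB(), utterance_features, labels, cv=5).mean()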

The insights gleaned from the emotion perception and classification studies are aggregated to develop a new emotional representation scheme called Emotion Profiles (EPs). The design of the EPs is predicated on the knowledge that naturalistic emotion expressions can be approximately described using one or more labels from a set of basic emotions. EPs are a quantitative measure expressing the degree of presence or absence of a set of basic emotions within an expression. They avoid the need for a hard-labeled assignment by instead providing a method for describing the shades of emotion present in an utterance. These profiles can be used to determine a most likely assignment for an utterance, to map the evolution of the emotional tenor of an interaction, or to interpret utterances that have multiple affective components. The EP technique accurately identifies the emotion of utterances with definable ground truths (emotions with an evaluator consensus) and can interpret the affective content of utterances with ambiguous emotional content (no evaluator consensus), which are typically discarded during classification tasks.
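The sketch below illustrates one way an emotion profile might be formed and read out, assuming a set of per-emotion confidence scores (e.g., from one binary classifier per basic emotion). The scoring, normalization, and ambiguity threshold are illustrative assumptions rather than the dissertation's exact formulation.

    # Illustrative sketch of an emotion-profile readout. The confidence scores
    # are assumed to come from some per-emotion classifier (one binary model per
    # basic emotion); how they are produced and normalized is an assumption here.
    import numpy as np

    BASIC_EMOTIONS = ["angry", "happy", "sad", "neutral"]

    def emotion_profile(confidences):
        """Map raw per-emotion confidence scores to a profile over basic emotions."""
        scores = np.array([confidences[e] for e in BASIC_EMOTIONS], dtype=float)
        return dict(zip(BASIC_EMOTIONS, scores))

    def most_likely_label(profile, margin=0.1):
        """Return a hard label when one component clearly dominates; otherwise
        report the utterance as ambiguous and keep the full profile."""
        ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
        (top, top_score), (_, second_score) = ranked[0], ranked[1]
        if top_score - second_score >= margin:
            return top
        return "ambiguous"

    # Example: an utterance with both angry and sad shading keeps its profile
    # instead of being forced into a single class.
    profile = emotion_profile({"angry": 0.62, "happy": 0.05, "sad": 0.55, "neutral": 0.20})
    print(profile, most_likely_label(profile))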

The algorithms and statistical analyses presented in this work are tested using two databases. The first database combines synthetic facial cues with natural human vocal cues. The affective content of the two modalities is either matched (congruent presentation) or mismatched (conflicting presentation). The congruent and conflicting presentations are used to assess the affective perceptual relevance of both the individual modalities and the specific feature modulations of those modalities. The second database is an audio-visual and motion-capture database collected at the University of Southern California, the USC IEMOCAP database. This database is used to assess the efficacy of the EP technique for quantifying the emotional content of an utterance. The IEMOCAP database is also used in the classification studies to determine how well individual evaluators can be modeled and how accurately discrete emotional labels (e.g., angry, happy, sad, neutral) can be predicted given audio and motion-capture feature information.

The future directions of this work include the unification of the emotion perception, classification, and quantification studies. The classification framework will be extended to include evaluator-specific features (an extension of the emotion perception studies) and temporal features based on EP estimates. This unification will produce a classification framework that is not only more effective than previous versions, but is also able to adapt to specific user emotion production and perception styles.

Download the report in PDF format: USC-SIPI-410.pdf (0.9 MB)