USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-413

“Automatic Quantification and Prediction of Human Subjective Judgments in Behavioral Signal Processing”

by Matthew P. Black

February 2012

Human judgments of human behavior are an important part of interpersonal interactions and of many assessment and intervention designs. While humans have evolved to be naturally adept at processing behavioral information, challenges remain. Human descriptions of behavior are often qualitative, and there is variability across people's judgments due to the subjective nature of the judgment process.

Technology can help humans process behavioral data in a number of ways. Quantitative descriptors that represent aspects of human behavior can be extracted from objective signals (e.g., audio, video) in consistent and repeatable ways. Many emerging engineering pursuits center on modeling human behavior, but much of this research focuses on specific human actions (e.g., head nods) during acted or non-spontaneous scenarios. Behavioral signal processing involves the development of computational methods that model human behavior in real-life scenarios. In this thesis, we automatically quantify and predict human subjective judgments of human behavior from speech signals in the context of societally significant domain applications (education, family studies, health), where human observers play a critical role.
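As a minimal illustration of what such quantitative descriptors might look like for an audio signal, the sketch below computes simple frame-level pitch and energy statistics from a speech recording. The choice of the librosa library, the file path, and the specific descriptors are assumptions for illustration only; they are not the feature set used in the thesis.

    import numpy as np
    import librosa  # assumed third-party library for audio analysis

    def extract_descriptors(wav_path):
        """Compute a few consistent, repeatable acoustic descriptors from a speech file."""
        y, sr = librosa.load(wav_path, sr=16000)       # load audio, resampled to 16 kHz
        f0 = librosa.yin(y, fmin=75, fmax=500, sr=sr)  # frame-level pitch estimates (Hz)
        rms = librosa.feature.rms(y=y)[0]              # frame-level energy (RMS)
        # Summarize the frame-level contours with utterance-level statistics.
        return {
            "f0_mean": float(np.nanmean(f0)),
            "f0_std": float(np.nanstd(f0)),
            "rms_mean": float(rms.mean()),
            "rms_std": float(rms.std()),
        }

    # Example usage (hypothetical file name):
    # print(extract_descriptors("reading_session.wav"))

Descriptors of this kind, summarized over an utterance or session, are one way a signal-processing pipeline can produce repeatable behavioral measurements that human observers would otherwise describe only qualitatively.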

There are many technological challenges in quantifying and predicting human subjective judgments of human behavior. These include modeling several sources of variability, stemming both from the human behavior itself (heterogeneity) and from the human evaluators. There is a need to extract robust, generalizable features that capture both the human behavior and the perceptual cues that human evaluators rely on. In addition, information may be spread across multiple modalities and cues, and it is not always clear how humans weight them when making their judgments. Many relevant human judgments are "gist-like," based on a large amount of behavioral data. Thus, modeling the data at multiple granularities is important, since some temporal regions may be more relevant than others, and a particular cue's importance may vary as a function of time. Finally, since we are analyzing real data from real-life scenarios, the human behavior can be complex and the data can be non-ideal (e.g., noisy).

For this thesis, we focused on concrete problem domains that highlight specific aspects of these technological challenges: literacy assessment, couples therapy research, and autism diagnosis. In the literacy assessment domain, we show that incorporating human-inspired information into the computational framework enables accurate modeling of evaluators' perception of children's overall reading ability for one specific reading task. We fuse features that represent multiple aspects of the human behavior and robustly emulate evaluators' subjective observational processes by learning from individual and multiple evaluators' judgments. We also exploit the fact that evaluators' level of agreement varies significantly depending on the child being judged by incorporating this source of evaluator variability into the modeling framework. In the couples therapy domain, we analyze a large corpus of spontaneous dyadic interactions between married couples and show that we can predict six relevant high-level observational judgments (e.g., level of acceptance, global negative affect) using speaker-dependent acoustic speech features. Furthermore, we demonstrate one method for fusing automatically derived speech and language information for improved classification of spouses' level of blame (high vs. low). Finally, we discuss our effort in collecting a multimodal corpus of child-psychologist interactions, recorded in the context of a social interaction used by psychologists for a research-level diagnosis of autism spectrum disorders. We highlight initial work with this corpus and discuss future experiments for the quantification of psychologists' clinical judgments of atypical social behavior (e.g., atypical prosody).
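To make the feature-level fusion idea concrete, the sketch below concatenates hypothetical acoustic and lexical feature matrices and evaluates a binary classifier for a high-vs.-low label, using scikit-learn and random placeholder data. The feature dimensions, classifier choice, and data are assumptions for illustration; they do not reproduce the thesis's actual fusion system or results.

    import numpy as np
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import cross_val_score

    # Hypothetical data: one row per spouse per session (placeholder values only).
    rng = np.random.default_rng(0)
    X_acoustic = rng.standard_normal((100, 20))   # e.g., prosodic statistics
    X_lexical = rng.standard_normal((100, 15))    # e.g., word-based scores
    y = rng.integers(0, 2, size=100)              # 1 = high blame, 0 = low blame

    # Feature-level fusion: concatenate the two modality representations.
    X_fused = np.hstack([X_acoustic, X_lexical])

    # Standardize features and evaluate a linear classifier with cross-validation.
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0))
    scores = cross_val_score(clf, X_fused, y, cv=5)
    print("Mean cross-validated accuracy: %.2f" % scores.mean())

Concatenating modality-specific features before classification is only one of several fusion strategies; decision-level fusion, where separate classifiers are trained per modality and their outputs combined, is a common alternative.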

This thesis develops a quantitative, automated framework that emulates human observational processes to describe human behavior from speech signals. We hope it makes impactful technological contributions to the modeling of complex human subjective processes. This work represents a significant step toward a shift in engineering from modeling and recognizing more objective human behaviors (e.g., speech recognition) to quantifying more subtle and abstract ones, a central theme of the emerging area of behavioral signal processing.

To download the report in PDF format, click here: USC-SIPI-413.pdf (1.6 MB)