The USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-450

“Behavior Understanding from Speech under Constrained Conditions: Exploring Sparse Networks, Transfer and Unsupervised Learning”

by Haoqi Li

December 2020

The expression and perception of human behavioral signals play an important role in human interactions and social relationships. However, the computational study of human behavior from speech remains challenging: it is difficult to find generalizable and representative features in noisy, high-dimensional data, especially when the data are limited and annotated coarsely and subjectively. This dissertation focuses on the computational study of human behavior via deep learning techniques.

Deep Neural Networks (DNNs) have shown promise in a wide range of machine learning tasks, but their application to Behavioral Signal Processing (BSP) tasks has been constrained by the limited quantity of data. In the first part of this dissertation, we propose a Sparsely-Connected and Disjointly-Trained DNN (SD-DNN) framework to deal with limited data. First, we break the acoustic feature set into subsets and train multiple distinct classifiers. Then, the hidden layers of these classifiers become parts of a deeper network that integrates all feature streams. The overall system allows for full connectivity while limiting the number of parameters trained at any one time, making convergence possible even with limited data. The results demonstrate the benefits of this approach in behavior classification accuracy.
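To make the SD-DNN idea concrete, the following is a minimal PyTorch sketch of the two stages: small classifiers trained disjointly on individual acoustic feature subsets, whose hidden layers are then fused into a deeper network. Layer sizes, the number of feature subsets, and the fusion depth are illustrative assumptions, not the report's exact configuration.

```python
import torch
import torch.nn as nn

class SubsetClassifier(nn.Module):
    """Small classifier trained disjointly on one acoustic feature subset."""
    def __init__(self, subset_dim, hidden_dim, num_classes):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(subset_dim, hidden_dim), nn.ReLU())
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        h = self.hidden(x)
        return self.head(h), h  # logits for the disjoint stage, hidden features for fusion

class SDDNN(nn.Module):
    """Integrates the hidden layers of the pre-trained subset classifiers."""
    def __init__(self, subset_classifiers, hidden_dim, num_classes):
        super().__init__()
        self.branches = nn.ModuleList(subset_classifiers)
        fused_dim = hidden_dim * len(subset_classifiers)
        self.fusion = nn.Sequential(
            nn.Linear(fused_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, feature_subsets):
        # feature_subsets: one tensor per acoustic feature subset
        hiddens = [branch(x)[1] for branch, x in zip(self.branches, feature_subsets)]
        return self.fusion(torch.cat(hiddens, dim=-1))
```

In use, each SubsetClassifier would first be trained on its own feature subset, keeping the number of simultaneously trained parameters small, after which the fusion layers are trained on the concatenated hidden activations.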

An important cue for behavior analysis is the dynamic change of emotions during a conversation. In the second part of this dissertation, we employ deep transfer learning to analyze the inferential capacity of emotions for behavior and the importance of emotional context. We first train a network to quantify emotions from acoustic signals and then use information from the emotion recognition network as features for behavior recognition. We treat this emotion-related information as behavioral primitives and train higher-level layers toward behavior quantification. Through our analysis, we find that emotion-related information is an important cue for behavior recognition. Further, we investigate the importance of emotional context in the expression of behavior by constraining (or not) the neural networks' contextual view of the data. This analysis demonstrates that the sequence of emotions is critical in behavior expression.
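The transfer-learning setup can be sketched as follows: a pre-trained emotion network provides per-frame emotion-related representations (the behavioral primitives), and higher-level layers are trained on top of them for behavior quantification. The module names, layer sizes, and the choice of a GRU for the contextual layers are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class EmotionNet(nn.Module):
    """Pre-trained on emotion labels; its hidden layer serves as the behavioral primitive."""
    def __init__(self, feat_dim=88, hidden_dim=128, num_emotions=4):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, x):
        z = self.encoder(x)            # emotion-related representation
        return self.classifier(z), z

class BehaviorNet(nn.Module):
    """Higher-level layers trained for behavior on top of frozen emotion primitives."""
    def __init__(self, emotion_net, hidden_dim=128, num_behaviors=2):
        super().__init__()
        self.emotion_net = emotion_net
        for p in self.emotion_net.parameters():   # keep the emotion network fixed
            p.requires_grad = False
        self.context = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_behaviors)

    def forward(self, frames):                    # frames: (batch, time, feat_dim)
        _, primitives = self.emotion_net(frames)  # (batch, time, hidden_dim)
        _, last = self.context(primitives)        # summarize the emotional context
        return self.head(last.squeeze(0))
```

Restricting the temporal span available to the contextual layers corresponds to the constrained-context condition described above, which probes how much the sequence of emotions contributes to behavior prediction.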

The results suggest that it is feasible to use emotion-related speech representations for behavior quantification and understanding. However, representation learning for speech emotion recognition is itself challenging: there is considerable variability in the input speech signals, in humans' subjective perception of those signals, and in the emotion labels themselves. In the third part of this dissertation, we propose a machine learning framework that obtains speech emotion representations by limiting the effect of speaker variability in the speech signals. Specifically, we propose to disentangle speaker characteristics from emotion through an adversarial training network in order to better represent emotion. Our method combines the gradient reversal technique with an entropy loss function to remove speaker information. We show that our method improves speech emotion classification and improves generalization to unseen speakers.
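A minimal sketch of the two adversarial ingredients named above, a gradient reversal layer and an entropy objective on the speaker posterior, is given below. The loss weights, the alternating update scheme, and the way the terms are combined are illustrative assumptions; the dissertation's exact recipe may differ.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in backward."""
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None

def grad_reverse(x, lamb=1.0):
    return GradReverse.apply(x, lamb)

def negative_entropy(logits):
    """Negative entropy of the softmax posterior; minimizing it (with respect
    to the encoder) pushes the speaker posterior toward uniform."""
    log_p = F.log_softmax(logits, dim=-1)
    return (log_p.exp() * log_p).sum(dim=-1).mean()

def encoder_losses(encoder, emotion_head, speaker_head, x, emo_y, spk_y,
                   lamb=1.0, alpha=1.0):
    """Objective for the optimizer that updates the encoder and emotion head.
    The speaker branch is reached through gradient reversal, and the entropy
    term asks the encoder to make the speaker posterior uninformative."""
    z = encoder(x)
    emotion_loss = F.cross_entropy(emotion_head(z), emo_y)
    adversarial_ce = F.cross_entropy(speaker_head(grad_reverse(z, lamb)), spk_y)
    entropy_loss = negative_entropy(speaker_head(z))
    return emotion_loss + adversarial_ce + alpha * entropy_loss

def speaker_head_loss(encoder, speaker_head, x, spk_y):
    """Objective for a separate optimizer that updates only the speaker head;
    the embedding is detached so this step leaves the encoder untouched."""
    return F.cross_entropy(speaker_head(encoder(x).detach()), spk_y)
```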

Although we use a range of techniques to deal with limited resources, domain-specific data, and entanglement of information, supervised behavioral modeling still relies largely on domain-specific construct definitions and corresponding manually annotated data, which makes generalization across domains challenging. In the last part of this dissertation, we exploit the stationary properties of human behavior within interactions and present a representation learning method that captures behavioral information from speech in an unsupervised way. We hypothesize that nearby segments of speech share the same behavioral context and hence map onto similar underlying behavioral representations. We present an encoder-decoder based Deep Contextualized Network (DCN) as well as a Triplet-Enhanced DCN (TE-DCN) framework to capture behavioral context and derive a manifold representation in which speech frames with similar behaviors lie closer together while frames of different behaviors maintain larger distances. The models are trained on movie audio data and validated on diverse domains, including a couples therapy corpus and other publicly available data (e.g., stand-up comedy). The encouraging results demonstrate the feasibility of unsupervised learning for cross-domain behavioral modeling.
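The triplet idea behind TE-DCN can be sketched as follows: segments that are close in time serve as anchor-positive pairs under the assumption that they share behavioral context, while a temporally distant segment (or one from another recording) serves as the negative. The encoder architecture, embedding size, and margin below are illustrative assumptions, and the encoder-decoder reconstruction path of the DCN is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentEncoder(nn.Module):
    """Maps a segment of speech features to a point on the behavioral manifold."""
    def __init__(self, feat_dim=40, emb_dim=64):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, emb_dim, batch_first=True)

    def forward(self, segment):                 # segment: (batch, time, feat_dim)
        _, h = self.rnn(segment)
        return F.normalize(h.squeeze(0), dim=-1)

triplet_loss = nn.TripletMarginLoss(margin=0.5)

def te_dcn_triplet_loss(encoder, anchor_seg, nearby_seg, distant_seg):
    # Nearby segments are assumed to share behavioral context (positive pair);
    # a distant segment, or one from another recording, serves as the negative.
    a = encoder(anchor_seg)
    p = encoder(nearby_seg)
    n = encoder(distant_seg)
    return triplet_loss(a, p, n)
```

Training with this objective pulls embeddings of contextually nearby speech closer together and pushes apart embeddings of behaviorally unrelated speech, yielding the manifold structure described above.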

To download the report in PDF format, click here: USC-SIPI-450.pdf (1.3Mb)