USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-426

“Emotional speech production: From data to computational models and applications”

by Jangwon Kim

December 2015

Speech is one of the most common and natural means of communication, conveying a variety of information, both linguistic and paralinguistic. The paralinguistic information is crucial in verbal communication, because rich meaning (e.g., nuance, tone) in spoken language, as well as the states (e.g., emotion, health, gender) and traits (e.g., personality) of the speaker, is encoded and decoded through paralinguistic factors. These two kinds of information (linguistic and paralinguistic) are encoded into the speech sound jointly and simultaneously by the actions of the speech articulators. Hence, a better understanding of the production aspects of speech can shed further light on the information encoding (and decoding) mechanisms of verbal communication. This dissertation seeks a better understanding of the articulatory control strategies underlying this multi-layered information encoding process, and develops computational models of the expressive speech production system. The report describes my contributions along the research pathway of emotional speech production, including data collection, data processing, analysis, computational modeling, and applications.

The first contribution is the development of algorithms and software for data processing: (i) robust parameterization of magnetic resonance images and (ii) co-registration of real-time Magnetic Resonance Imaging (rtMRI) data and ElectroMagnetic Articulography (EMA) data. These algorithms allow automatic and robust extraction of the articulatory information of interest from speech production data.
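As a concrete illustration of the co-registration step, the sketch below shows one common way such an alignment can be set up: a similarity (Procrustes-style) transform estimated from corresponding landmarks for spatial registration, plus linear resampling to a common frame rate for temporal registration. The specific method, landmark choice, and sampling rates here are assumptions for illustration only, not the procedure developed in the report.

```python
# Illustrative sketch (not the report's algorithm): spatial alignment of
# corresponding landmarks from EMA and rtMRI coordinate frames, plus linear
# resampling of articulator tracks to a common frame rate.
import numpy as np


def rigid_align(src, dst):
    """Estimate rotation R, scale s, and translation t such that
    dst_i ~ s * (R @ src_i) + t, via a least-squares (Procrustes) fit.
    src, dst: (N, 2) arrays of corresponding landmark coordinates."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    S, D = src - mu_s, dst - mu_d
    U, sigma, Vt = np.linalg.svd(D.T @ S)
    R = U @ Vt
    if np.linalg.det(R) < 0:        # guard against an accidental reflection
        U[:, -1] *= -1
        R = U @ Vt
    s = sigma.sum() / (S ** 2).sum()
    t = mu_d - s * (R @ mu_s)
    return R, s, t


def resample(track, fs_in, fs_out):
    """Linearly resample a (T, d) articulator track from fs_in to fs_out Hz,
    e.g., to bring rtMRI-derived and EMA trajectories to a common rate."""
    t_in = np.arange(track.shape[0]) / fs_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    return np.stack([np.interp(t_out, t_in, track[:, d])
                     for d in range(track.shape[1])], axis=1)


# Toy check: recover a known similarity transform between hypothetical
# EMA pellet positions and rtMRI-derived landmark positions.
rng = np.random.default_rng(0)
theta = 0.2
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
ema = rng.random((5, 2))                        # placeholder EMA landmarks (cm)
mri = 1.2 * ema @ rot.T + np.array([0.5, -0.3])
R, s, t = rigid_align(ema, mri)
aligned = s * ema @ R.T + t
print(np.allclose(aligned, mri))                # True if the fit is recovered
```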

The second contribution is the collection (and release) of the USC-EMO-MRI corpus: a novel multimodal database of emotional speech production recorded using rtMRI technology. This corpus is designed as a resource for studying inter- and intra-speaker variability in both the articulatory and acoustic signals of emotional speech.

The third contribution comprises novel findings and insights on emotional speech production. The specific sub-topics are (i) the vocal tract shaping of emotional speech, (ii) the articulatory variability of emotional speech as a function of the linguistic criticality of the articulator, and (iii) invariant properties and variation patterns in the speech planning and execution components of emotional speech. For (i), this dissertation investigates inter- and intra-speaker variability using the USC-EMO-MRI corpus. For (ii), it reports experimental results suggesting that the large variability of linguistically less critical articulators is an important source of emotional information, and it offers novel insight into the relationship between that variability and the control of the corresponding critical articulators, based on computational modeling and simulation experiments. For (iii), it offers novel findings on invariant properties and variation patterns from the perspective of the Converter/Distributor model.

Finally, the fourth contribution is the development of a computational framework for predicting rich articulatory information (anatomical point tracking, vocal tract shaping, morphology) from the speech waveform. The articulatory information is extracted from production data recorded with multiple acquisition modalities (rtMRI and EMA) after registering the data across modalities. A deep learning model is used to learn the acoustic-to-articulatory (inverse) mapping. The benefits of using rich articulatory parameters for the inverse mapping and for an emotion classification application are discussed.
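The abstract does not specify the model architecture, so the sketch below is a hedged illustration of one plausible framing of such an acoustic-to-articulatory mapping: a bidirectional recurrent network regressing per-frame articulatory parameters from acoustic features, trained with a mean-squared-error objective. The architecture, feature dimensions, and training details are assumptions, not the model used in the report.

```python
# Illustrative sketch (assumed architecture, not the dissertation's exact model):
# a bidirectional LSTM mapping acoustic frames (e.g., MFCCs) to a rich
# articulatory parameter vector (e.g., EMA pellet coordinates plus rtMRI-derived
# vocal tract shape parameters), trained with a mean-squared-error loss.
import torch
import torch.nn as nn


class AcousticToArticulatory(nn.Module):
    def __init__(self, n_acoustic=39, n_articulatory=24, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(n_acoustic, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_articulatory)

    def forward(self, x):                  # x: (batch, frames, n_acoustic)
        h, _ = self.rnn(x)
        return self.out(h)                 # (batch, frames, n_articulatory)


model = AcousticToArticulatory()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Toy batch with random values standing in for parallel acoustic/articulatory data.
acoustic = torch.randn(8, 200, 39)         # 8 utterances, 200 frames, 39-dim features
articulatory = torch.randn(8, 200, 24)     # matching articulatory trajectories

optimizer.zero_grad()
pred = model(acoustic)
loss = criterion(pred, articulatory)
loss.backward()
optimizer.step()
print(f"training loss on toy batch: {loss.item():.3f}")
```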

To download the report in PDF format, click here: USC-SIPI-426.pdf (2.5 MB)