USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical Engineering
University of Southern California

Technical Report USC-SIPI-411

“Visualizing and Modeling Vocal Production Dynamics”

by Erik Bresch

May 2011

Understanding human speech production is of fundamental importance for basic and applied research in human communication: from speech science and linguistics to clinical and engineering development. While vocal tract posture and movement can be investigated using a host of techniques, the newly developed real-time magnetic resonance imaging (RT-MRI) technology has a particular advantage: it produces complete views of the entire moving vocal tract, including the pharyngeal structures, in a non-invasive manner. RT-MRI promises a new means for visualizing and quantifying the spatio-temporal articulatory details of speech production, and it also allows for exploring novel data-intensive, machine-learning-based computational approaches to speech production modeling. The central goal of this thesis is to develop new technological capabilities and to use these novel tools for studying human vocal tract shaping during speech production. The research, which is inherently interdisciplinary, combines technological elements (designing engineering methods and systems to acquire and process novel speech production data), experimental elements (designing linguistically meaningful studies to gather useful insights), and computational elements (explaining the observed data and designing predictive capabilities).

In Chapter 1, which was in part published in [6], the use of RT-MRI as an emerging technique for speech production research is motivated. An outline is provided of the biomedical image acquisition and image processing challenges, potentials, and opportunities arising with the use of RT-MRI.

The second part, Chapter 2, describes novel hardware technology and signal processing algorithms that were developed to facilitate synchronous speech audio recordings during RT-MRI scans. Here, the main problem lies in the loud acoustic noise produced by the MRI scanner during acquisition. The proposed solution combines digital synchronization hardware with an adaptive signal processing algorithm, allowing the acquisition of speech audio of satisfactory quality for further analysis. This enables joint speech-image data acquisition, which in turn allows for joint modeling of articulatory-acoustic phenomena. Most of this chapter was published in [9].
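As a rough illustration of the adaptive-filtering idea behind such noise cancellation (the abstract does not specify the exact algorithm, so the normalized-LMS variant, function name, and parameters below are assumptions for illustration only): a filtered copy of a noise-only reference signal is subtracted from the microphone signal, and the filter adapts so that the residual, i.e. the speech estimate, is minimized.

```python
import numpy as np

def nlms_cancel(primary, reference, n_taps=16, mu=0.1, eps=1e-8):
    """Normalized-LMS adaptive noise canceller (illustrative sketch).

    primary   : microphone signal = speech + (acoustically filtered) scanner noise
    reference : noise-only reference signal, correlated with the noise in `primary`
    Returns the error signal of the adaptive filter, which is the speech estimate.
    """
    w = np.zeros(n_taps)          # adaptive FIR filter weights
    buf = np.zeros(n_taps)        # delay line of recent reference samples
    out = np.zeros(len(primary))
    for n in range(len(primary)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        noise_est = w @ buf                       # filter's estimate of the noise
        e = primary[n] - noise_est                # residual = speech estimate
        w += mu * e * buf / (buf @ buf + eps)     # normalized LMS weight update
        out[n] = e
    return out
```

Because the speech is uncorrelated with the reference noise, the filter converges toward the acoustic path from the reference to the microphone, and the residual approaches the clean speech.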

Subsequently, Chapter 3 addresses the extraction of relevant geometrical features from the vast stream of magnetic resonance (MR) images. In the case of the commonly used midsagittal view of the human vocal tract, the geometrical features of interest are the locations of the articulators, and hence the underlying image processing problem to be solved is that of edge detection. Further complications arise from the poor MR image quality, which is compromised by the inherent trade-off between spatial resolution, temporal resolution, and signal-to-noise ratio. A solution to the edge detection problem is devised using a deformable geometrical model of the human vocal tract. Mathematically, the proposed procedure relies on designing alternate gradient vector flows for the solution of a non-linear least-squares optimization problem. With the new method, the human vocal tract outline can be traced automatically. These findings were published in [7].
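The thesis's actual procedure uses anatomy-specific deformable templates and gradient vector flows; as a generic illustration of the underlying idea of deformable-contour fitting, a contour can be evolved by gradient descent on an energy that balances an internal smoothness term against an external image force. The update rule and parameter names below are a textbook-style sketch, not the method of Chapter 3.

```python
import numpy as np

def snake_step(pts, ext_force, alpha=0.1, gamma=0.2):
    """One gradient-descent update of a closed deformable contour.

    pts       : (N, 2) array of contour points, ordered around the contour
    ext_force : callable mapping (N, 2) points to (N, 2) external forces
                (e.g. derived from an image edge map)
    alpha     : weight of the internal smoothness term
    gamma     : step size
    The internal force is the discrete second difference along the contour,
    which penalizes curvature and keeps the contour smooth.
    """
    internal = np.roll(pts, 1, axis=0) - 2 * pts + np.roll(pts, -1, axis=0)
    return pts + gamma * (alpha * internal + ext_force(pts))
```

Iterating this step moves the contour until the smoothing force and the external image force balance, which is the equilibrium a least-squares contour-fitting formulation seeks.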

Chapters 4 and 5 describe two vocal production studies using articulatory vocal tract data. The first study investigates the static vocal tract shaping of five soprano singers during the sung production of vowel sounds, and it considers the much-researched theory of resonance tuning. The study successfully validates the usefulness of RT-MRI data and the data processing methods of Chapters 2 and 3. The second study focuses on the tongue shaping of English sibilant fricative sounds and reproduces previously known findings with the new RT-MRI modality. The findings of these two studies have been published in [8, 10].

The last part of this thesis, contained in Chapter 6, proposes a statistical framework for the modeling of articulatory speech data. The main focus lies on the coupled hidden Markov model (CHMM) as a candidate system for capturing the dynamics of the multi-dimensional vocal tract shaping process. It is demonstrated that this methodology can capture, in a data-driven way, the well-known timing signatures of the velum-oral coordination of English nasal sounds in word onset and coda positions. The content of this chapter has been published in [5]. The thesis concludes with a brief summary of the contributions and a discussion of possible future research directions in Chapter 7.
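In a CHMM, each articulator is modeled by its own Markov chain, but every chain's transition probability is conditioned on the previous states of the coupled chains as well as its own, which is what lets the model represent inter-articulator timing. As a minimal sketch of the idea (the parameterization, array layout, and function name below are assumptions, not the model of Chapter 6), the likelihood of a two-chain CHMM can be evaluated with the forward algorithm on the product state space:

```python
import numpy as np

def chmm_forward(pi1, pi2, A1, A2, B1, B2, obs1, obs2):
    """Forward algorithm for a two-chain coupled HMM on the product state space.

    A1[i, j, k] = P(chain-1 goes to k | chain 1 was in i, chain 2 was in j)
    A2[j, i, l] = P(chain-2 goes to l | chain 2 was in j, chain 1 was in i)
    B1, B2      : per-chain discrete emission matrices (state x symbol)
    Returns P(obs1, obs2 | model).
    """
    alpha = np.outer(pi1 * B1[:, obs1[0]], pi2 * B2[:, obs2[0]])
    for t in range(1, len(obs1)):
        N1, N2 = alpha.shape
        nxt = np.zeros((N1, N2))
        for k in range(N1):
            for l in range(N2):
                # Sum over all joint previous states (i, j).
                nxt[k, l] = np.sum(alpha * A1[:, :, k] * A2[:, :, l].T) \
                            * B1[k, obs1[t]] * B2[l, obs2[t]]
        alpha = nxt
    return alpha.sum()
```

When the cross-chain dependence is removed (each chain's transitions ignore the other chain's state), the joint likelihood factors into the product of two independent HMM likelihoods, which is a useful sanity check on the formulation.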

To download the report in PDF format click here: USC-SIPI-411.pdf (2.7 MB)