“Improved Real-Time Magnetic Resonance Imaging of Speech Production”
by Yongwan Lim
December 2020
Human speech is a unique capability that involves complex and rapid movement of the vocal tract articulators. To understand the sounds of speech, it is important to observe how the different articulators move to produce them. In this sense, real-time magnetic resonance imaging (RT-MRI) has provided powerful insight into speech production because of its ability to non-invasively and safely capture the essential dynamic features of the vocal tract during speech. RT-MRI has driven a dramatic change in the nature of speech production research, contributing to the understanding of language, improved speech synthesis and recognition, and several clinical applications. Despite the great success of RT-MRI in the study of speech production, there remain unmet needs for improving the quality and quantity of imaging information about the dynamics of the vocal tract articulators. This dissertation introduces new tools for RT-MRI of speech production that offer steps toward a better understanding of speech production.

First, I develop a model-based deblurring method for spiral RT-MRI of speech production. This technique estimates and corrects for dynamic off-resonance, which appears as spatially and temporally varying blurring in the image. The method estimates a dynamic field map directly from the phase of single echo-time dynamic images after coil phase compensation, and I demonstrate that it can be applied directly to an existing multi-speaker dataset of running speech. I demonstrate improvements in the depiction and tracking of air-tissue articulator boundaries, quantitatively using an image sharpness metric and qualitatively via visual inspection, and illustrate the practical utility of the method on a representative use case.

Second, I develop a data-driven deblurring method for spiral RT-MRI of speech production.
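The dynamic field-map estimation at the heart of the first (model-based) method can be sketched as follows. This is a minimal illustration of the idea that phase accrued over the echo time encodes off-resonance; the function name, array layout, and the simple single-TE phase model are assumptions of this sketch, not the dissertation's exact algorithm.

```python
import numpy as np

def estimate_dynamic_field_map(frames, te, coil_phase=None):
    """Sketch: recover an off-resonance (field) map in Hz from the phase
    of single echo-time dynamic images.

    frames:     complex array of shape (time, ny, nx)
    te:         echo time in seconds
    coil_phase: optional coil/background phase map in radians, removed
                before conversion (stand-in for coil phase compensation)
    """
    imgs = np.asarray(frames, dtype=complex)
    if coil_phase is not None:
        imgs = imgs * np.exp(-1j * coil_phase)  # coil phase compensation
    # Phase accrued over TE maps to off-resonance: phi = 2*pi*f*TE
    return np.angle(imgs) / (2 * np.pi * te)
```

Note that `np.angle` wraps to (-pi, pi], so this simple conversion is only unambiguous when |f|·TE < 0.5; larger off-resonance would require phase unwrapping.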
A 3-layer residual convolutional neural network is presented that corrects image-domain off-resonance artifacts without knowledge of field maps. The mathematical connection between conventional deblurring methods and the proposed network architecture is derived. I propose a framework that leverages the model-based method to generate training data, together with an augmentation strategy. I validate the proposed method using synthetic and real in vivo data with longer readouts, quantitatively using image quality metrics and qualitatively via visual inspection, and compare it with conventional methods.

Finally, I develop a new 3D RT-MRI technique for imaging the full 3D vocal tract at high temporal resolution during natural speech. This technique utilizes an efficient golden-angle stack-of-spirals sampling and undersampling scheme, and constrained reconstruction. I evaluate it through in vivo imaging of natural speech production from two subjects and via comparison with interleaved multislice 2D RT-MRI. This promising tool for speech science enables, for the first time, quantitative identification of the spatial and temporal coordination of important tongue gestures coproduced on and off the midline in the articulation of the English consonants /l/ and /s/ via volume-of-interest analysis, and allows direct assessment of vocal tract area function dynamics during natural production of utterances.
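The golden-angle stack-of-spirals sampling used by the 3D technique can be sketched as a per-TR schedule: each TR acquires one spiral arm at some kz partition, with successive arms rotated by the golden angle. The sequential kz ordering, the specific golden-angle increment, and the function names below are assumptions of this sketch, not the dissertation's actual pulse sequence.

```python
import numpy as np

# Golden angle in radians (~137.5 degrees); the exact rotation increment
# used in the dissertation's sequence is an assumption of this sketch.
GOLDEN_ANGLE = np.pi * (3.0 - np.sqrt(5.0))

def stack_of_spirals_schedule(n_tr, n_kz):
    """Sketch of a golden-angle stack-of-spirals schedule.

    For each of n_tr repetitions, select a kz partition (here a simple
    sequential cycling, an assumed ordering) and rotate the in-plane
    spiral arm by the golden angle relative to the previous TR.
    Returns (kz_index, rotation_angle) arrays, each of length n_tr.
    """
    tr = np.arange(n_tr)
    kz = tr % n_kz                                 # cycle through kz partitions
    angles = (tr * GOLDEN_ANGLE) % (2 * np.pi)     # golden-angle arm rotation
    return kz, angles
```

A golden-angle increment keeps the angular coverage nearly uniform over any contiguous window of TRs, which is what lets an undersampled, constrained reconstruction trade off temporal resolution retrospectively.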