The USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-455

“Improving Language Understanding and Summarization by Leveraging Auxiliary Information Through Self-Supervised or Unsupervised Learning”

by Karan Singla

December 2021

Spoken language understanding (SLU) is an exciting field of research that lies at the intersection of speech and language processing. It investigates human/machine, human/human, or machine/machine communication by leveraging information and technologies from natural language processing, signal processing, machine learning, pattern recognition, and artificial intelligence. SLU systems are designed to extract the meaning from a spoken utterance, and their applications are vast, ranging from voice search on mobile devices and intent and behavior understanding to meeting summarization, attracting interest from both commercial and academic sectors. Understanding such human-centered conversational data includes inferring the underlying intent and behavior of the speaker.

Existing methods for SLU often require a costly pipeline of systems to understand a spoken utterance. A typical pipeline-based speech processing system includes an automatic speech recognizer (ASR), which transcribes speech into text. This text is then used by NLP pipelines for classification or regression tasks. Many SLU tasks, including information understanding and emotion and behavior understanding, rely on these speech processing pipelines for natural language understanding. However, there have been limited efforts toward multimodal understanding of behavior, primarily due to the unavailability of End-2-End annotations (annotations made by listening to speech). Additionally, these SLU pipelines fail to efficiently leverage useful auxiliary information from acoustic-prosodic cues and unsupervised clustering, as well as the gains from joint multitask training with language learning tasks.
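
The sketch below illustrates the cascaded pipeline described above: an ASR stage produces a transcript, and a downstream text model consumes it. This is a minimal structural sketch only; the class names and the placeholder rule-based stages are illustrative stand-ins, not the systems evaluated in the report.

```python
# Minimal sketch of a cascaded SLU pipeline: speech -> ASR transcript -> text label.
# The component names and dummy implementations are hypothetical placeholders.

from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    audio: List[float]          # raw waveform samples (placeholder)
    sample_rate: int = 16000


class SpeechRecognizer:
    """Stage 1: ASR. A real system would decode audio with acoustic and language models."""

    def transcribe(self, utt: Utterance) -> str:
        # Placeholder: return a fixed transcript instead of decoding audio.
        return "i feel like nothing i do is ever good enough"


class TextClassifier:
    """Stage 2: NLP. A real system would use a trained text model."""

    def predict(self, transcript: str) -> str:
        # Placeholder keyword rule standing in for a learned classifier.
        negative_cues = {"nothing", "never", "cannot"}
        tokens = set(transcript.lower().split())
        return "negative" if tokens & negative_cues else "neutral"


def slu_pipeline(utt: Utterance) -> str:
    """Cascade the two stages. Note that ASR errors propagate downstream and
    the acoustic-prosodic signal is discarded once the transcript is produced."""
    transcript = SpeechRecognizer().transcribe(utt)
    return TextClassifier().predict(transcript)


if __name__ == "__main__":
    print(slu_pipeline(Utterance(audio=[0.0] * 16000)))
```

The cascade makes the limitation concrete: everything after transcription sees only the text, which is exactly the information loss the report's multimodal methods aim to avoid.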

In my work, I show that leveraging acoustic-prosodic information can aid lexical text in understanding behavior codes, and I propose methods for their multimodal, transcription-free prediction. I also propose novel methods for leveraging auxiliary information to learn text representations and perform summarization. First, I show that learning generic task representations that exploit additional monolingual data through joint multitask training can help generalize task-related knowledge to a larger vocabulary. Second, I propose factored extractive summarization techniques that can efficiently summarize spoken language by exploiting psycholinguistic information and topic models. We believe this provides sufficient evidence that End-2-End methods that leverage linguistic structure and exploit auxiliary information through self-supervision enable multimodal, transcription-free understanding of behavior and efficient summarization of spoken language.
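
To make the multimodal idea concrete, the sketch below shows one common way to combine the two views: early fusion of an utterance-level text embedding with acoustic-prosodic features before classification into behavior codes. The dimensions, feature choices, and number of codes are illustrative assumptions, not the report's actual architecture.

```python
# Illustrative early-fusion model: concatenate a text embedding with
# acoustic-prosodic features and classify into behavior codes.
# All dimensions and the feature choices are hypothetical.

import torch
import torch.nn as nn


class MultimodalBehaviorClassifier(nn.Module):
    """Fuses a text embedding with acoustic-prosodic features (e.g., pitch
    and energy statistics) and predicts a behavior code."""

    def __init__(self, text_dim: int = 768, prosody_dim: int = 32,
                 hidden_dim: int = 128, num_codes: int = 4):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(text_dim + prosody_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_codes),
        )

    def forward(self, text_emb: torch.Tensor,
                prosody_feats: torch.Tensor) -> torch.Tensor:
        # Early fusion: concatenate the two views, then classify.
        fused = torch.cat([text_emb, prosody_feats], dim=-1)
        return self.fusion(fused)


if __name__ == "__main__":
    model = MultimodalBehaviorClassifier()
    text_emb = torch.randn(8, 768)   # e.g., sentence-encoder outputs
    prosody = torch.randn(8, 32)     # e.g., pitch/energy statistics
    logits = model(text_emb, prosody)
    print(logits.shape)              # torch.Size([8, 4])
```

In a transcription-free setting, the text branch would be replaced by representations learned directly from speech; the fusion step itself stays the same.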

To download the report in PDF format, click here: USC-SIPI-455.pdf (38.5 MB)