“Speech Recognition Error Modeling For Robust Speech Processing and Natural Language Understanding Applications”
by Prashanth Gurunath Shivakumar
May 2021
Automatic Speech Recognition (ASR) is becoming increasingly important in everyday life and has become a core component of human-computer interaction. It is a key part of many applications involving virtual assistants, voice assistants, gaming, robotics, natural language understanding, education, communication and pronunciation tutoring, call routing, interactive media entertainment, and more. The growth of such applications and their adoption in everyday scenarios point to ASR becoming a ubiquitous part of our daily life in the near future. This has been made possible in part by the high performance achieved by state-of-the-art speech recognition systems. However, the errors produced by ASR can often have a negative impact on downstream applications. In this work, we focus on modeling ASR errors, with the hypothesis that accurately modeling such errors makes it possible to recover from them and to alleviate their negative consequences for downstream applications.

We model the ASR as a phrase-based noisy transformation channel and propose an error correction system that can learn from the aggregate errors of all the independent modules constituting the ASR and attempt to invert them. The proposed system can exploit long-term context and can re-introduce previously pruned or unseen phrases, in addition to better choosing between existing ASR output possibilities. We show that the system provides improvements over a range of different ASR conditions without degrading accurate transcriptions. We also show that the proposed system provides consistent improvements on out-of-domain tasks, as well as over highly optimized ASR models re-scored by recurrent neural language models. Further, we propose a sequence-to-sequence neural network for modeling ASR errors that incorporates much longer contextual information. We propose suitable architectures and subword-based feature representations, and demonstrate improvements over the phrase-based noisy channel model.

Additionally, we propose a novel word vector representation, Confusion2Vec, motivated by human speech production and perception, that encodes representational ambiguity. The representational ambiguity of acoustics, which manifests itself in word confusions, is often resolved by both humans and machines through contextual cues. We present several techniques for training the representation to capture acoustic perceptual similarity and ambiguity, learning in an unsupervised manner from data generated from ASR confusion networks or lattice-like structures. Appropriate evaluations are formulated for gauging acoustic similarity, in addition to semantic-syntactic and word similarity evaluations. Confusion2Vec models word confusions efficiently without compromising semantic-syntactic word relations, thus effectively enriching the word vector space with additional, task-relevant ambiguity information. The proposed Confusion2Vec can also extend to a range of representational ambiguities that arise in domains beyond acoustic perception, such as morphological transformations, word segmentation, and paraphrasing for natural language processing tasks like machine translation, as well as visual perceptual similarity for image processing tasks like image summarization and optical character recognition. This work also contributes towards efficient coupling of ASR with the various downstream algorithms operating on its outputs.
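To make the Confusion2Vec training idea concrete, the following minimal Python sketch generates skip-gram-style training pairs from an ASR confusion network; the bin-based data structure, the function names, and the simple intra-/inter-confusion pairing scheme are illustrative assumptions rather than the exact recipe used in this work.

# Minimal sketch: generating skip-gram-style training pairs from an ASR
# confusion network for a Confusion2Vec-like embedding.
# Assumption: a confusion network is a list of "bins", each bin a list of
# competing word hypotheses aligned to the same time span.

from itertools import permutations

def intra_confusion_pairs(confusion_network):
    """Pairs of co-occurring alternatives within a bin (acoustic ambiguity)."""
    pairs = []
    for bin_words in confusion_network:
        pairs.extend(permutations(bin_words, 2))
    return pairs

def inter_confusion_pairs(confusion_network, window=2):
    """Pairs of words across neighbouring bins (semantic-syntactic context)."""
    pairs = []
    for i, bin_words in enumerate(confusion_network):
        lo, hi = max(0, i - window), min(len(confusion_network), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            pairs.extend((w, c) for w in bin_words for c in confusion_network[j])
    return pairs

# Hypothetical confusion network for the utterance "I want to fly to Boston"
cn = [["i"], ["want", "won't"], ["to", "two"], ["fly", "fry"], ["to"], ["boston", "austin"]]
training_pairs = intra_confusion_pairs(cn) + inter_confusion_pairs(cn)
# These (target, context) pairs can then be fed to a standard skip-gram trainer.

In this sketch, the intra-confusion pairs expose acoustically confusable alternatives to each other, while the inter-confusion pairs preserve the usual contextual co-occurrences, so the resulting vector space can jointly encode both kinds of relations.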
We demonstrate the efficacy of Confusion2Vec by proposing a recurrent neural network based spoken language intent detection system that achieves state-of-the-art results under noisy ASR conditions. Through experiments with the proposed model, we show that ASR errors often involve acoustically similar words, and that Confusion2Vec, with its inherent model of acoustic relationships between words, is able to compensate for these errors. Improvements are also demonstrated when training the intent detection models on noisy ASR transcripts. This work opens new opportunities for incorporating Confusion2Vec embeddings into a whole range of full-fledged applications. Further, we extend the previously proposed Confusion2Vec by encoding each word in the vector space through its constituent subword character n-grams. We show that the subword encoding better represents the acoustic perceptual ambiguities of human spoken language via information modeled on lattice-structured ASR output. The efficacy of the subword Confusion2Vec is evaluated using semantic, syntactic, and acoustic analogy and word similarity tasks. We demonstrate the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations, as well as the non-subword Confusion2Vec embeddings, when evaluated on erroneous ASR outputs. Finally, we demonstrate that Confusion2Vec subword modeling eliminates the need to retrain or adapt natural language understanding models on ASR transcripts.
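As a rough illustration of the subword extension, the sketch below composes a word vector from its character n-grams, so that even a word unseen at training time (for example, an ASR misrecognition) receives a vector close to acoustically similar words; the n-gram lookup table `ngram_vectors`, the boundary markers, and the n-gram range are assumptions made for illustration, not the exact configuration used in this work.

# Minimal sketch: composing a word vector from its character n-grams,
# in the spirit of the subword extension of Confusion2Vec.
# Assumptions: a pre-trained table `ngram_vectors` mapping n-grams to vectors,
# and '<' / '>' word-boundary markers as in fastText-style subword models.

import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word, including boundary markers."""
    token = f"<{word}>"
    return [token[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def word_vector(word, ngram_vectors, dim=300):
    """Average the vectors of the word's known character n-grams."""
    grams = [g for g in char_ngrams(word) if g in ngram_vectors]
    if not grams:
        return np.zeros(dim)
    return np.mean([ngram_vectors[g] for g in grams], axis=0)

# A misrecognized form such as "bostin" shares many n-grams with "boston",
# so its composed vector stays close to the intended word in the embedding space.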