The USC Andrew and Erna Viterbi School of Engineering USC Signal and Image Processing Institute USC Ming Hsieh Department of Electrical Engineering University of Southern California

Technical Report USC-SIPI-387

“Active Data Acquisition for Building Language Models for Speech Recognition”

by Abhinav Sethy

August 2007

The ability to build task-specific language models rapidly and with minimal human effort is an important factor in the fast deployment of natural language processing applications, such as speech recognition, in new domains. Although in-domain data is hard to gather, we can exploit large, easily accessible sources of generic text, such as the Internet (WWW) or the GigaWord corpus, to build statistical task language models through appropriate data selection and filtering. In this work I show that significant improvements in language model performance can be achieved by simultaneously boosting the coverage and the relevance of the generic corpus.

The relevance of an adaptation corpus depends on the degree to which its style and content match the domain of interest. The mismatch between in-domain data and generic corpora can be cast as a semi-supervised learning problem: we can model the generic corpus as a mix of sentences from two classes, in-domain (I) and noise (N, or out-of-domain). The labels I and N are latent and unknown for the sentences in the generic corpus, but we usually have a small number of examples of I from the limited in-domain data. Selecting the right labels for the unlabeled set is essential for benefiting from it. I will show that, similar to the question of balance in semi-supervised learning for classification, the question of distributional similarity must be addressed when selecting utterances for building a language model from noisy data: the subset of sentences selected from the generic corpus to build the adaptation language model should have a distribution similar to that of the in-domain data. To address this issue, I will present an incremental algorithm that compares the distribution of the selected set with that of the in-domain examples using a relative entropy (RE) criterion.
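The incremental selection idea can be sketched as follows, under simplifying assumptions: unigram distributions, add-one smoothing over the in-domain vocabulary, and a single greedy pass that keeps a generic sentence only if adding it lowers the relative entropy between the in-domain distribution and the selected set. The function names and the smoothing scheme here are illustrative, not the report's actual implementation.

```python
import math
from collections import Counter

def kl_divergence(p_counts, q_counts, vocab, alpha=1.0):
    """D(P || Q) over a shared vocabulary, with add-alpha smoothing
    so that unseen words do not produce zero probabilities."""
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    d = 0.0
    for w in vocab:
        p = (p_counts[w] + alpha) / p_total
        q = (q_counts[w] + alpha) / q_total
        d += p * math.log(p / q)
    return d

def select_sentences(in_domain, generic):
    """Greedy incremental selection: accept a generic sentence only if
    adding its (in-vocabulary) words lowers D(P_in || P_selected)."""
    p_counts = Counter(w for s in in_domain for w in s.split())
    vocab = set(p_counts)
    sel_counts = Counter()
    selected = []
    best = kl_divergence(p_counts, sel_counts, vocab)
    for sent in generic:
        # Out-of-vocabulary words are ignored in this simplified sketch.
        trial = sel_counts + Counter(w for w in sent.split() if w in vocab)
        d = kl_divergence(p_counts, trial, vocab)
        if d < best:
            best, sel_counts = d, trial
            selected.append(sent)
    return selected
```

On a toy example, a sentence matching the in-domain distribution is kept while noise sentences are rejected, because they would skew the selected set's unigram distribution away from the in-domain one.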
Experimental results are provided that show the superiority of the proposed algorithm over existing schemes.

The coverage of an adaptation corpus indicates the degree to which it covers the topics and styles implicit in the domain of interest. I will present methods that use clustering for querying and merging to achieve significant performance improvements. In some speech recognition applications, such as spoken document retrieval and automated call centers, large amounts of untranscribed speech data are available. I will present methods that use recognition hypotheses generated from this raw speech data, in conjunction with the generic corpus, to build better language models by boosting coverage.

This report is not currently available in PDF format for downloading. Contact the Signal and Image Processing Institute for information on its availability.