“Robust Speaker Clustering Under Variation in Data Characteristics”
by Kyu Jeong Han
December 2009
Speaker clustering refers to the process of classifying a set of input speech data (or speech segments) by speaker identity in an unsupervised way, based on measuring the similarity of speaker-specific characteristics among the data. The process identifies which speech segments belong to the same speaker source without any prior speaker-specific information about the given input data. This speaker-perspective, unsupervised classification of speech data can be applied in various ways as a pre-processing step for speech/speaker recognition or multimedia data segmentation/classification. For this reason, speaker clustering has recently attracted much attention in the research areas of speech recognition and multimedia data processing.
One major, yet unsolved, issue in the field of speaker clustering is unreliable clustering performance under variation in the characteristics of the input speech data. In this dissertation, we address this problem within the framework of agglomerative hierarchical speaker clustering (AHSC) from two perspectives: stopping point estimation and inter-cluster distance measurement. To improve the robustness of stopping point estimation for AHSC under such variation, we propose a new statistical measure, called information change rate (ICR), that helps estimate the optimal stopping point more accurately and robustly. The ICR-based stopping point estimation method is verified, both empirically and theoretically, to be more robust to variation in the input speech data than the conventional BIC-based method. To improve the robustness of inter-cluster distance measurement for AHSC, we also propose selective AHSC and incremental Gaussian mixture cluster modeling. These two approaches are shown through various types of experiments to make speaker clustering performance substantially more reliable under variation in the input speech data.
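For readers unfamiliar with AHSC, the loop it refers to — repeatedly merging the closest pair of clusters until a model-selection criterion says to stop — can be sketched as follows. This is an illustrative sketch of the conventional BIC-based baseline mentioned above, not of the dissertation's ICR measure; the Gaussian cluster model, the `lam` penalty weight, and all function names are assumptions for the example.

```python
import numpy as np

def delta_bic(x, y, lam=1.0):
    """Conventional Delta-BIC for merging clusters x and y (rows = frames,
    columns = features). A negative value means one Gaussian fits the pooled
    data well enough to justify merging; lam weights the model-size penalty."""
    z = np.vstack([x, y])
    d = z.shape[1]

    def half_log_det_term(c):
        # (n/2) * log|cov|, with a small regularizer for numerical stability
        cov = np.cov(c, rowvar=False) + 1e-6 * np.eye(d)
        return 0.5 * len(c) * np.linalg.slogdet(cov)[1]

    # penalty for the extra free parameters of keeping two Gaussians
    penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(len(z))
    return half_log_det_term(z) - half_log_det_term(x) - half_log_det_term(y) - penalty

def ahsc(segments, lam=1.0):
    """Agglomerative hierarchical speaker clustering with BIC-based stopping.
    segments: list of (n_i, d) feature arrays, one per speech segment."""
    clusters = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > 1:
        # find the closest pair under the Delta-BIC inter-cluster distance
        pairs = [(delta_bic(clusters[i], clusters[j], lam), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        best, i, j = min(pairs)
        if best >= 0:  # stopping point: no remaining merge improves BIC
            break
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

The stopping decision (`best >= 0`) is exactly where the dissertation's criticism applies: the BIC threshold is sensitive to how much data each cluster holds, which is what motivates replacing it with a measure such as ICR.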
Based on these results on robust speaker clustering, we extend our interest to implementing a speaker diarization system that is more robust to variation in input audio data. (Speaker diarization refers to an automated process that annotates a given audio source in terms of "who spoke when.") Focusing on speaker diarization of meeting conversation speech, we propose two refinement schemes to further improve the reliability of speaker clustering within the diarization framework. One is selection of representative speech segments, and the other is modeling of interaction patterns between meeting participants; both are experimentally verified to enhance the reliability of speaker clustering and thus the overall diarization accuracy under variation in the input audio data.