“A Study of Unsupervised Speaker Indexing”
by Soon-Il Kwon
May 2005
Speaker indexing sequentially detects the points where speaker identity changes in a multi-speaker audio stream and classifies each detected segment according to the speaker's identity. This thesis addresses three challenges. The first is efficient sequential speaker change detection. The second is that the number and identity of the speakers are unknown. The third is building speaker models from only small amounts of training data. To address the first challenge, a localized search algorithm is proposed that aims to detect speaker changes using a minimal amount of data for speech analysis. To address speaker modeling under unsupervised conditions, a novel predetermined set of generic speaker-independent models, called the Sample Speaker Models (SSM), is proposed. This set can support more accurate speaker modeling and clustering without requiring models to be trained on target speaker data. Once a speaker-independent model is selected from the SSM, it is progressively adapted into a speaker-dependent model. Experiments were performed with data from the Speaker Recognition Benchmark NIST Speech (1999) and the HUB-4 Broadcast News Evaluation English Test Material (1999). With sample speaker models drawn using the Markov Chain Monte Carlo (MCMC) method, the proposed technique gave 92.5% indexing accuracy on two-speaker telephone conversations, 89.6% on four-speaker conversations of telephone speech quality, and 87.2% on broadcast news. In these experiments, the SSM outperformed the Universal Background Model by up to 29.4% absolute and the Universal Gender Models by up to 22.5% absolute in indexing accuracy. Although the SSM is useful in unsupervised speaker indexing, an optimal sampling method is still required. To address this problem, the Speaker Quantization method, motivated by Tree Structured Vector Quantization, is proposed and experimentally compared with the MCMC approach.
Experimental results showed that the new sampling approach, although not optimal, outperformed random selection by 22.7% relative in error rate on telephone conversations and by 19.8% relative on broadcast news. We also studied the capacity of unsupervised speaker indexing analytically. This analysis showed that the similarity between the speakers to be indexed plays an important role in determining the most appropriate number of sample speaker models.
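
The select-then-adapt idea behind the SSM can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the implementation used in the thesis: each sample speaker model is reduced to a single diagonal-covariance Gaussian, the model chosen for a segment is the one with the highest log-likelihood, and adaptation is a simple relevance-weighted update of the mean. The names `log_likelihood`, `select_model`, `adapt_mean`, and the parameter `relevance` are hypothetical.

```python
import math


def log_likelihood(frames, mean, var):
    """Total log-likelihood of feature frames under a diagonal Gaussian."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def select_model(frames, models):
    """Return the index of the sample model that best explains the segment."""
    scores = [log_likelihood(frames, m["mean"], m["var"]) for m in models]
    return max(range(len(models)), key=lambda i: scores[i])


def adapt_mean(frames, mean, relevance=4.0):
    """Shift a model mean toward the segment mean (assumed MAP-style update)."""
    n = len(frames)
    dim = len(mean)
    seg_mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    alpha = n / (n + relevance)  # more frames -> more weight on the segment
    return [alpha * s + (1.0 - alpha) * m for s, m in zip(seg_mean, mean)]


# Two hypothetical sample speaker models over 2-dimensional features.
models = [
    {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"mean": [3.0, 3.0], "var": [1.0, 1.0]},
]
# A hypothetical segment of four feature frames.
segment = [[2.5, 3.5], [3.5, 2.5], [3.0, 3.0], [2.0, 2.0]]

best = select_model(segment, models)
adapted = adapt_mean(segment, models[best]["mean"])
```

In this toy setting the segment lies closer to the second model, so that model is selected and its mean is pulled partway toward the segment mean. Repeating the update as more segments are assigned to the same speaker is one way to picture the progressive adaptation from a speaker-independent to a speaker-dependent model.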