“Multimodal and Self-guided Clustering Approaches Toward Context Aware Speaker Diarization”
by Taejin Park
May 2021
Speaker diarization has become an important field in recent years owing to the growing demand for conversational artificial intelligence and interactive entertainment systems. In a conversation analysis pipeline, speaker labels should be determined before the natural language understanding units run, so that the system can accurately understand the content of the conversation. Speaker diarization therefore plays a crucial role in automatic discourse analysis systems as an essential preprocessing step.
In this dissertation, we propose multimodal speaker diarization techniques and self-guided clustering methods that work toward the goal of a context-aware speaker diarization system. The proposed system focuses on two aspects. First, we study a self-guided speaker diarization system that can determine its parameters on its own, based on the context of the input samples. This line of research covers the clustering phase, parameter tuning during speaker representation learning, and the determination of an adequate segment window length. We demonstrate that certain parameter tuning processes needed to perform a speaker diarization task can be automated. Second, we investigate a method for incorporating other modalities, such as the lexical context, into a speaker diarization system. We show that incorporating the lexical context improves the accuracy of the estimated speaker labels in the temporal domain. In doing so, we outline a forward-looking speaker diarization system of the kind we expect to see in both industry and academia.
The overall objective of this dissertation is to propose novel techniques for improving speaker diarization systems using the aforementioned methods. In addition, we cover the machine learning approaches behind the proposed techniques and explain how the clustering and speaker recognition problems can be modeled.