The USC Andrew and Erna Viterbi School of Engineering
USC Signal and Image Processing Institute
USC Ming Hsieh Department of Electrical and Computer Engineering
University of Southern California

Technical Report USC-SIPI-430

“Classification and Retrieval of Environmental Sounds”

by Sachin Chachada

May 2016

Audio processing research has primarily focused on speech and music signals, while research on general audio or environmental sound processing, i.e., audio signals other than speech and music, has been scant. Owing to its numerous applications in fields such as surveillance, biodiversity monitoring, and robot audition, there is a need for a good Environmental Sound Recognition (ESR) system. Hence, in this dissertation, we work on classification and retrieval systems for environmental sounds.

In order to build an efficient ESR model, it is important to be able to characterize environmental sounds. Environmental sounds are rich in both context and content; what sets them apart from music and speech is their non-stationary nature. Hence, recent work has focused on the study and development of features for environmental sounds that capture their non-stationary characteristics. This work first assesses these features on a common test database; the analysis helped us understand their power and limitations. Features motivated by the dual time-frequency (TF) representation have become quite popular and have proven successful. Among these, sparse representation over a Gabor dictionary is a popular and recent feature. However, we show that these features fail when applied to a large and diverse database. Thus, we propose a modification: the signal is first filtered with a narrow-band filterbank, and the features are then extracted from each band-limited signal. The proposed features, Narrow Band Time-Frequency features, are shown to be robust for large-scale databases.
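As a rough illustration of this idea, the sketch below band-limits a signal with a simple Butterworth filterbank and runs a small greedy matching pursuit over a Gabor dictionary in each band, keeping the atom energies as features. The band edges, atom grid, and filter order are illustrative assumptions, not the settings used in the report.

```python
# Hypothetical sketch of the narrow-band TF feature idea: band-limit the
# signal with a simple filterbank, then run a small matching pursuit over a
# Gabor dictionary in each band and keep the atom energies as features.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def gabor_atom(n, center, width, freq, fs):
    """Unit-norm Gaussian-windowed sinusoid (real Gabor atom)."""
    t = (np.arange(n) - center) / fs
    atom = np.exp(-0.5 * (t / width) ** 2) * np.cos(2 * np.pi * freq * t)
    return atom / (np.linalg.norm(atom) + 1e-12)

def matching_pursuit(x, atoms, n_iter=5):
    """Greedy matching pursuit; returns (atom index, coefficient) pairs."""
    residual = x.astype(float).copy()
    picks = []
    for _ in range(n_iter):
        corr = atoms @ residual
        k = int(np.argmax(np.abs(corr)))
        picks.append((k, corr[k]))
        residual -= corr[k] * atoms[k]
    return picks

def narrowband_tf_features(x, fs, bands=((100, 500), (500, 2000), (2000, 6000))):
    """For each narrow band, summarize the strongest Gabor atoms (assumed grid)."""
    n = len(x)
    centers = np.linspace(0, n - 1, 8)                 # atom time positions
    widths = (0.01, 0.05)                              # atom scales (seconds)
    freqs = np.geomspace(100, fs / 2 * 0.9, 10)        # atom frequencies (Hz)
    atoms = np.array([gabor_atom(n, c, w, f, fs)
                      for c in centers for w in widths for f in freqs])
    feats = []
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        xb = sosfiltfilt(sos, x)                       # band-limited signal
        picks = matching_pursuit(xb, atoms)
        feats.extend(abs(c) for _, c in picks)         # per-band atom energies
    return np.array(feats)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) + 0.1 * np.random.randn(fs)
    print(narrowband_tf_features(x, fs).shape)
```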

Environmental sounds are complex signals, and we believe it is hard to find a single feature that works for any database and scales to a large number of sound classes. Thus, we leverage decision fusion, also known as ensemble learning, classifier fusion, or multiple-expert learning, and propose a multi-classifier system for this task. The proposed Para-Boost Multi-Classifier System (PB-MCS) takes advantage of all the features and improves the overall performance of the ESR system. PB-MCS uses vertical decomposition, i.e., decomposition of the data matrix along the feature dimension, to form individual experts, and then combines the predictions of these experts. We also propose several variations of PB-MCS and study them in detail.
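A minimal sketch of the vertical-decomposition idea is given below: the feature matrix is split column-wise into one block per feature type, an expert classifier is trained on each block, and the experts' probability estimates are fused by soft voting. The feature blocks, the SVM experts, and the fusion rule are illustrative assumptions rather than the exact PB-MCS design.

```python
# Illustrative vertical decomposition: one expert per feature block,
# predictions fused by averaging class probabilities (soft voting).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

class ParaBoostMCS:
    def __init__(self, feature_blocks):
        # feature_blocks: list of column-index arrays, one per feature type
        self.blocks = feature_blocks
        self.experts = []

    def fit(self, X, y):
        self.experts = []
        for cols in self.blocks:
            expert = make_pipeline(StandardScaler(),
                                   SVC(kernel="rbf", probability=True))
            expert.fit(X[:, cols], y)          # train expert on its feature block
            self.experts.append(expert)
        return self

    def predict(self, X):
        # Soft-vote fusion: average the experts' class-probability estimates.
        probs = np.mean([e.predict_proba(X[:, cols])
                         for e, cols in zip(self.experts, self.blocks)], axis=0)
        classes = self.experts[0].classes_
        return classes[np.argmax(probs, axis=1)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 30))
    y = (X[:, :10].sum(axis=1) > 0).astype(int)
    blocks = [np.arange(0, 10), np.arange(10, 20), np.arange(20, 30)]
    Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
    model = ParaBoostMCS(blocks).fit(Xtr, ytr)
    print("accuracy:", np.mean(model.predict(Xte) == yte))
```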

Considering the exponentially growing amount of environmental sound data on the Internet, we need a good content-based retrieval system. Audio data on the Internet is often tagged with class labels, and hence it is reasonable to assume that the database is partially labeled. By allowing the database to be partially labeled, we take advantage of the labels to narrow the search for relevant sounds while leaving room for the database to grow without assigning labels to every document. To this end, we present a two-stage content-based (query-by-example) environmental sound retrieval system. In Stage I, we first exploit signal characteristics such as time-localized and frequency-localized energy distributions to perform a broad categorization of environmental sounds. This not only reduces the potential query-matching complexity, but also lets us customize the ensuing steps to exploit these characteristics. Next, for each category, a classifier is trained to predict labels for unlabeled data in the database and to narrow the search range for a query by assigning it multiple, yet limited, class labels. In Stage II, we propose a novel feature and a scoring scheme for local matching and ranking. First, each audio clip is segmented based on any chosen set of extracted features. This segmentation uses the mean shift approach and is therefore unsupervised. Then, relevant segments are extracted from each clip, and each segment is represented by its point of convergence in the feature space. The audio signal is finally represented by its energy distribution over these segments, thereby capturing its temporal variations in the feature space. Given a query, the audio segments of a document are first mapped to those of the query; the document is then assigned a score based on the energy distribution of the mapped segments only.
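The sketch below illustrates the Stage II matching idea under simplifying assumptions: frames of each clip are described by toy two-dimensional features, clustered with mean shift, each segment is represented by its point of convergence and its share of the clip's energy, and a query-to-document score is computed from the mapped segments only. The frame features and the exact scoring rule are assumptions for illustration, not the feature or score proposed in the report.

```python
# Illustrative Stage II sketch: mean-shift segmentation of frame features,
# per-segment energy shares, and query-to-document scoring on mapped segments.
import numpy as np
from sklearn.cluster import MeanShift

def frame_features(x, fs, frame_len=1024, hop=512):
    """Toy frame-level features: log energy and normalized spectral centroid."""
    feats, energies = [], []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len]
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_len, 1.0 / fs)
        energy = float(np.sum(frame ** 2))
        centroid = float(np.sum(freqs * spec) / (np.sum(spec) + 1e-12))
        feats.append([np.log(energy + 1e-12), centroid / (fs / 2)])
        energies.append(energy)
    return np.array(feats), np.array(energies)

def segment_representation(x, fs):
    """Mean-shift segmentation; returns convergence points and energy shares."""
    feats, energies = frame_features(x, fs)
    ms = MeanShift().fit(feats)
    centers, shares = [], []
    for lab in np.unique(ms.labels_):
        mask = ms.labels_ == lab
        centers.append(ms.cluster_centers_[lab])   # point of convergence
        shares.append(energies[mask].sum())        # segment energy
    shares = np.array(shares) / (np.sum(shares) + 1e-12)
    return np.array(centers), shares

def score(query_rep, doc_rep):
    """Map each query segment to its nearest document segment, then compare
    the energy shares of the mapped segments only (higher score = better)."""
    (qc, qs), (dc, ds) = query_rep, doc_rep
    total = 0.0
    for center, share in zip(qc, qs):
        j = int(np.argmin(np.linalg.norm(dc - center, axis=1)))
        total += min(share, ds[j])                 # energy-distribution overlap
    return total

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    query = rng.normal(size=fs)
    doc = rng.normal(size=2 * fs)
    print("score:", score(segment_representation(query, fs),
                          segment_representation(doc, fs)))
```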

To download the report in PDF format, click here: USC-SIPI-430.pdf (13.6 MB)