Towards Solving the Cocktail Party Problem for Animals

Feb 24, 2022

Two sibling Gelada Monkeys, Simien Mountains National Park, Ethiopia. Photo by Marc Guitard

At a crowded party or in a noisy restaurant, most of us do something remarkable. Out of all the sounds and voices around us, we can attribute each to a distinct source, and even attend to one specific speaker in a group while ignoring the rest. In 1953, British scientist Colin Cherry named this phenomenon the “cocktail party problem” (or the problem of source separation), and since then it has been a topic of study in the fields of psychology, neuroscience, computer science, and biology.

In recent years, machine learning techniques have increasingly been applied to recordings of animal sounds, redefining the state of the art in tasks such as sound event detection and classification. The cocktail party problem, that is, the problem of isolating and recognizing individual “speakers” (or, more generally, sound sources) in a recording with overlapping signals, has historically presented a major challenge in the processing of animal sound data.

In practice, the cocktail party problem is a major roadblock in the study of the natural world: Given the difficulty of separating acoustic mixtures, biologists are often forced to discard large amounts of data with overlapping vocalizations.

Until recently, computers have struggled to solve the cocktail party problem. That’s now changing.

While some advances have been made in solving the cocktail party problem for spoken human language, little work has been done in the domain of bioacoustics. This motivated us to tackle the problem for animal vocalizations. We introduce BioCPPNet, available in Scientific Reports, the first published neural-network-based approach to source separation for bioacoustic data.

Multiple belugas communicating. Note how hard it is to tell which whale is speaking, when, and what it is saying. For scientists studying groups of animals (pods of whales, flocks of birds, herds of elephants), the majority of communication data is often omitted from study. Source: https://ocr.org/sounds/belugas

To address the problem of bioacoustic source separation, we implement a complete pipeline to disentangle mixtures of overlapping animal calls. We apply our framework to a variety of species including rhesus macaques, bottlenose dolphins, Egyptian fruit bats, dogs, humpback whales, sperm whales, and elephants.

The details of our implementation and these experiments can be found in our recent paper BioCPPNet: automatic source separation with deep neural networks.

To the best of our knowledge, this paper describes the first end-to-end technique for separating overlapping non-human communication, recently joined by Google’s MixIT model, and we couldn’t be more excited about what this can enable for the wider biology and conservation community.

We’ve made the source code available on GitHub so that anyone can start using it.

Hear It

How does it sound? Here is our technique separating the calls of a variety of species. Note that bats and dolphins can hear frequencies 10 times higher than the most sensitive humans!

Source separations for two dogs, three macaques, and two dolphins.

The Approach

We selected a diverse set of animal species (bats, dolphins, and monkeys), all of which have very different vocal behaviors, and extracted their vocalizations from field recordings. We first constructed a dataset by taking isolated calls from different individuals and overlapping them to create synthetic mixtures that could plausibly occur in real-world scenarios. The goal is to train a neural network to ‘listen’ to the mixtures in the dataset and solve the source separation problem for them. The model outputs both spectrograms and waveforms of the separated sounds. Because we created the mixtures synthetically from solo calls, during training we can evaluate the model’s performance by comparing the original calls against the model’s output.

We based our work on an architecture known as a U-Net, which was originally developed to analyze biomedical images by assigning each pixel of the image a semantically meaningful type, for example by telling which pixels belong to the lungs, clavicle, and heart in an X-ray.
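To make the mixture-construction step concrete, here is a minimal sketch of how isolated calls could be overlapped into a synthetic mixture while keeping the ground-truth sources for later comparison. It is an illustration under our own assumptions, not the BioCPPNet data pipeline: the function name `make_mixture`, the `max_offset` parameter, the sample rate, and the sine tones standing in for real calls are all hypothetical.

```python
import numpy as np

def make_mixture(calls, rng, max_offset=0):
    """Overlap isolated calls into one synthetic mixture (hypothetical helper).

    calls: list of 1-D float arrays, one isolated call each, same sample rate.
    Returns (mixture, sources); sources has shape (n_calls, n_samples) and
    the mixture is simply the sample-wise sum of the sources.
    """
    # Random start time for each call, in samples (0 means perfectly aligned).
    offsets = [int(rng.integers(0, max_offset + 1)) for _ in calls]
    length = max(off + len(call) for off, call in zip(offsets, calls))
    sources = np.zeros((len(calls), length))
    for i, (off, call) in enumerate(zip(offsets, calls)):
        sources[i, off:off + len(call)] = call
    return sources.sum(axis=0), sources

# Stand-in "calls": two sine tones (assumed 16 kHz sample rate).
rng = np.random.default_rng(0)
sr = 16_000
t = np.arange(sr // 2) / sr
call_a = np.sin(2 * np.pi * 440 * t)
call_b = 0.5 * np.sin(2 * np.pi * 1200 * t[: sr // 4])
mixture, sources = make_mixture([call_a, call_b], rng, max_offset=1000)
```

Because `sources` is returned alongside `mixture`, a training loop can compare each predicted source against the original solo call, which is what makes the supervised setup possible.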

The U-Net architecture is good at assigning a category to each pixel of an image. Here, a U-Net is used to find the lungs, clavicle, and heart in an X-ray. Image source: *Fully Convolutional Architectures for Multi-Class Segmentation in Chest Radiographs*

The intuition is that if we first create a visual representation of the sound (a spectrogram), then we can train a U-Net-based neural network to determine which pixel in this representation belongs to which speaker, and from there reconstruct the individual sounds.

The Bioacoustic Cocktail Party Problem Network (BioCPPNet) uses a custom U-Net architecture to separate multiple animals vocalizing at the same time.
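The "which pixel belongs to which speaker" idea can be illustrated without any neural network. The sketch below is a toy, not the BioCPPNet model: it computes spectrograms with a small NumPy-only STFT, builds an oracle binary mask from the known sources (a trained network would have to predict something like this from the mixture alone), applies the mask to the mixture, and inverts back to a waveform. The window settings, the helper names `stft` and `istft`, and the sine-tone "calls" are assumptions made for the example.

```python
import numpy as np

N_FFT, HOP = 256, 64  # hypothetical settings; N_FFT must be a multiple of HOP

def stft(x):
    """Hann-windowed short-time Fourier transform, shape (frames, bins)."""
    window = np.hanning(N_FFT)
    extra = (-len(x)) % HOP                     # pad so frames tile exactly
    padded = np.pad(x, (N_FFT, N_FFT + extra))  # zero margins avoid edge effects
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def istft(spec, length):
    """Weighted overlap-add inverse of stft(), trimmed to `length` samples."""
    window = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    total = N_FFT + HOP * (spec.shape[0] - 1)
    out, norm = np.zeros(total), np.zeros(total)
    for i, frame in enumerate(frames):
        out[i * HOP : i * HOP + N_FFT] += frame * window
        norm[i * HOP : i * HOP + N_FFT] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[N_FFT : N_FFT + length]

# Two stand-in "speakers" at different pitches, and their mixture.
sr = 16_000
t = np.arange(sr // 2) / sr
src_a = np.sin(2 * np.pi * 440 * t)
src_b = np.sin(2 * np.pi * 3000 * t)
mix = src_a + src_b

# Oracle mask: each time-frequency "pixel" goes to the louder source.
mask_a = np.abs(stft(src_a)) > np.abs(stft(src_b))
spec_mix = stft(mix)
est_a = istft(mask_a * spec_mix, len(mix))
est_b = istft((~mask_a) * spec_mix, len(mix))
```

The toy sources barely overlap in frequency, so a hard mask recovers them almost perfectly. Real animal calls overlap far more, which is why a learned model is needed.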

Performance

When working with human speech, we can listen to the results to check that the separation ‘makes sense’, i.e., that the algorithm is not reconstructing something that sounds reasonable but is wrong. With non-human communication, we lack the biological apparatus to tell if the separation ‘makes sense’. We therefore assessed the performance of BioCPPNet using a number of quantitative and qualitative metrics, including:

(1) the scale-invariant signal-to-distortion ratio (SI-SDR), an abstract but useful measure,

(2) downstream classification accuracy, and

(3) visual inspection of spectrograms.
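For readers unfamiliar with metric (1), the sketch below shows one common way to compute SI-SDR with NumPy. It follows the standard definition (project the estimate onto the reference, then compare the energy of that projection with the energy of the residual, in decibels); the mean removal and the small `eps` term are conventional choices we are assuming here, and this is not taken from the BioCPPNet code.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB (higher is better)."""
    # Remove DC offsets so the score ignores constant shifts (assumed convention).
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Optimal scaling of the reference that best explains the estimate.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps)
                         / (np.dot(residual, residual) + eps))

# Toy check: a lightly corrupted copy scores higher than a heavily corrupted one.
rng = np.random.default_rng(0)
reference = np.sin(np.linspace(0, 100, 4000))
noise = rng.standard_normal(4000)
good = si_sdr(reference + 0.01 * noise, reference)
bad = si_sdr(reference + 0.5 * noise, reference)
```

Because the reference is rescaled before the comparison, multiplying the estimate by a constant leaves the score essentially unchanged, which is the "scale-invariant" part of the name.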

In particular, for (2), to quantify the fidelity of BioCPPNet’s predictions for the reconstructed sources, we fed the separation model’s outputs into separate deep neural networks trained to label the identity of the signaller. This metric makes the assumption that if the model does a good job of separating the calls in the mixture, then the predictions should preserve as many of the characteristics of the original call as possible.

Good news: we found that separated mixtures produced by the model were useful for these downstream tasks. For example, even after mixing and separating two rhesus macaque coo signals, the downstream classifier model was able to classify the identity of the separated calls with an accuracy of 93.7%. For comparison, our classifier achieved a state-of-the-art accuracy of 99.3% on the original unmixed macaque data.

Future Directions

As our approach makes use of supervised training algorithms, a lingering question is how well our model generalizes to real-world settings. Because of the limited number of recordings of individual animals, future directions should consider weakening the degree of supervision, so that the algorithm can learn what a good separation is from the data itself.

One step in this direction is Google AI’s recently published unsupervised birdsong source separation method, which employs the MixIT technique of unmixing mixtures of mixtures.
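The "mixtures of mixtures" idea can be sketched in a few lines. In MixIT, two existing mixtures are summed, the model separates the sum into several estimated sources, and the training loss asks whether those estimates can be regrouped to rebuild the two original mixtures, so no isolated reference calls are needed. The sketch below is a simplified illustration under our own assumptions: it uses mean squared error where the published method uses a signal-to-noise-ratio-based loss, and the function name `mixit_loss` is hypothetical.

```python
import itertools
import numpy as np

def mixit_loss(mix1, mix2, est_sources):
    """Best-case remixing error over all source-to-mixture assignments.

    est_sources: array of shape (n_sources, n_samples), the model's output
    for the input mix1 + mix2. Each estimated source is assigned to exactly
    one of the two mixtures; the lowest total error is returned.
    """
    n_src = est_sources.shape[0]
    best = np.inf
    for assignment in itertools.product([0, 1], repeat=n_src):
        to_mix2 = np.array(assignment, dtype=bool)
        remix1 = est_sources[~to_mix2].sum(axis=0)
        remix2 = est_sources[to_mix2].sum(axis=0)
        # Simplification: mean squared error instead of an SNR-based loss.
        loss = np.mean((mix1 - remix1) ** 2) + np.mean((mix2 - remix2) ** 2)
        best = min(best, loss)
    return best

# Toy check: perfect source estimates can be regrouped with zero error.
rng = np.random.default_rng(0)
true_sources = rng.standard_normal((3, 1000))
mix1 = true_sources[0] + true_sources[2]
mix2 = true_sources[1]
perfect = mixit_loss(mix1, mix2, true_sources)
random_guess = mixit_loss(mix1, mix2, rng.standard_normal((3, 1000)))
```

The exhaustive search over assignments grows as 2 to the power of the number of sources, which is manageable for the handful of outputs typically used.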

Additionally, our approach will benefit from larger datasets. For example, we find that in the cases of macaque coos and bottlenose dolphin signature whistles, for which we used less than an hour of acoustic data, BioCPPNet is limited in its ability to generalize to vocalizations made by individuals not represented in the training data. However, we observed improved generalization using an Egyptian fruit bat dataset that exceeded the size of the other datasets by several orders of magnitude. With this in mind, and noting the lack of standard benchmark datasets in the field of bioacoustics, we are also working towards curating larger public datasets and benchmarks in cooperation with our biology partners.

There are many potential applications for extensions of this work. For scientists studying groups of animals — pods of whales, flocks of birds, herds of elephants — the majority of communication data is often omitted from study. By being able to separate individual speakers, our understanding of social communications can increase rapidly. Another example: In rainforest conservation, passive acoustic monitors collect massive amounts of acoustic data which are used to create a measure of the health of the rainforest’s biodiversity. Better source separation can help create faster, cheaper, and more accurate measures of what species and how many individuals are present.

Get Involved

As an organization committed to open-source science, we have made the code available on our GitHub, and we encourage contributors to connect with us on Discord. You can read more about ESP on our website.