Self-Supervised Ethogram Discovery

Earth Species Project
May 3, 2022

We are excited to share news about a new research project underway at the Earth Species Project, spearheaded by Senior AI Researcher Benjamin Hoffman.

Wild Griffon vulture (Gyps fulvus) at Gamla Nature Reserve, Israel

Project Summary

As animal-attached tags (‘biologgers’) have gotten smaller and lighter, there has been an explosion of data collected with microphones, accelerometers, and other sensors from wild animals.

Biologgers give an unprecedented view into the lives of a wide range of species, but come with two major challenges:

(i) The amount of recorded data is too vast to analyze using traditional methods.

(ii) The interpretation of the raw sensor data in terms of underlying behaviors is challenging, especially for animals that can’t be directly observed.

Machine learning (ML) is a useful tool to overcome these challenges. Yet, most current ML-based models are supervised, meaning they are limited to learning behavioral labels from human annotations.

A powerful new approach in ML called ‘self-supervision’ allows models to discover patterns in data without the need for human annotations. We expect that this technique will overcome the above challenges when applied to the problem of detecting behavioral patterns in raw sensor data.

ESP is embarking upon a research program with the following goals:

  1. Build a diverse collection of biologger datasets with known ground truth behavior labels, allowing us to measure model performance as a public benchmark.
  2. Build and test self-supervised models for the automatic discovery and labeling of behavioral motifs in animal body motion.
  3. Make the data and code publicly available for others to use and contribute to, in the expectation that this will accelerate further research in the field of computational ethology.

“Crowned Elephant Seal” by Etienne Pauthenet

In Detail:

To understand an animal’s behavior, scientists construct an inventory of what types of actions it performs. This inventory, called an ‘ethogram’, is then used to classify observed actions. One can then quantify how frequently an animal performs a specific type of action, and how this rate varies with other factors.
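As a minimal sketch of the quantification step described above, the snippet below counts how often each action type appears in a sequence of observations and converts the counts to hourly rates. The action names, the `action_rates` helper, and the two-hour observation period are hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Sequence


def action_rates(observed_actions: Sequence[str], observation_hours: float) -> Dict[str, float]:
    """Return the number of occurrences per hour for each action in an ethogram."""
    if observation_hours <= 0:
        raise ValueError("observation_hours must be positive")
    counts = Counter(observed_actions)
    return {action: n / observation_hours for action, n in counts.items()}


# Hypothetical sequence of actions scored by an observer over two hours.
observations = ["preen", "feed", "feed", "fly", "feed", "preen"]
rates = action_rates(observations, observation_hours=2.0)
```

Rates computed this way can then be compared across individuals, seasons, or habitats to see how behavior varies with other factors.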

However, observing wild animals in their natural habitats is challenging, and how an ethogram is used in practice varies between human observers. Studies using multiple observers require efforts to mitigate such biases in order to avoid measurement error.

It has recently become possible to automatically monitor animal body motion using animal-borne tags (Rutz and Hays, 2009), or third-person video footage (Mathis et al., 2018) (Figure 1). Using these data, ML models can produce descriptions of animal behavior in terms of categories specified by an ethogram at large scales, without introducing human bias (Egnor and Branson, 2016; Wang, 2019).

Yet, ML-based descriptions of animal behavior to date are typically created in a supervised setting (Wang, 2019). Here, a model is first trained to reproduce the behavioral labels generated by a human on a subset of the available data, and can then be used to predict labels for previously unseen data. This approach requires considerable human effort to label enough data to train the model.

An alternative approach is to use self-supervision, in which a model discovers an ethogram and classifies recorded behavior without the need for human annotations (Egnor and Branson, 2016). Here, the model is first trained to perform an auxiliary task; for example, it may be shown three seconds of data and asked to predict the next three seconds. In doing so, the model learns to recognize repeated motifs in the data. In a follow-up step, a human expert can examine a group of motifs that the model predicts to be similar, and interpret their behavioral significance in the sense of an ethogram (Figure 2). Because they do not require training labels, self-supervised models are species-agnostic. The same training framework can be applied to any unlabeled biologger data.
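To make the auxiliary task concrete, here is a rough sketch of how training pairs for "predict the next three seconds" could be cut from an unlabeled recording. This is an illustration only, not the project's pipeline: the `make_prediction_pairs` helper, the tri-axial sample format, the 2 Hz sampling rate, and the one-second stride are all assumptions.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float]  # one tri-axial accelerometer reading (x, y, z)


def make_prediction_pairs(
    samples: Sequence[Sample],
    sample_rate_hz: int,
    window_seconds: float = 3.0,
    stride_seconds: float = 1.0,
) -> List[Tuple[List[Sample], List[Sample]]]:
    """Split a recording into (context, target) pairs of consecutive windows.

    The target is simply the window that follows the context, so the
    training signal comes from the data itself rather than from annotations.
    """
    window = int(window_seconds * sample_rate_hz)
    stride = int(stride_seconds * sample_rate_hz)
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must cover at least one sample")
    pairs = []
    # Stop once a full context window plus a full target window no longer fits.
    for start in range(0, len(samples) - 2 * window + 1, stride):
        context = list(samples[start:start + window])
        target = list(samples[start + window:start + 2 * window])
        pairs.append((context, target))
    return pairs


# Hypothetical 10-second recording sampled at 2 Hz (20 samples).
recording = [(float(i), 0.0, 1.0) for i in range(20)]
pairs = make_prediction_pairs(recording, sample_rate_hz=2)
```

A model trained to map each context to its target never sees a behavioral label, which is why the same procedure can be reused on data from any species.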

In designing our self-supervised model, we will take inspiration from recent work in zero-resource natural language processing. Models with masked (and sometimes quantized) latent variables have shown success in self-supervised discovery of acoustic units, such as phonemes (Baevski et al., 2019; Hsu et al., 2021; van Niekerk et al., 2020). Analogously, we are interested in the discovery of behavioral units, and we believe this problem may be amenable to a similar modeling framework.
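The idea of quantized latent variables can be illustrated with a small sketch: each latent vector is replaced by the index of its nearest entry in a codebook, turning a continuous signal into a sequence of discrete units. This shows the general quantization step only, not the architecture of any cited model or of the model planned here; the two-dimensional latents, the three-entry codebook, and the helper names are made up for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Assign each latent vector to the index of its nearest codebook entry."""
    def squared_distance(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: squared_distance(vector, codebook[k]))
        for vector in latents
    ]


def collapse_repeats(units: Sequence[int]) -> List[int]:
    """Merge consecutive identical units into a single occurrence."""
    collapsed: List[int] = []
    for unit in units:
        if not collapsed or collapsed[-1] != unit:
            collapsed.append(unit)
    return collapsed


# Hypothetical codebook of three units and a short sequence of latent vectors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
latents = [(0.1, -0.1), (0.2, 0.0), (0.9, 1.2), (4.8, 5.1), (5.2, 4.9), (0.0, 0.1)]
units = quantize(latents, codebook)
unit_sequence = collapse_repeats(units)
```

In speech, such discrete units can correspond to phoneme-like segments; the analogous hope here is that they would correspond to candidate behavioral units for an expert to interpret.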

While self-supervised models have been used for ethogram discovery before, prior work in this area has significant shortcomings. Some models assume that animal motion is Markovian (Wiltschko et al., 2015), or rely on simple out-of-the-box methods (Sakamoto et al., 2009). Models are often tested on a single dataset, and so may not generalize well to other contexts (Berman et al., 2014; Luxem et al., 2020; Sakamoto et al., 2009; Wiltschko et al., 2015). Additionally, there exist no generally recognized benchmark datasets or metrics for ethogram discovery, in spite of recognition that these are vital for advancing computational approaches to biology (Tuia et al., 2022). As a result, it is difficult to measure progress in this field.

By collecting datasets representative of a wide variety of species and biologgers, we aim to make it possible to benchmark model performance and identify which modeling frameworks are best suited for discovering ethograms. Using these benchmark datasets, we will engineer a state-of-the-art model for self-supervised ethogram discovery. In particular, our focus will be on body motion data collected using accelerometers and gyroscopes, and follow-up work will include additional modalities such as audio. We will make the datasets and model source code publicly available, for use by biologists and for future development by ML researchers.

Minke Whale by Ari Friedlaender

Goals and Objectives

This project has three main objectives:

  1. The collection of a set of public benchmark datasets and the definition of evaluation metrics, in order to create common standards for quantifying the performance of automated ethogram discovery models.
  2. The development of an open-source, self-supervised ML model that discovers biologically meaningful behavioral motifs from unlabeled motion data.
  3. The public release of these data and model code, to serve as a foundation for future research in this area.

For each species, we will evaluate models based on how well their discovered behavioral motifs are able to recover the ground truth human labels, using standard clustering performance metrics such as precision, recall, and F1 score.
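One way such an evaluation could work is sketched below. Because discovered motifs carry no names, each motif is first mapped to the ground truth label it most often coincides with, and precision, recall, and F1 are then computed per label. The majority-vote mapping is an assumption made for this example, not a method stated by the project, and the motif IDs and behavior labels are invented.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence


def map_motifs_to_labels(motifs: Sequence[int], labels: Sequence[str]) -> Dict[int, str]:
    """Map each discovered motif to the human label it most often co-occurs with."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for motif, label in zip(motifs, labels):
        counts[motif][label] += 1
    return {motif: c.most_common(1)[0][0] for motif, c in counts.items()}


def per_label_scores(predicted: Sequence[str], labels: Sequence[str]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and F1 for every ground truth label."""
    scores = {}
    for label in sorted(set(labels)):
        tp = sum(p == label and t == label for p, t in zip(predicted, labels))
        fp = sum(p == label and t != label for p, t in zip(predicted, labels))
        fn = sum(p != label and t == label for p, t in zip(predicted, labels))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Hypothetical per-sample motif IDs from a model and labels from a human annotator.
motifs = [0, 0, 0, 1, 1, 2, 2, 2]
labels = ["rest", "rest", "fly", "fly", "fly", "feed", "feed", "rest"]

mapping = map_motifs_to_labels(motifs, labels)
predicted: List[str] = [mapping[m] for m in motifs]
scores = per_label_scores(predicted, labels)
```

Note that a majority-vote mapping can flatter a model that discovers many small motifs, so in practice the number of motifs would need to be reported or constrained alongside these scores.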

We will test our models along with previous baseline models on our benchmark datasets. We will also perform end-user testing with biologist partners to verify that our model consistently discovers biologically meaningful behavioral motifs.

We will publish our findings in a peer-reviewed open-access academic journal, and make all data sets and model code publicly available in order to accelerate further research in this field.


In order to build a collection of benchmark biologger datasets, we have identified some open-source data sources (e.g. Jeantet et al., 2020; Vehkaoja et al., 2022) and are working closely with our partners, Dr. Christian Rutz and Dr. Ari Friedlaender, to source datasets from their labs and among researchers in the ethology community.

Evolutionary biologist Dr. Christian Rutz led the teams that first deployed miniature video cameras (Rutz et al., 2007) and proximity loggers (Rutz et al., 2012) on wild birds, and is the founding President of the International Bio-Logging Society. He is currently co-leading the COVID-19 Bio-Logging Initiative to analyze animal movements before, during, and after the “Anthropause” to understand how wildlife behavior is affected by the presence and mobility of humans.

Behavioral ecologist Dr. Ari Friedlaender has been instrumental in the development of animal-borne sensors for marine mammals. His lab, the Friedlaender Lab of Biotelemetry and Behavioral Ecology at the University of California, Santa Cruz, maintains one of the largest marine motion-sensing tag databases in the world.


By developing a public benchmark for ethogram discovery, we will encourage research at the intersection of ML and biology to adopt shared datasets and evaluation metrics, giving future ethogram discovery models a common standard to be measured against. Just as common benchmarks have accelerated progress in computer vision (Russakovsky et al., 2015), we expect this to have a similar effect in computational ethology.

By making our source code, datasets, and documentation publicly available, we will give other researchers powerful tools to quantify the expression of behaviors across species, including those that are difficult for humans to observe. Alleviating the need for human annotation effort will unlock terabytes of previously recorded data.

We furthermore expect our models to have wide-ranging implications for behavioral ecology, as biologgers have been used to describe foraging dynamics, energetics, and social behavior (Kays et al., 2015; Hussey et al., 2015). Our models would enable these behavioral studies to harness the power of big data.

Conservation scientists use biologger data to predict the effects of human disturbance on wildlife populations, such as the impacts of military sonar on blue whale behavior (Southall et al., 2019). Using our proposed models, these scientists would have a tool to rapidly generate a detailed description of an individual’s activities before and after a disturbance. These insights may pave the way for mitigation strategies, including policy change and stricter animal protection legislation.

Moreover, behavioral indicators are used as early measures of the success of a conservation management program (Berger-Tal et al., 2011). Because our models will allow researchers to rapidly analyze massive amounts of data, they will create opportunities to incorporate biologgers into the evaluation of these programs.

Last but not least, this project will be an important step towards our ultimate goal of decoding animal communication, which is always embedded in the context of a physical environment. By interpreting an animal’s motion, we can provide context for interpreting its communication. Our approach to decoding animal communication involves applying machine learning methods to large volumes of acoustic data. The model developed in this project will allow us to integrate kinematic data, at scale, into our analyses.

Works Cited

Baevski, A., Schneider, S., & Auli, M. (2019). vq-wav2vec: Self-supervised learning of discrete speech representations. arXiv preprint arXiv:1910.05453.

Berman, G. J., Choi, D. M., Bialek, W., & Shaevitz, J. W. (2014). Mapping the stereotyped behaviour of freely moving fruit flies. Journal of The Royal Society Interface, 11(99), 20140672.

Berger-Tal, O., Polak, T., Oron, A., Lubin, Y., Kotler, B. P., & Saltz, D. (2011). Integrating animal behavior and conservation biology: a conceptual framework. Behavioral Ecology, 22(2), 236–239.

Egnor, S. R., & Branson, K. (2016). Computational analysis of behavior. Annual review of neuroscience, 39, 217–236.

Hsu, W. N., Bolte, B., Tsai, Y. H. H., Lakhotia, K., Salakhutdinov, R., & Mohamed, A. (2021). Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29, 3451–3460.

Hussey, N. E., Kessel, S. T., Aarestrup, K., Cooke, S. J., Cowley, P. D., Fisk, A. T., … & Whoriskey, F. G. (2015). Aquatic animal telemetry: a panoramic window into the underwater world. Science, 348(6240), 1255642.

Jeantet, L., Dell’Amico, F., Forin-Wiart, M. A., Coutant, M., Bonola, M., Etienne, D., … & Chevallier, D. (2018). Combined use of two supervised learning algorithms to model sea turtle behaviours from tri-axial acceleration data. Journal of Experimental Biology, 221(10), jeb177378.

Kays, R., Crofoot, M. C., Jetz, W., & Wikelski, M. (2015). Terrestrial animal tracking as an eye on life and planet. Science, 348(6240), aaa2478.

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.

Luxem, K., Fuhrmann, F., Kürsch, J., Remy, S., & Bauer, P. (2020). Identifying behavioral structure from deep variational embeddings of animal motion. BioRxiv,

Mathis, A., Mamidanna, P., Cury, K. M., Abe, T., Murthy, V. N., Mathis, M. W., & Bethge, M. (2018). DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience, 21(9), 1281–1289.

Pagano, A. M., 2018, Metabolic Rate, Body Composition, Foraging Success, Behavior, and GPS Locations of Female Polar Bears (Ursus maritimus), Beaufort Sea, Spring, 2014–2016 and Resting Energetics of an Adult Female Polar Bear: U.S. Geological Survey data release,

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … & Fei-Fei, L. (2015). Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3), 211–252.

Rutz, C., Bluff, L. A., Weir, A. A., & Kacelnik, A. (2007). Video cameras on wild birds. Science, 318(5851), 765–765.

Rutz, C., Burns, Z. T., James, R., Ismar, S. M., Burt, J., Otis, B., … & St Clair, J. J. (2012). Automated mapping of social networks in wild birds. Current Biology, 22(17), R669-R671.

Rutz, C., & Hays, G. C. (2009). New frontiers in biologging science. Biology Letters, 5(3), 289–292.

Sakamoto, K. Q., Sato, K., Ishizuka, M., Watanuki, Y., Takahashi, A., Daunt, F., & Wanless, S. (2009). Can ethograms be automatically generated using body acceleration data from free-ranging birds?. PloS one, 4(4), e5379.

Southall, B. L., DeRuiter, S. L., Friedlaender, A., Stimpert, A. K., Goldbogen, J. A., Hazen, E., … & Calambokidis, J. (2019). Behavioral responses of individual blue whales (Balaenoptera musculus) to mid-frequency military sonar. Journal of Experimental Biology, 222(5), jeb190637.

Tuia, D., Kellenberger, B., Beery, S., Costelloe, B. R., Zuffi, S., Risse, B., … & Berger-Wolf, T. (2022). Perspectives in machine learning for wildlife conservation. Nature communications, 13(1), 1–15.

Van Engelen, J. E., & Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109(2), 373–440.

van Niekerk, B., Nortje, L., & Kamper, H. (2020). Vector-quantized neural networks for acoustic unit discovery in the zerospeech 2020 challenge. arXiv preprint arXiv:2005.09409.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

Vehkaoja, A., Somppi, S., Törnqvist, H., Cardó, A. V., Kumpulainen, P., Väätäjä, H., … & Vainio, O. (2022). Description of movement sensor dataset for dog behavior classification. Data in Brief, 107822.

Wang, G. (2019). Machine learning for inferring animal behavior from location and movement data. Ecological informatics, 49, 69–76.

Wiltschko, A. B., Johnson, M. J., Iurilli, G., Peterson, R. E., Katon, J. M., Pashkovski, S. L., … & Datta, S. R. (2015). Mapping sub-second structure in mouse behavior. Neuron, 88(6), 1121–1135.


