Jan 6, 2020

Sound-Based Bird Classification

By Agnieszka Mikołajczyk, Magdalena Kortas

How a WiMLDS Trójmiasto team used deep learning, acoustics, and ornithology to classify bird species from sound.


Have you ever wondered about the name of the bird you just heard singing? A group of women from the local Polish chapter of Women in Machine Learning & Data Science (WiMLDS) not only thought about it but also decided to create a solution, on their own, to detect bird species based on the sound they make.

Female data scientists, PhD candidates, ornithologists, data analysts and software engineers who had prior experience with Python joined forces in a series of two-week-long sprints to work together on the project.

This project was designed to be a collaboration on a real-life problem which machine learning can help to solve, with a typical structure of a data science project: data research and analysis, data preparation, model creation, result analysis, model improvement and final presentation.

After weeks of work, the group managed to build a solution that predicts the right bird's name with 87% accuracy on the test sample.

Are you curious about the solution that has been built? We invite you to travel into a world of bird songs.

The birds' problem

Birdsong analysis and classification is a very interesting problem to tackle.

Birds produce many kinds of vocalizations, and each kind serves a different function. The two most common are songs and calls.

The song is the "prettier", melodic vocalization that birds use to mark their territory and attract mates. It is usually much longer and more complex than a call.

Calls include contact, attraction and alarm calls. Contact and attraction calls keep birds together in a group during flight or foraging, for example in the treetops. Alarm calls alert other birds when, for example, a predator arrives. Calls are most often short and simple.

Example: great tit (Parus major).

The song is a simple, lively, rhythmic verse with a slightly mechanical sound, e.g. "te-ta te-ta te-ta", or a three-syllable phrase with a different accent, "te-te-ta te-te-ta te-te-ta".
The calls form a rich repertoire: joyful "ping ping", cheerful "si yut-tee yut-tee" and chattering "te tuui". In autumn you can often hear a slightly questioning, shyer "te te tiuh". The alarm call is a hoarse, crackling "yun-yun-yun-yun". Fledglings fill the forest with a persistent, penetrating "te-te-te te-te-te".

Why can sound-based bird classification be challenging?

There are many problems you can encounter:

background noise, especially while using data recorded in a city, e.g. city noises, churches, cars;
multi-label classification, when there are many species singing at the same time;
different types of bird sounds;
intra-species variance, where the same species can sound different across regions or countries;
dataset issues, including imbalance, many species, different recording lengths, and varying recording quality.

So, how were the problems solved in the past?

Recognizing birds by their songs alone may be difficult, but it is not impossible. To find out how, we dove into research papers and discovered that most of the work was driven by AI challenges such as BirdCLEF and DCASE.

Fortunately, winners of those challenges usually describe their approaches, so after checking the leaderboards we obtained several interesting insights:

almost all winning solutions used Convolutional Neural Networks (CNNs) or Recurrent Convolutional Neural Networks (RCNNs);
the gap between CNN-based models and shallow, feature-based approaches remained considerably high;
even though many recordings were noisy, CNNs worked well without additional noise removal and many teams claimed that noise reduction techniques did not help;
data augmentation techniques were widely used, especially audio-processing techniques such as time or frequency shift;
some winning teams successfully used semi-supervised learning methods such as pseudo-labeling, and some increased AUC by model ensembling.
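
The time and frequency shifts mentioned in the list above can be sketched directly on a spectrogram array. This is only an illustration, not the code of any challenge entry: the function names `time_shift` and `frequency_shift`, the circular wrap-around in time, and the zero fill in frequency are all assumptions made for the example.

```python
import numpy as np


def time_shift(spec, frames):
    """Circularly shift a (n_bins, n_frames) spectrogram along the time axis."""
    return np.roll(spec, frames, axis=1)


def frequency_shift(spec, bins):
    """Shift a spectrogram along the frequency axis (axis 0).

    Positive values move content toward higher row indices, negative values
    toward lower ones. Rows pushed out of range are dropped and the vacated
    rows are filled with zeros.
    """
    shifted = np.zeros_like(spec)
    if bins > 0:
        shifted[bins:, :] = spec[:-bins, :]
    elif bins < 0:
        shifted[:bins, :] = spec[-bins:, :]
    else:
        shifted[:] = spec
    return shifted


# Toy spectrogram with 3 frequency bins and 4 time frames
spec = np.arange(12, dtype=float).reshape(3, 4)
print(time_shift(spec, 1))
print(frequency_shift(spec, 1))
```

In practice the shift size would be drawn at random for every training example, so that the network sees the same call at slightly different positions.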

But how do we apply CNNs, neural networks designed to extract features from images, when we only have sound recordings? The mel spectrogram is the answer.

From audio to mel spectrograms

Each sound we hear is composed of multiple sound frequencies at the same time. That is what makes audio sound "deep".

A spectrogram visualizes those frequencies over time in a single plot, instead of showing only the amplitude as a waveform does. The mel scale is a scale of pitches that listeners perceive as equally spaced, so it reflects how humans hear. Combining the two ideas gives a modified spectrogram, the mel spectrogram, which emphasizes the frequency ranges that human hearing resolves best instead of weighting all frequencies equally.
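
To make the mel scale concrete, here is a small sketch of the HTK-style conversion between hertz and mels, m = 2595 * log10(1 + f / 700). The mel spectrogram code in this post passes `htk=True` to librosa, which selects this variant of the scale. The helper names `hz_to_mel` and `mel_to_hz` are chosen for the example; librosa ships its own functions with the same names.

```python
import math


def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels using the HTK formula."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz become smaller and smaller steps in mels
for f in (1000, 2000, 4000, 8000):
    print(f, "Hz ->", round(hz_to_mel(f), 1), "mel")
```

The doubling from 4000 Hz to 8000 Hz covers far fewer mels than the range from 0 Hz to 4000 Hz, which is why a mel spectrogram spends fewer bins on the highest frequencies.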

python

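import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Path to a single xeno-canto recording
SOUND_DIR = "../data/xeno-canto-dataset-full/Parusmajor/Lithuania/Parusmajor182513.mp3"

# Load only the first 10 seconds of the recording
signal, sr = librosa.load(SOUND_DIR, duration=10)

N_FFT = 1024
HOP_SIZE = 1024
N_MELS = 128
WIN_SIZE = 1024
WINDOW_TYPE = "hann"
FEATURE = "mel"
FMIN = 1400  # lowest frequency kept, in Hz; this acts as the high-pass cutoff

S = librosa.feature.melspectrogram(
    y=signal,
    sr=sr,
    n_fft=N_FFT,
    hop_length=HOP_SIZE,
    win_length=WIN_SIZE,
    window=WINDOW_TYPE,
    n_mels=N_MELS,
    htk=True,
    fmin=FMIN,
    fmax=sr / 2,
)

plt.figure(figsize=(10, 4))
# melspectrogram already returns a power spectrogram, so convert it to dB directly
librosa.display.specshow(
    librosa.power_to_db(S, ref=np.max),
    sr=sr,
    hop_length=HOP_SIZE,
    fmin=FMIN,
    fmax=sr / 2,
    htk=True,
    x_axis="time",
    y_axis="mel",
)
plt.colorbar(format="%+2.0f dB")
plt.show()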

The longer the audio clip behind a spectrogram, the more information each image carries, but also the more easily the model can overfit. If the data contains a lot of noise or silence, 5-second clips may miss the information needed. We therefore created images from 10-second clips, which increased the final model accuracy by 10%. Since birds sing at high frequencies, a high-pass filter was applied to remove low-frequency noise.
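
One way to cut a longer recording into fixed-length clips is sketched below. The post does not show how this was done, so the function name `split_into_clips` and the choice to drop a trailing remainder shorter than one clip are assumptions.

```python
def split_into_clips(signal, sr, clip_seconds=10):
    """Return consecutive, non-overlapping clips of clip_seconds each.

    A trailing remainder shorter than one full clip is dropped.
    """
    clip_len = int(sr * clip_seconds)
    n_clips = len(signal) // clip_len
    return [signal[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]


# Toy "recording": 25 seconds sampled at 4 samples per second
sr = 4
signal = list(range(25 * sr))
clips = split_into_clips(signal, sr)
print(len(clips), [len(c) for c in clips])
```

The same slicing works on the NumPy array returned by `librosa.load`, and each resulting clip can then be turned into its own mel spectrogram image.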

Time to model

After creating mel spectrograms with a high-pass filter from 10-second audio files, the data was split into train, validation and test sets.
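
The split itself is not shown in the post. A minimal sketch of a per-class split could look like the following, where the 80/10/10 proportions, the fixed seed, the file names and the function name `split_files` are all assumptions made for illustration.

```python
import random


def split_files(files_by_class, val_frac=0.1, test_frac=0.1, seed=0):
    """Split each class's files into train, val and test lists."""
    rng = random.Random(seed)
    splits = {"train": {}, "val": {}, "test": {}}
    for label, files in files_by_class.items():
        files = sorted(files)
        rng.shuffle(files)  # shuffle within the class so every split sees every species
        n_val = int(len(files) * val_frac)
        n_test = int(len(files) * test_frac)
        splits["val"][label] = files[:n_val]
        splits["test"][label] = files[n_val:n_val + n_test]
        splits["train"][label] = files[n_val + n_test:]
    return splits


# Hypothetical spectrogram file names for two of the classes
files_by_class = {
    "0Parus": [f"parus_{i}.png" for i in range(20)],
    "1Turdu": [f"turdus_{i}.png" for i in range(10)],
}
splits = split_files(files_by_class)
for name in ("train", "val", "test"):
    print(name, {label: len(files) for label, files in splits[name].items()})
```

Splitting class by class keeps rare species represented in all three sets, which matters for an imbalanced dataset like this one.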

python

IM_SIZE = (224, 224, 3)
BIRDS = [
    "0Parus", "1Turdu", "2Passe", "3Lusci", "4Phoen", "5Erith",
    "6Picap", "7Phoen", "8Garru", "9Passe", "10Cocco", "11Sitta",
    "12Alaud", "13Strep", "14Phyll", "15Delic", "16Turdu",
    "17Phyll", "18Fring", "19Sturn", "20Ember", "21Colum",
    "22Trogl", "23Cardu", "24Chlor", "25Motac", "26Turdu",
]
DATA_PATH = "data/27_class_10s_2/"
BATCH_SIZE = 16

Built-in Keras data generators took care of data augmentation and normalization of all spectrograms.

python

from keras.preprocessing.image import ImageDataGenerator
# Assumption: preprocess_input comes from the same efficientnet package as the model
from efficientnet.keras import preprocess_input

# Augmentation is applied to the training images only
train_datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.1,
    fill_mode="nearest",
)

train_batches = train_datagen.flow_from_directory(
    DATA_PATH + "train",
    classes=BIRDS,
    target_size=IM_SIZE[:2],  # flow_from_directory expects (height, width) only
    class_mode="categorical",
    shuffle=True,
    batch_size=BATCH_SIZE,
)

# Validation and test images are only normalized, and kept in a fixed order
valid_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
valid_batches = valid_datagen.flow_from_directory(
    DATA_PATH + "val",
    classes=BIRDS,
    target_size=IM_SIZE[:2],
    class_mode="categorical",
    shuffle=False,
    batch_size=BATCH_SIZE,
)

test_datagen = ImageDataGenerator(preprocessing_function=preprocess_input)
test_batches = test_datagen.flow_from_directory(
    DATA_PATH + "test",
    classes=BIRDS,
    target_size=IM_SIZE[:2],
    class_mode="categorical",
    shuffle=False,
    batch_size=BATCH_SIZE,
)

The final model was built on EfficientNetB3 with 27 output classes (bird species), the Adam optimizer, a categorical cross-entropy loss and balanced class weights. The learning rate was reduced on plateau.

python

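import efficientnet.keras as efn  # assumed source of the efn alias
import numpy as np
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model
from keras.optimizers import Adam
from sklearn.utils import class_weight

# ImageNet-pretrained backbone without its classification head
net = efn.EfficientNetB3(
    include_top=False,
    weights="imagenet",
    input_shape=IM_SIZE,
)

# New classification head: one softmax output per bird species
x = net.output
x = Flatten()(x)
x = Dropout(0.5)(x)
output_layer = Dense(len(BIRDS), activation="softmax", name="softmax")(x)

net_final = Model(inputs=net.input, outputs=output_layer)
net_final.compile(
    optimizer=Adam(),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Weight each class inversely to its frequency in the training set
classes = np.unique(train_batches.classes)
weights = class_weight.compute_class_weight(
    class_weight="balanced",
    classes=classes,
    y=train_batches.classes,
)
# Keras expects a dict that maps class index to weight
class_weights = dict(zip(classes, weights))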
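
The "balanced" option of scikit-learn's `compute_class_weight` follows a simple rule: each class gets the weight n_samples / (n_classes * class_count). The pure-Python sketch below reproduces that rule on toy labels; the function name `balanced_class_weights` and the label counts are made up for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in sorted(counts.items())
    }


# Toy training labels: class 0 is common, class 2 is rare
labels = [0] * 6 + [1] * 3 + [2] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive the largest weights, so their errors count more in the loss and the model is discouraged from simply ignoring them.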

python

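from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# The post does not show how these callbacks were configured;
# the file name and the default settings below are placeholders.
ModelCheck = ModelCheckpoint("model_checkpoint.h5")
ReduceLR = ReduceLROnPlateau()

net_final.fit_generator(
    train_batches,
    validation_data=valid_batches,
    epochs=30,
    steps_per_epoch=1596,
    class_weight=class_weights,
    callbacks=[ModelCheck, ReduceLR],
)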

Results

Finally, the solution predicted the right bird's name with 87% accuracy on the test sample with:

11 classes having F1-score over 90%;
8 classes having F1-score between 70% and 90%;
2 classes having F1-score between 50% and 70%;
6 classes having F1-score below 50%.
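
Overall accuracy and per-class F1 can diverge on an imbalanced dataset, which is why both are reported. For one class, F1 = 2·TP / (2·TP + FP + FN). The sketch below computes it from true and predicted labels; the function name `f1_per_class` and the toy labels are illustrative and are not the project's actual predictions.

```python
def f1_per_class(y_true, y_pred, n_classes):
    """Return a list with the F1-score of every class index."""
    scores = []
    for cls in range(n_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class with no true and no predicted samples gets a score of 0
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
scores = f1_per_class(y_true, y_pred, n_classes=3)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, accuracy)
```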

If you are interested in seeing the code in a Jupyter notebook, check the project repository linked from the original post.

Original article: Sound-Based Bird Classification
