How wav2vec2.0 takes input audio data
Introduction
Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits magically pre-process an audio file so that a neural model can seamlessly take the processed audio features, letting users get by without knowing much of the detail. This is my motivation to write a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0 proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to provide information focusing only on the input structure of wav2vec2.0 to complement the tutorial.
Fig1: Audio input processing of wav2vec2.0 discussed in this post is the part circled by a red line. This figure is taken from Baevski et al.
Audio Data
Audio in the physical world is a continuous signal. On the other hand, audio in the computer world is a discrete representation consisting of separate values called audio samples. An audio file processed by a computer looks like the following when plotted.
Fig2: An example audio waveform. The audio file used to create this plot is ID 8230-279154-0000 of LibriSpeech test-clean.
The figure above still looks like a continuous signal. However, plotting only 100 audio samples, starting from the 2-second mark, gives the figure below.
Fig3: 100 audio samples starting from 2 seconds of the file ID 8230-279154-0000.
This figure hopefully shows that computer-processed audio is a discrete signal. Another important notion in audio processing is the sampling rate: the number of samples captured per second of audio. ASR research typically sets the sampling rate to 16,000, meaning that 1 second of audio contains 16,000 samples. The sampling rate of the audio file used to create the above figures is also 16,000. So the duration of 100 samples corresponds to 100 / 16000 = 0.00625 seconds, or 6.25 milliseconds. It's very, very short!
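To make these numbers concrete, the short snippet below loads the same LibriSpeech file used for the plots (the path assumes test-clean is extracted under a local data directory, as in the later examples) and checks the sampling rate and the duration of 100 samples.
import soundfile as sf
# path assumes LibriSpeech test-clean is extracted under ./data
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# 16000 samples per second
print(sr)
# total duration of the file in seconds
print(len(audio) / sr)
# 0.00625 seconds, i.e. 6.25 milliseconds for 100 samples
print(100 / sr)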
Feeding Audio Data to wav2vec2.0
Fig 1 shows that wav2vec2.0 takes a raw audio waveform as input. The previous section described that the audio waveform consists of audio samples, typically 16,000 discrete values per second. If I'm not mistaken, the original wav2vec2.0 paper by Baevski et al. has only about three sentences on input audio processing.
Zero Mean and Unit Variance
The first part of wav2vec2.0 audio processing is zero mean and unit variance
normalisation.
The raw waveform input to the encoder is normalized to zero mean and unit variance.
Pre-processing acoustic features to zero mean and unit variance is a common practice to mitigate data variation (Viikki and Laurila, 1998). I am not sure how effective this pre-processing is for neural models, though.
The example code block below shows mean and variance normalisation of an audio file using the HuggingFace Wav2Vec2FeatureExtractor. The mean and variance of the processed audio samples are 0.0 and 1.0, respectively.
import numpy as np
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor
# the same audio file from previous graphs
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# 0.00019265260972603013
print(np.mean(audio))
# 0.003338447929336512
print(np.var(audio))
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
"patrickvonplaten/wav2vec2-base")
processed = feature_extractor(audio, sampling_rate=sr)['input_values'][0]
# -4.1631865e-10; very close to 0.0
print(np.mean(processed))
# 0.99997014 ; very close to 1.0
print(np.var(processed))
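For intuition, the normalisation can also be reproduced by hand: subtract the mean and divide by the standard deviation. The sketch below is a minimal version; the small epsilon is my own addition for numerical stability and may differ from the exact constant used inside the HuggingFace implementation.
import numpy as np
import soundfile as sf
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# zero mean, unit variance by hand; eps guards against division by zero
eps = 1e-7
manual = (audio - np.mean(audio)) / np.sqrt(np.var(audio) + eps)
# very close to 0.0
print(np.mean(manual))
# very close to 1.0
print(np.var(manual))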
Feature Encoding with CNNs
The second stage of wav2vec2.0 audio processing is encoding audio samples
using CNN layers.
The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2). This results in an encoder output frequency of 49 hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25ms of audio.
The first CNN layer takes 10 audio samples as input and projects them into a 512-dimensional feature. The stride of 5 means the window then shifts by 5 samples, so the next input consists of the last 5 samples just processed plus the following 5 samples. These samples are again transformed into a 512-dimensional feature.
The second CNN layer takes a 3 (kernel size) x 512 dimensional 2D tensor as input, and because its stride is 2, only the last feature of one window is reused as input to the next convolution operation. The figure below illustrates how the first and second CNN layers process audio samples.
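In addition to the figure, the arithmetic can be checked directly in code. The sketch below builds only the convolutions of the first two feature-encoder layers, with the channel count, kernel widths, and strides from the quote above (the weights are randomly initialised rather than pretrained, and the real encoder blocks also apply normalisation and a GELU activation), and applies them to 1 second of audio.
import torch
import torch.nn as nn
# first two feature-encoder convolutions: 512 channels,
# kernel widths (10, 3) and strides (5, 2)
conv1 = nn.Conv1d(in_channels=1, out_channels=512, kernel_size=10, stride=5)
conv2 = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=3, stride=2)
# a batch of one, containing 1 second of 16 kHz audio
x = torch.randn(1, 1, 16000)
h1 = conv1(x)
h2 = conv2(h1)
# torch.Size([1, 512, 3199])
print(h1.size())
# torch.Size([1, 512, 1599])
print(h2.size())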
This operation continues through the 3rd to 7th CNN layers, and 16,000 audio samples are encoded into a 512 x 49 tensor, corresponding to the 49 Hz output frequency stated in the paper. The code block below demonstrates that 1 second of audio (16,000 values) is indeed transformed into 512 x 49 features using the HuggingFace Wav2Vec2Model.
import torch
from transformers import Wav2Vec2Model
model = Wav2Vec2Model.from_pretrained("patrickvonplaten/wav2vec2-base")
# create 1 second of audio where all values are 1
v = torch.ones(16000)
v = v.unsqueeze(0)
# the CNN encoder is named feature_extractor in the HuggingFace implementation
# torch.Size([1, 512, 49])
model.feature_extractor(v).size()
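The 49 frames can also be derived by hand from the kernel widths and strides in the quote above, using the standard formula for the output length of a convolution without padding. The helper function below is just for illustration.
# output length of a 1D convolution without padding:
# out = floor((in - kernel) / stride) + 1
def conv_output_length(length, kernel, stride):
    return (length - kernel) // stride + 1

kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
length = 16000  # 1 second of 16 kHz audio
for k, s in zip(kernels, strides):
    length = conv_output_length(length, k, s)
# 49 frames, i.e. roughly 49 frames per second of audio
print(length)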
Conclusion
The wav2vec2.0 model is one of the most important self-supervised speech models. There are many online resources describing the core learning algorithm of wav2vec2.0, and many deep learning toolkits, including HuggingFace, have made wav2vec2.0 very accessible for ASR and other speech processing tasks. That said, there is not much coverage of how wav2vec2.0 processes its input audio, which was my motivation to write this post. I hope this post was useful for readers from a non-speech background to understand how input audio is processed for wav2vec2.0.