How wav2vec2.0 takes input audio data
Introduction
Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits magically pre-process an audio file so that a neural model can seamlessly take the processed audio features, letting users get by without knowing much of the detail. This is my motivation to write a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0 proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to provide information focusing only on the input structure of wav2vec2.0 to complement the tutorial.
Fig1: Audio input processing of wav2vec2.0 discussed in this post is the part circled by a red line. This figure is taken from Baevski et al.
Audio Data
Audio in the physical world is a continuous signal. On the other hand, audio in the computer world is a discrete representation consisting of separate values called audio samples. An audio file processed by a computer looks like the following when plotted.
Fig2: An example audio waveform. The audio file used to create this plot is ID 8230-279154-0000 of LibriSpeech test-clean.
The figure above still looks like a continuous signal. However, plotting only 100 audio samples, starting from the 2-second mark, gives the figure below.
Fig3: 100 audio samples starting from 2 seconds of the file ID 8230-279154-0000.
This figure hopefully shows that computer-processed audio is a discrete signal. Another important notion in audio processing is the sampling rate: the number of samples captured per second of audio. ASR research typically sets the sampling rate to 16,000, meaning that 1 second of audio contains 16,000 samples. The sampling rate of the audio file used to create the above figures is also 16,000. So the duration of 100 samples corresponds to 100 / 16000 = 0.00625 seconds, or 6.25 milliseconds. It's very, very short!
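To make these numbers concrete, the short snippet below loads the same LibriSpeech file used for the plots (the path assumes test-clean is extracted under a local data directory, as in the later examples) and checks the sampling rate and the duration of 100 samples.
import soundfile as sf
# path assumes LibriSpeech test-clean is extracted under ./data
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# 16000 samples per second
print(sr)
# total duration of the file in seconds
print(len(audio) / sr)
# 0.00625 seconds, i.e. 6.25 milliseconds for 100 samples
print(100 / sr)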
Feeding Audio Data to wav2vec2.0
Fig 1 shows that wav2vec2.0 takes a raw audio waveform as input. The previous section described that the audio waveform consists of audio samples, typically 16,000 discrete values per second. If I'm not mistaken, the original wav2vec2.0 paper by Baevski et al. has only about three sentences on input audio processing.
Zero Mean and Unit Variance
The first part of wav2vec2.0 audio processing is zero mean and unit variance
normalisation.
The raw waveform input to the encoder is normalized to zero mean and unit variance.
Pre-processing acoustic features to zero mean and unit variance is a common practice to mitigate data variation (Viikki and Laurila, 1998). I am not sure how effective this pre-processing is for neural models, though.
The example code block below shows mean and variance normalisation of an audio file using the HuggingFace Wav2Vec2FeatureExtractor. The mean and variance of the processed audio samples are 0.0 and 1.0, respectively.
import numpy as np
import soundfile as sf
from transformers import Wav2Vec2FeatureExtractor
# the same audio file from previous graphs
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# 0.00019265260972603013
print(np.mean(audio))
# 0.003338447929336512
print(np.var(audio))
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(
"patrickvonplaten/wav2vec2-base")
processed = feature_extractor(audio, sampling_rate=sr)['input_values'][0]
# -4.1631865e-10; very close to 0.0
print(np.mean(processed))
# 0.99997014 ; very close to 1.0
print(np.var(processed))
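For intuition, the normalisation can also be reproduced by hand: subtract the mean and divide by the standard deviation. The sketch below is a minimal version; the small epsilon is my own addition for numerical stability and may differ from the exact constant used inside the HuggingFace implementation.
import numpy as np
import soundfile as sf
audio, sr = sf.read("data/LibriSpeech/test-clean/8230/279154/8230-279154-0000.flac")
# zero mean, unit variance by hand; eps guards against division by zero
eps = 1e-7
manual = (audio - np.mean(audio)) / np.sqrt(np.var(audio) + eps)
# very close to 0.0
print(np.mean(manual))
# very close to 1.0
print(np.var(manual))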
Feature Encoding with CNNs
The second stage of wav2vec2.0 audio processing is encoding audio samples
using CNN layers.
The feature encoder contains seven blocks and the temporal convolutions in each block have 512 channels with strides (5,2,2,2,2,2,2) and kernel widths (10,3,3,3,3,2,2). This results in an encoder output frequency of 49 hz with a stride of about 20ms between each sample, and a receptive field of 400 input samples or 25ms of audio.
The first CNN layer takes 10 audio samples as input and projects them into a 512-dimensional feature. The stride of 5 means the window then shifts by 5 samples, so the next input consists of the last 5 samples just processed plus the following 5 samples. These samples are again transformed into a 512-dimensional feature.
The second CNN layer takes a 3 (kernel size) x 512 dimensional 2D tensor as input, and because its stride is 2, only the last feature of one window is reused as input to the next convolution operation. The figure below illustrates how the first and second CNN layers process audio samples.
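In addition to the figure, the arithmetic can be checked directly in code. The sketch below builds only the convolutions of the first two feature-encoder layers, with the channel count, kernel widths, and strides from the quote above (the weights are randomly initialised rather than pretrained, and the real encoder blocks also apply normalisation and a GELU activation), and applies them to 1 second of audio.
import torch
import torch.nn as nn
# first two feature-encoder convolutions: 512 channels,
# kernel widths (10, 3) and strides (5, 2)
conv1 = nn.Conv1d(in_channels=1, out_channels=512, kernel_size=10, stride=5)
conv2 = nn.Conv1d(in_channels=512, out_channels=512, kernel_size=3, stride=2)
# a batch of one, containing 1 second of 16 kHz audio
x = torch.randn(1, 1, 16000)
h1 = conv1(x)
h2 = conv2(h1)
# torch.Size([1, 512, 3199])
print(h1.size())
# torch.Size([1, 512, 1599])
print(h2.size())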
This operation continues through the 3rd to 7th CNN layers, and 16,000 audio samples are encoded into a 512 x 49 tensor, corresponding to the 49 Hz output frequency stated in the paper. The code block below demonstrates that 1 second of audio (16,000 values) is indeed transformed into 512 x 49 features using the HuggingFace Wav2Vec2Model.
import torch
from transformers import Wav2Vec2Model
model = Wav2Vec2Model.from_pretrained("patrickvonplaten/wav2vec2-base")
# create 1 second of audio where all values are 1
v = torch.ones(16000)
v = v.unsqueeze(0)
# the CNN encoder is named feature_extractor in the HuggingFace implementation
# torch.Size([1, 512, 49])
model.feature_extractor(v).size()
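The 49 frames can also be derived by hand from the kernel widths and strides in the quote above, using the standard formula for the output length of a convolution without padding. The helper function below is just for illustration.
# output length of a 1D convolution without padding:
# out = floor((in - kernel) / stride) + 1
def conv_output_length(length, kernel, stride):
    return (length - kernel) // stride + 1

kernels = (10, 3, 3, 3, 3, 2, 2)
strides = (5, 2, 2, 2, 2, 2, 2)
length = 16000  # 1 second of 16 kHz audio
for k, s in zip(kernels, strides):
    length = conv_output_length(length, k, s)
# 49 frames, i.e. roughly 49 frames per second of audio
print(length)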
Conclusion
The wav2vec2.0 model is one of the most important self-supervised speech models. There are many online resources describing the core learning algorithm of wav2vec2.0, and many deep learning toolkits, including HuggingFace, have made wav2vec2.0 very accessible for ASR and other speech processing tasks. That said, there is not much coverage of how wav2vec2.0 processes its input audio, which was my motivation to write this post. I hope this post was useful for readers from a non-speech background to understand how input audio is processed for wav2vec2.0.