How wav2vec2.0 takes input audio data
Introduction

Building an automatic speech recognition (ASR) system has become easier than ever thanks to deep learning and the availability of ASR toolkits. These toolkits pre-process an audio file almost magically, and a neural model seamlessly consumes the processed audio features, so users need not know the details. This motivated me to write a summary of how audio files are processed for a neural model. In this post, I focus on wav2vec2.0, proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial on fine-tuning wav2vec2.0 for the ASR task. Here I focus only on the input structure of wav2vec2.0 to complement that tutorial.

Fig. 1: Audio input processing of wav2vec2.0 discussed in this post is the part circled in red. This figure is taken from Baevski et al.
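Before diving in, it helps to see how little pre-processing wav2vec2.0 actually requires: the model takes the raw waveform itself (16 kHz, mono), normalized to floating-point values. Below is a minimal sketch using only the Python standard library; the file name `tone.wav` and the helper `load_waveform` are illustrative, not part of any toolkit.

```python
import math
import struct
import wave

def load_waveform(path):
    """Read a mono 16-bit PCM WAV file and return samples as floats in [-1, 1].

    wav2vec2.0 consumes the raw waveform directly (16 kHz mono), so this
    normalization is essentially all the pre-processing needed before the
    model's own convolutional feature encoder takes over.
    """
    with wave.open(path, "rb") as wf:
        assert wf.getnchannels() == 1 and wf.getsampwidth() == 2
        n = wf.getnframes()
        raw = wf.readframes(n)
    samples = struct.unpack("<%dh" % n, raw)
    return [s / 32768.0 for s in samples]

# Create a 1-second 440 Hz sine tone at 16 kHz purely for demonstration.
sr = 16000
with wave.open("tone.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(sr)
    frames = b"".join(
        struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * t / sr)))
        for t in range(sr)
    )
    wf.writeframes(frames)

waveform = load_waveform("tone.wav")
print(len(waveform))  # 16000 samples = 1 second of audio at 16 kHz
print(all(-1.0 <= s <= 1.0 for s in waveform))  # True
```

In a real pipeline a library such as torchaudio or soundfile would do the decoding, but the end result handed to the model is the same: a 1-D array of normalized samples.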