How wav2vec2.0 takes input audio data
Introduction

Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits pre-process an audio file behind the scenes, and a neural model seamlessly consumes the processed audio features, so users do not need to know many of the details.

This motivated me to write a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0, proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to complement that tutorial by focusing only on the input structure of wav2vec2.0.

Fig1: The audio input processing of wav2vec2.0 discussed in this post is the part circled in red. This figure is taken ...