Posts

Showing posts from February, 2023

How wav2vec2.0 takes input audio data

Introduction Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits magically pre-process an audio file, and a neural model can seamlessly take the processed audio features, so users do not need to know many of the details. This motivated me to write a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0, proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to provide information focusing only on the input structure of wav2vec2.0 to complement the tutorial.
Fig. 1: Audio input processing of wav2vec2.0 discussed in this post is the part circled by a red line. This figure is taken from Baevski et al.
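As a rough illustration of what that input pre-processing amounts to in practice, here is a minimal Python sketch using the Hugging Face transformers API that the tutorial above builds on. The checkpoint name facebook/wav2vec2-base and the synthetic one-second waveform are illustrative assumptions, not details taken from the post; the point is that wav2vec2.0 consumes raw 16 kHz mono audio, which the feature extractor only normalises before the model's convolutional encoder turns it into frame-level representations.

# A minimal sketch of feeding raw audio to wav2vec2.0 with Hugging Face
# transformers. The checkpoint name and the synthetic waveform are
# assumptions for illustration, not taken from the post.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

sampling_rate = 16_000  # wav2vec2.0 expects 16 kHz mono audio
waveform = np.random.randn(sampling_rate).astype(np.float32)  # 1 second of dummy audio

# The feature extractor does not compute spectrograms; it only normalises
# the raw waveform and handles padding for batching.
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
inputs = feature_extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
with torch.no_grad():
    outputs = model(**inputs)

# The CNN feature encoder downsamples the 16 kHz waveform to roughly 49
# frames per second, so 16,000 samples become about 49 hidden-state vectors.
print(inputs["input_values"].shape)     # e.g. torch.Size([1, 16000])
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 49, 768])

The same pattern applies when fine-tuning for ASR; only the model head changes, while the raw-waveform input stays identical.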

SLT 2022 Notes

Attending an Offline Conference Again  I was fortunate to have the opportunity to attend the Spoken Language Technology (SLT) Workshop 2022, held from 9th to 12th January in Doha, Qatar. I presented my final PhD work at the conference. Going back to an (almost) in-person conference was so refreshing, and I learned so many things. In this post, I'll summarise my takeaways from the SLT 2022 conference. The photo below shows the conference venue, Marsa Malaz Kempinski.  Human Speech Processing and Neural Networks Since my time as a linguistics student (back in 2012), I have always been keen to follow studies on the relevance of human language processing to computer speech processing. On this topic, I remember one of the keynote presentations and one poster. The first keynote talk, by Nima Mesgarani, was interesting, noting that the shallow layers of an RNN-T encoder learn acoustic and phonetic information occurring in short time frames, while the later layers capture lexical and semantic information which can