Posts

How wav2vec2.0 takes input audio data

Introduction: Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits magically pre-process an audio file, and a neural model can seamlessly take the processed audio features, so users need not know much about the details. This is my motivation for writing a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0, proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to provide information focusing only on the input structure of wav2vec2.0 to complement the tutorial. Fig1: Audio input processing of wav2vec2.0 discussed in this post is the part circled by a red line. This figure is taken from Baevski et al.
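To make this concrete, below is a minimal sketch of feeding raw audio to wav2vec2.0 with the Hugging Face transformers API. This is my own illustration rather than code from the post; the checkpoint name, the soundfile library, and sample.wav are assumptions, and the input is expected to be 16 kHz mono audio.

    import soundfile as sf  # assumed library for reading the wav file
    import torch
    from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

    # "sample.wav" is a hypothetical 16 kHz mono recording.
    speech, sr = sf.read("sample.wav")

    # The feature extractor does not compute spectrograms: wav2vec2.0 takes
    # the raw waveform, optionally zero-mean/unit-variance normalised.
    extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
    inputs = extractor(speech, sampling_rate=sr, return_tensors="pt")

    model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state

    # The convolutional encoder emits roughly one frame per 20 ms of audio,
    # with hidden size 768 for the base model.
    print(hidden.shape)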

SLT 2022 Notes

Attending an Offline Conference Again: I was fortunate to have the opportunity to attend the Spoken Language Technology (SLT) Workshop 2022, held from 9th to 12th January 2023 in Doha, Qatar. I presented my final PhD work at the conference. Going back to an (almost) in-person conference was so refreshing, and I learned so many things. In this post, I'll summarise my takeaways from the SLT 2022 conference. The photo below is the conference venue, Marsa Malaz Kempinski. Human Speech Processing and Neural Networks: Since my days as a linguistics student (back in 2012), I have always been keen to follow studies on the relevance of human language processing to computer speech processing. On this topic, I remember one of the keynote presentations and one poster. The first keynote talk, by Nima Mesgarani, was interesting: it mentioned that the shallow layers of an RNN-T encoder learn acoustic and phonetic information occurring in short time frames, while the later layers capture lexical and semantic information which can

Setting up a Git pre-push hook

Motivation: I often forget to run tests before pushing my commits. This is bad practice: someone might use my broken code in a collaborative environment. I was looking for a way to automatically run tests before creating a commit or pushing commits. The solution is Git Hooks: Customizing Git Hooks. Git Hooks: Many hooks can be set up in a Git repository. For example, Git Hooks can run before creating a commit (pre-commit), after creating a commit (post-commit), and before pushing commits to a remote branch (pre-push). All of these are client-side hooks, which are not shared across local repositories. The other type is server-side hooks, which run on the remote server when commits are pushed and can impose policy on all users of the repository. I have set up pre-push in a Git repository to avoid forgetting to run tests before sending updates to the remote server. Setting up pre-push: The way to set up pre-push is as follows:
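The excerpt is cut off here, so what follows is only my minimal sketch of a pre-push hook, not the post's own setup. Git runs any executable placed at .git/hooks/pre-push; a shell script is the most common choice, but a Python script with a shebang works the same way. The pytest command is an assumption; substitute your project's test runner.

    #!/usr/bin/env python3
    # Save as .git/hooks/pre-push and make it executable:
    #     chmod +x .git/hooks/pre-push
    import subprocess
    import sys

    # Run the test suite ("pytest" is an assumed test command).
    result = subprocess.run(["pytest"])

    # Exiting with a non-zero status makes Git abort the push.
    sys.exit(result.returncode)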

Understanding Kaldi lattice-prune and beam

Introduction: For a long time, I had been thinking that Kaldi lattice-prune uses a beam which keeps the specified number of candidates alive. Shame on me. I finally took the time to properly understand it. There are several excellent Kaldi notes (for example, Josh Meyer's blog), but I could not find information specifically about lattice-prune. My intention with this blog post is to record information about Kaldi lattice-prune and --beam (assuming that someone still uses Kaldi or hybrid ASR systems in 2022). TL;DR: lattice-prune --beam=4, for example, deletes all paths of a lattice whose cost exceeds <best_path_cost> + 4, as can be seen in this line of the source code. Explanation: To understand how the Kaldi beam works, I took one of the utterances in the LibriSpeech dev set: 1272-128104-0000. The lattice of this utterance should be in exp/chain_cleaned/tdnn_1d_sp/decode_dev_clean_tgsmall/lat.1.gz once the script kaldi/egs/librispeech/
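To illustrate the TL;DR rule, here is a toy sketch in Python. This is my own illustration of the semantics, not Kaldi's implementation: the real lattice-prune prunes states and arcs of the lattice rather than enumerating whole paths, and the path names and costs below are made up.

    def prune_paths(path_costs, beam=4.0):
        """Keep paths whose cost is at most best_cost + beam (lower is better)."""
        best_cost = min(path_costs.values())
        return {p: c for p, c in path_costs.items() if c <= best_cost + beam}

    # Made-up path costs for illustration.
    paths = {"path_a": 10.2, "path_b": 13.9, "path_c": 15.0}

    # With beam=4, the threshold is 10.2 + 4 = 14.2, so path_c is pruned.
    print(prune_paths(paths, beam=4.0))  # {'path_a': 10.2, 'path_b': 13.9}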