Posts

Discrete Fourier Transform

The last part of the previous post mentions the method for finding frequencies in a signal: the Discrete Fourier Transform (DFT). This post dives deeper into the DFT. The main idea of the DFT is to find out which frequency component correlates with the given input signal. The mathematical formula looks scary:

\[ X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j 2\pi k n / N} \]

k = the frequency currently checked for correlation
n = the index of the current sample
N = the number of samples
x[n] = the value of the current sample of the given signal

My goal in this post is to demonstrate that what the DFT performs is simple.

Input signal and correlation signal

Let's say the input signal is a sine wave of 2 Hz, represented by 30 samples. The input signal of 2 Hz should show the highest correlation at 2 Hz. Let's also have correlation signals varying from 0 Hz to 5 Hz. The first correlation signal has i...
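To make the correlation view concrete, here is a minimal sketch (my own illustration, not the notebook from the post) that evaluates the sum above for k = 0 to 5 against a 2 Hz sine represented by 30 samples; the magnitude |X[k]| peaks at k = 2:

import numpy as np

N = 30                                    # number of samples
n = np.arange(N)
x = np.sin(2 * np.pi * 2 * n / N)         # input signal: 2 cycles over the N samples (2 Hz)

for k in range(6):                        # candidate frequencies 0 Hz to 5 Hz
    X_k = np.sum(x * np.exp(-2j * np.pi * k * n / N))   # X[k] from the formula above
    print(k, round(abs(X_k), 3))          # the magnitude peaks at k = 2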

Sound frequency

Introduction

The signal frequency is the "pitch" of the sound. Some facts about sound frequencies you might encounter in a pub quiz...

Typically, male voices range from 85 to 180 Hz and female voices from 165 to 255 Hz.
Humans can easily hear sound frequencies up to 8,000 Hz and lose the ability to hear sounds beyond that frequency with age.
The music note C is 261.63 Hz and E is 329.63 Hz.

The Python notebook is a convenient playground to generate sounds of those frequencies and listen to them: https://github.com/yasumori/blog/blob/main/2025/2025_12_21_signal2.ipynb. An example code snippet to generate a 2,000 Hz sound is below:

import numpy as np
from IPython import display

def gen_audio(frequency, duration, sample_rate):
    t = np.linspace(0, duration, int(duration * sample_rate))
    return np.sin(2 * np.pi * frequency * t)

hz_2000 = gen_audio(2000, 3, 44100)
disp...
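The truncated line above presumably plays the generated array in the notebook; a minimal playback sketch (my addition, assuming IPython's display.Audio, which the import above suggests), reusing gen_audio for the note C mentioned earlier:

note_c = gen_audio(261.63, 3, 44100)   # 3 seconds of the music note C at 44.1 kHz
display.Audio(note_c, rate=44100)      # renders an audio player in the notebook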

Digital Signal Processing Basics

Acoustic Signal

I had a hard time understanding the basics of digital signal processing. I think the reason was that sound is not visible. There is a way to "see" sound, though: this YouTube video, for example, demonstrates that special cameras can make sound visible.

Sound is the vibration of particles in the air. Something invisible surrounding us rapidly moves back and forth, and our ears can hear this movement. I am writing this post to cover the very basics of digital signal processing that I was very slow to understand. I will cover the concepts of analogue-to-digital conversion, sampling and the Nyquist frequency.

Analogue vs Digital

Apart from sound not usually being visible to us, I think I initially mixed up the concept of a digital signal with an analogue signal. I was still new to the idea that computers represent everything in discrete numbers. Two important things: An analogue s...
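To make the sampling and Nyquist ideas concrete, here is a small sketch (my addition, not from the post itself): a 5 Hz sine sampled at 100 Hz is captured faithfully, while sampling it at 8 Hz, below the Nyquist rate of 10 Hz, aliases it to a 3 Hz sine.

import numpy as np

f = 5.0                                       # analogue signal frequency in Hz

def sample(rate, duration=1.0):
    t = np.arange(0, duration, 1.0 / rate)    # discrete sampling instants
    return np.sin(2 * np.pi * f * t)

x_good = sample(100)    # 100 Hz > 2 * 5 Hz: the samples trace the 5 Hz sine
x_alias = sample(8)     # 8 Hz < 2 * 5 Hz: the samples trace a 3 Hz sine (aliasing)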

How wav2vec2.0 takes input audio data

Introduction

Building an automatic speech recognition (ASR) system has become easier than ever thanks to the availability of deep learning and ASR toolkits. These toolkits magically pre-process an audio file so that a neural model can seamlessly take the processed audio features, without users needing to know much of the detail. This is my motivation for writing a summary of how audio files are processed for a neural model. In this post, I will focus on wav2vec2.0, proposed by Baevski et al., which learns speech representations in a self-supervised manner. This post from Hugging Face is an excellent tutorial showing how to fine-tune wav2vec2.0 for the ASR task. I would like to provide information focusing only on the input structure of wav2vec2.0 to complement the tutorial.

Fig1: Audio input processing of wav2vec2.0 discussed in this post is the part circled by a red line. This figure is taken ...
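As a rough sketch of the input side only (my addition; the file name is a placeholder, and the checkpoint is Hugging Face's facebook/wav2vec2-base-960h), wav2vec2.0 takes the raw 16 kHz waveform rather than hand-crafted spectral features:

import torchaudio
from transformers import Wav2Vec2FeatureExtractor

waveform, sr = torchaudio.load("example.wav")       # placeholder: a mono 16 kHz file
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
inputs = extractor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt")
print(inputs.input_values.shape)                     # (1, num_samples): the waveform itself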

SLT 2022 Notes

Attending an Offline Conference Again

I was fortunate to have an opportunity to attend the Spoken Language Technology (SLT) Workshop 2022, held from 9th to 12th January in Doha, Qatar. I presented my final PhD work at the conference. Going back to an (almost) in-person conference was so refreshing, and I learned so many things. In this post, I'll summarise my takeaways from the SLT 2022 conference. The photo below is the conference venue, Marsa Malaz Kempinski.

Human Speech Processing and Neural Networks

As a linguistics student (back in 2012), I have always been keen to learn about studies on the relevance of human language processing to computer speech processing. On this topic, I remember one of the keynote presentations and one poster. The first keynote talk, by Nima Mesgarani, was interesting, mentioning that the shallow layers of an RNN-T encoder learn acoustic and phonetic information occurring in short time frames while the later layers capture lexical and semantic information which ca...

Setup git pre-push hook

Motivation

I often forget to run tests before pushing my commits. This is a bad practice: someone might use my broken code in a collaborative environment. I was looking for a way to automatically run tests before creating a commit or pushing commits. The solution is Git Hooks: Customizing Git Hooks.

Git Hooks

Many hooks can be set up in a Git repository. For example, Git Hooks can run before creating a commit (pre-commit), after creating a commit (post-commit) and before pushing commits to a remote branch (pre-push). All of these are client-side hooks, and client-side hooks are not shared across local repositories. The other type of hooks are server-side hooks, which run when commits are pushed to a remote server and impose policy on all users of the repository. I have set up pre-push in a Git repository to avoid forgetting to run tests before sending updates to the remote server.

Setup pre-push

The way to set up pr...
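A hook is simply an executable file placed in .git/hooks/; below is a minimal pre-push sketch written in Python (my addition, assuming the project's tests run with pytest). Save it as .git/hooks/pre-push and make it executable with chmod +x:

#!/usr/bin/env python3
# .git/hooks/pre-push: run the test suite and abort the push if it fails.
import subprocess
import sys

result = subprocess.run(["pytest"])   # assumption: tests are run with pytest
sys.exit(result.returncode)           # a non-zero exit code makes git cancel the push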

Understanding kaldi lattice-prune and beam

Introduction

For a long time, I had been thinking that Kaldi lattice-prune uses a beam which keeps a specified number of candidates alive. Shame on me. I finally took the time to understand this properly. There are several excellent Kaldi notes (for example Josh Meyer's blog), but I could not find information specifically about lattice-prune. My intention with this blog post is to leave information about Kaldi lattice-prune and --beam (assuming that someone uses Kaldi or hybrid ASR systems in 2022).

TL;DR

lattice-prune --beam=4, for example, deletes all paths of a lattice whose cost exceeds <best_path_cost> + 4, as can be seen in this line of the source code.

Explanation

To understand how the Kaldi beam works, I took one of the utterances in the LibriSpeech dev set: 1272-128104-0000. The lattice of this utterance should be in exp/chain_cleaned/tdnn_1d_sp/decode_dev_clean_tgsmall/lat.1.gz once the ...
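As a toy illustration of the TL;DR rule (my own sketch, not Kaldi's implementation, which prunes states and arcs of the lattice rather than enumerating whole paths), --beam=4 amounts to keeping only the paths whose cost stays within 4 of the best path:

def prune(path_costs, beam):
    best = min(path_costs)                        # cost of the best path
    return [c for c in path_costs if c <= best + beam]

print(prune([10.0, 12.5, 13.9, 15.1], beam=4))    # 15.1 > 10.0 + 4, so that path is dropped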