Posts

Basic Probability

Introduction Probability theory is the foundation of machine learning. Knowledge of machine learning is a requirement for working on a speech and language processing project today. So, probability theory is essential for speech and language processing projects! The objective of this post is to refresh my knowledge of probability theory. I am keen to connect probability theory with real-world examples, and to avoid throwing around a bunch of theoretical definitions. Feel free to leave comments if anything I have written is incorrect. Probability Probability is the chance that an event occurs. It is a value between 0 and 1. In contrast, human words are not mathematical. Even if I say "I'll go to Paris next year, 100%", I might not go to Paris next year. When I was in Paris last time. Mathematical probability, unlike human words, has to be precise. A ...

Acoustic features for speech processing

Introduction This post summarises acoustic features for various speech processing tasks. Automatic speech recognition (ASR) is one of the most studied speech processing tasks. Acoustic features for ASR include Mel-frequency cepstral coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms and Mel-filter banks. The choice of acoustic features depends on the choice of ASR model: traditional machine learning (ML) models such as Gaussian mixture models (GMMs) have difficulty handling correlated features, so MFCCs are favoured for their de-correlated coefficients. More recent deep-learning-based models, e.g., Conformer, use acoustic feature vectors with correlation between neighbouring dimensions: filterbanks. A popular model from OpenAI, Whisper, directly takes as input a log-Mel spectrogram, which is technically the same representation as filterbanks (will be explained later). ...

Visualising a speech signal

Speech Visualisation This post covers visualisation of a speech signal: plotting a waveform, annotating a waveform and showing speech spectra. I am using the first speech file (BASIC5000_0001) of the JSUT corpus, which consists of 10 hours of recordings of a Japanese female speaker. JSUT ver 1.1 BASIC5000_0001 My code is all written in this Python notebook: https://github.com/yasumori/blog/blob/main/2026/2026_01_visualisation.ipynb . You should be able to run it after installing the required libraries: librosa, matplotlib, and numpy. The first speech file is also uploaded to my GitHub, following the terms of use "Re-distribution is not permitted, but you can upload a part of this corpus (e.g., ~100 audio files) in your website or blog". import librosa import subprocess # load audio in 16kHz signal, sr = librosa.load("./data/BASIC5000_0001.wav", sr=16000) print(f"number of samples: {len(signal)}") print(f"duration {len(signal)/sr} ...

Decibel and Logarithms

Introduction The decibel is a unit that expresses the loudness of sounds, and an important measurement in sound processing. The decibel is a logarithmic scale of sound intensity. The reason for using the logarithm is that human hearing is logarithmic rather than linear. The decibel is also not an absolute metric but a relative ratio of the intensity of one sound compared to another. This part is very confusing because we think of familiar measurements like length ("cm, m, km") and weight ("g, kg...") as absolute units. There are many online resources that explain either logarithms or the decibel, but I don't see resources that explain both. This is my motivation for creating this post: keeping information about logarithms and the decibel on one page. Logarithms The logarithm is the inverse operation of exponentiation (raising to a power). \[ 2^3 = 8 \] \[ \log_2 8 = 3 \] The log of 8 to the base 2 is 3, and 2 to...
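To make the two ideas in this excerpt concrete, here is a minimal Python sketch. The first part checks that the logarithm undoes the exponent from the example. The second part uses the usual definition of the decibel for an intensity ratio, 10 times the base-10 logarithm of the ratio. That formula is not shown in the excerpt, and the function name `intensity_ratio_to_db` is only an illustrative choice of mine.

```python
import math

# The logarithm is the inverse of the exponent: 2^3 = 8, so log2(8) = 3.
power = 2 ** 3
exponent = math.log2(power)
print(power, exponent)  # 8 3.0


def intensity_ratio_to_db(intensity, reference):
    """Express one sound intensity relative to another in decibels.

    Assumption: the common definition 10 * log10(I / I_ref) for intensity
    ratios. The function name is hypothetical.
    """
    return 10 * math.log10(intensity / reference)


# A sound with 100 times the reference intensity is 20 dB above it.
db_100x = intensity_ratio_to_db(100.0, 1.0)
# The same intensity as the reference is 0 dB: the decibel is relative.
db_same = intensity_ratio_to_db(1.0, 1.0)
print(db_100x, db_same)  # 20.0 0.0
```

The second print shows why the decibel is a ratio and not an absolute unit: the result depends only on how the two intensities compare, not on their individual values.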

Discrete Fourier Transform

Discrete Fourier Transform The last part of the previous post mentions a method for finding the frequencies in a signal: the Discrete Fourier Transform (DFT). This post dives deeper into the DFT. The main idea of the DFT is to find out which frequency components correlate with the given input signal. This mathematical formula looks scary. \[ X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi k n / N} \] k = the current frequency whose correlation is being checked, n = the current sample, N = the number of samples, x[n] = the value of the current sample of the given signal. My goal in this post is to demonstrate that what the DFT does is simple. Input signal and correlation signal Let's say the input signal is a sine wave of 2 Hz. There are 30 samples to represent this signal. The input signal of 2 Hz should show the highest correlation at 2 Hz. Let's also have 5 different correlation signals varying from 0 Hz to 5 Hz. The first correlation signal has i...
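The formula and the 2 Hz example above can be tried out with a few lines of plain Python. This is a minimal sketch that evaluates the sum directly, not an optimised FFT. It assumes the 30 samples span exactly one second, so that frequency index k corresponds to k Hz. The excerpt does not state the sampling rate, so that is my assumption, and the names `dft`, `magnitudes` and `peak_hz` are only illustrative.

```python
import cmath
import math


def dft(x):
    """Naive DFT: X[k] = sum over n of x[n] * exp(-j * 2 * pi * k * n / N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]


# Assumption: 30 samples cover one second, so bin k corresponds to k Hz.
N = 30
signal = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]  # 2 Hz sine

spectrum = dft(signal)

# The magnitude of X[k] measures how strongly the signal correlates with
# frequency k. Only 0 Hz to 5 Hz are inspected here.
magnitudes = [abs(X) for X in spectrum[:6]]
peak_hz = max(range(6), key=lambda k: magnitudes[k])

for k, m in enumerate(magnitudes):
    print(f"{k} Hz: {m:.3f}")
print(f"highest correlation at {peak_hz} Hz")
```

Under these assumptions the magnitude at 2 Hz is N/2 = 15 and the other inspected bins are zero up to floating-point rounding, which matches the expectation that a 2 Hz input correlates most strongly at 2 Hz.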