Posts

Acoustic features for speech processing

Introduction
This post summarises acoustic features for various speech processing tasks. Automatic speech recognition (ASR) is one of the most studied speech processing tasks. Acoustic features for ASR include Mel-frequency cepstral coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms and Mel-filter banks. The choice of acoustic features depends on the choice of ASR model: traditional machine learning (ML) models such as Gaussian mixture models (GMMs) have difficulty handling correlated features, so MFCCs are favoured for their de-correlated coefficients. More recent deep learning based models, e.g. Conformer, use filterbanks: acoustic feature vectors whose neighbouring dimensions are correlated. A popular model from OpenAI, Whisper, directly takes as input a log-Mel spectrogram, which is technically the same representation as filterbanks (explained later). ...

Visualising a speech signal

Speech Visualisation
This post covers visualisation of a speech signal: plotting a waveform, annotating a waveform and showing speech spectra. I am using the first speech file (BASIC5000_0001) of the JSUT corpus, which consists of 10 hours of recordings of a Japanese female speaker.
JSUT ver 1.1 BASIC5000_0001
My code is all written in this Python notebook: https://github.com/yasumori/blog/blob/main/2026/2026_01_visualisation.ipynb . You should be able to run it after installing the required libraries: librosa, matplotlib, and numpy. The first speech file is also uploaded to my GitHub, following the terms of use: "Re-distribution is not permitted, but you can upload a part of this corpus (e.g., ~100 audio files) in your website or blog".

    import librosa
    import subprocess

    # load audio in 16kHz
    signal, sr = librosa.load("./data/BASIC5000_0001.wav", sr=16000)
    print(f"number of samples: {len(signal)}")
    print(f"duration {len(signal)/sr} ...

Decibel and Logarithms

Introduction
The decibel is a unit that expresses the loudness of sounds, and an important measurement in sound processing. The decibel is a logarithmic scale of sound intensity. The reason to use the logarithm is that human hearing is logarithmic rather than linear. The decibel is also not an absolute metric but a relative ratio of the intensity of one sound compared to another. This part is very confusing because familiar measurements like length ("cm, m, km") and weight ("g, kg...") are absolute units. There are many online resources that explain either logarithms or the decibel, but I don't see resources that explain both. This is my motivation for this post: keeping information about logarithms and the decibel on one page.
Logarithms
The logarithm is the inverse operation of exponentiation (raising to a power). \[ 2^3 = 8 \] \[ \log_2 8 = 3 \] The log of 8 to the base 2 is 3, and 2 to...
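The two ideas above, the logarithm as the inverse of a power and the decibel as a logarithm of a ratio, can be sketched in a few lines of Python. This is only a minimal illustration: the helper name `intensity_ratio_to_db` is hypothetical, and it assumes the standard definition of the decibel for intensity ratios, 10 times the base-10 logarithm of the ratio, which the excerpt itself does not spell out.

```python
import math

# The logarithm undoes the power: 2**3 == 8, so log2(8) == 3.
power = 2 ** 3
log_value = math.log2(power)

def intensity_ratio_to_db(intensity, reference):
    """Hypothetical helper: decibels for the ratio of two sound intensities.

    Assumes the standard definition 10 * log10(intensity / reference).
    """
    return 10 * math.log10(intensity / reference)

# The result is relative: only the ratio of the two intensities matters.
db_10x = intensity_ratio_to_db(10.0, 1.0)    # 10 times the reference intensity
db_100x = intensity_ratio_to_db(100.0, 1.0)  # 100 times the reference intensity
db_same = intensity_ratio_to_db(5.0, 5.0)    # equal intensities

print(log_value, db_10x, db_100x, db_same)
```

Each tenfold increase in intensity adds the same number of decibels, which is what makes the scale logarithmic rather than linear.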

Discrete Fourier Transform

Discrete Fourier Transform
The last part of the previous post mentions the method to find frequencies in a signal: the Discrete Fourier Transform (DFT). This post dives deeper into the DFT. The main idea of the DFT is to find out which frequency component correlates with the given input signal. This mathematical formula looks scary. \[ X[k] = \sum_{n=0}^{N-1} x[n] \, e^{-j2\pi k n / N} \]
k = current frequency to check correlation
n = current sample
N = number of samples
x[n] = the value of the current sample of the given signal
My goal in this post is to demonstrate that what the DFT does is simple.
Input signal and correlation signal
Let's say the input signal is a sine wave of 2 Hz. There are 30 samples to represent this signal. The input signal of 2 Hz should show the highest correlation at 2 Hz. Let's also have 5 different correlation signals varying from 0 Hz to 5 Hz. The first correlation signal has i...
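The formula above can be evaluated directly with the Python standard library. The sketch below is a naive implementation for illustration only, not the notebook's code. It assumes that the 30 samples span exactly one second, so that frequency index k corresponds to k Hz, and the names `dft`, `magnitudes` and `peak_bin` are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive DFT: X[k] = sum over n of x[n] * exp(-j * 2 * pi * k * n / N)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / N) for n in range(N))
        for k in range(N)
    ]

# Input signal: a 2 Hz sine wave represented by 30 samples.
# Assumption: the 30 samples cover one second, so bin k is k Hz.
N = 30
signal = [math.sin(2 * math.pi * 2 * n / N) for n in range(N)]

# The magnitude of X[k] measures how strongly frequency k correlates
# with the input signal.
magnitudes = [abs(value) for value in dft(signal)]

# Search only bins 0..N/2; the upper bins mirror the lower ones for real input.
peak_bin = max(range(N // 2 + 1), key=lambda k: magnitudes[k])
print(f"strongest correlation at {peak_bin} Hz")
```

Under these assumptions the largest magnitude appears at k = 2, matching the frequency of the input sine wave, while the other bins in the lower half stay close to zero.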

Sound frequency

Introduction
The signal frequency is the "pitch" of the sound. Some facts about sound frequencies you might encounter in a pub quiz... Typically, male voices range from 85 to 180 Hz and female voices from 165 to 255 Hz. Humans can easily hear sound frequencies up to 8,000 Hz and gradually lose the ability to hear sounds beyond that frequency with age. The musical note C is 261.63 Hz and E is 329.63 Hz. The Python notebook is a convenient playground to generate sounds of those frequencies and listen to them: https://github.com/yasumori/blog/blob/main/2025/2025_12_21_signal2.ipynb . An example code snippet to generate a 2,000 Hz sound is also below:

    import numpy as np
    from IPython import display

    def gen_audio(frequency, duration, sample_rate):
        # endpoint=False keeps the sample spacing at exactly 1 / sample_rate
        t = np.linspace(0, duration, int(duration * sample_rate), endpoint=False)
        return np.sin(2 * np.pi * frequency * t)

    hz_2000 = gen_audio(2000, 3, 44100)
    disp...

Digital Signal Processing Basics

Acoustic Signal
I had a hard time understanding the basics of digital signal processing. I think the reason was that sound is not visible. There is a way to "see" sound, though. This YouTube video, for example, demonstrates that special cameras make sound visible. Sound is the vibration of particles in the air. Something invisible surrounding us rapidly moves back and forth, and our ears can hear this movement. I am writing this post to cover the very basics of digital signal processing that I was very slow to understand. I will cover the concepts of Analogue-to-Digital conversion, sampling and the Nyquist frequency.
Analogue vs Digital
Apart from sound not usually being visible to us, I think that I initially confused the digital signal with the analogue signal. I was still new to the idea that computers represent everything in discrete numbers. Two important things: An analogue s...