Posts

Showing posts from February, 2026

Basic Probability

Introduction Probability theory is the foundation of machine learning. Knowledge of machine learning is a requirement for working on a speech and language processing project today, so probability theory is essential for speech and language processing projects! The objective of this post is to refresh my knowledge of probability theory. I am keen to connect probability theory with real-world examples and to avoid throwing out a bunch of theoretical definitions. Feel free to leave comments if my writing is incorrect. Probability Probability is the chance of an event occurring, expressed as a value between 0 and 1. In contrast, human words are not mathematical: even if I say "I'll go to Paris next year, 100%", I might not go to Paris next year. Theoretical and mathematical probability has to be precise, unlike human words. A ...
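The point that a probability is a value between 0 and 1 can be illustrated with a tiny simulation (a minimal sketch, not from the post: a fair coin whose empirical frequency of heads converges to the theoretical probability 0.5):

```python
import random

# Estimate P(heads) for a fair coin by simulation.
# The empirical frequency is always between 0 and 1 and approaches
# the theoretical probability 0.5 as the number of trials grows.
random.seed(0)  # fixed seed so the run is reproducible
trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(trials))
p_heads = heads / trials

print(f"empirical P(heads) = {p_heads:.3f}")  # close to 0.5
assert 0.0 <= p_heads <= 1.0
```

With 100,000 trials the estimate is typically within about 0.005 of 0.5, which is the sense in which the mathematical probability is precise while everyday "100%" claims are not.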

Acoustic features for speech processing

Introduction This post summarises acoustic features for various tasks of speech processing. Automatic speech recognition (ASR) is one of the most studied speech processing tasks. Acoustic features for ASR include Mel-frequency cepstrum coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms and Mel-filter banks. The choice of acoustic features depends on the choice of ASR model: traditional machine learning (ML) models such as Gaussian mixture models (GMMs) have difficulty handling correlated features, so MFCCs are favoured for their de-correlated coefficients. More recent deep learning based models, e.g., Conformer, use acoustic feature vectors with correlation between neighbouring dimensions: filterbanks. A popular model from OpenAI, Whisper , directly takes as input a log-Mel spectrogram, which is technically the same representation as filterbanks (will be explained later). ...
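The relationship between the two feature types mentioned above can be sketched from scratch: log-Mel filterbank energies come from applying triangular mel-scale filters to a power spectrum, and MFCCs are the DCT of those log energies, which is what de-correlates them. A minimal NumPy sketch (the 512-point FFT, 40 filters, and 13 coefficients are illustrative choices, not values from the post):

```python
import numpy as np

def hz_to_mel(f):
    # the mel scale maps Hz onto a perceptually even pitch scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # triangular filters with centres spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):          # rising slope
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):         # falling slope
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def log_mel_and_mfcc(power_spec, sr, n_fft, n_filters=40, n_mfcc=13):
    # log-Mel filterbank energies: mel filters applied to the power spectrum
    fbank = mel_filterbank(n_filters, n_fft, sr)
    log_mel = np.log(fbank @ power_spec + 1e-10)
    # MFCCs: DCT-II of the log energies, de-correlating the dimensions
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n + 0.5)[None, :]
                 * np.arange(n_mfcc)[:, None])
    return log_mel, dct @ log_mel

# toy power spectrum for a single 512-point frame at 16 kHz
rng = np.random.default_rng(0)
spec = rng.random(512 // 2 + 1)
log_mel, mfcc = log_mel_and_mfcc(spec, sr=16000, n_fft=512)
print(log_mel.shape, mfcc.shape)  # (40,) (13,)
```

In practice a library such as librosa computes these features for you; the sketch is only meant to show why filterbanks keep neighbouring dimensions correlated while the DCT step in MFCCs removes that correlation.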

Visualising a speech signal

Speech Visualisation This post covers visualisation of a speech signal: plotting a waveform, annotating a waveform, and showing speech spectra. I am using the first speech file (BASIC5000_0001) of the JSUT corpus, which consists of 10 hours of recordings of a Japanese female speaker. JSUT ver 1.1 BASIC5000_0001 My code is all written in this Python notebook: https://github.com/yasumori/blog/blob/main/2026/2026_01_visualisation.ipynb . You should be able to run it after installing the required libraries: librosa, matplotlib, and numpy. The first speech file is also uploaded to my GitHub, following the terms of use: "Re-distribution is not permitted, but you can upload a part of this corpus (e.g., ~100 audio files) in your website or blog".

import librosa
import subprocess

# load audio in 16kHz
signal, sr = librosa.load("./data/BASIC5000_0001.wav", sr=16000)
print(f"number of samples: {len(signal)}")
print(f"duration {len(signal)/sr} ...
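The excerpt is cut off, but the relation the snippet prints is simple: duration in seconds equals the number of samples divided by the sampling rate. A self-contained stand-in (using a synthetic 440 Hz sine instead of the JSUT wav, which isn't included here, and an assumed 2-second length):

```python
import numpy as np

sr = 16000            # sampling rate used in the post (16 kHz)
duration_s = 2.0      # assumed length for this synthetic example
t = np.arange(int(sr * duration_s)) / sr
signal = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # 440 Hz sine wave

print(f"number of samples: {len(signal)}")     # 32000
print(f"duration {len(signal) / sr} seconds")  # 2.0
```

With the real file, `librosa.load(..., sr=16000)` resamples the audio to 16 kHz on load, so the same samples/sr arithmetic gives the clip's duration.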