Acoustic features for speech processing

Introduction

This post summarises acoustic features for various tasks of speech processing.

Automatic speech recognition (ASR) is one of the most studied speech processing tasks. Acoustic features for ASR include Mel-frequency cepstral coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms and Mel-filter banks. The choice of acoustic features depends on the choice of ASR model:
  • Traditional machine learning (ML) models such as Gaussian mixture models (GMMs) struggle with correlated features, so MFCCs are favoured for their de-correlated coefficients.
  • More recent deep learning based models, e.g., Conformer, use acoustic feature vectors whose neighbouring dimensions are correlated: filterbanks.
  • A popular model from OpenAI, Whisper, directly takes a log-Mel spectrogram as input, which is technically the same representation as filterbanks (explained later).
  • General purpose speech models including wav2vec2.0 process raw digital signal samples with convolutional layers.
MFCCs, filterbanks and log-Mel spectrograms are all derived from raw digital signal samples. The choice of feature depends on the model architecture, but understanding the characteristics of each is important.

My previous post describes the raw speech samples for anyone interested: https://yasufumimoriya.blogspot.com/2025/11/digital-signal-processing-basics.html

This post focuses on the process to generate Mel-spectrograms, Mel-filter banks and MFCCs from raw speech samples.

The Whole Picture

The steps to create MFCCs from raw speech:
  • Raw speech samples
    • wav2vec2.0 input
  • Pre-emphasis
  • Windowing
  • Discrete Fourier Transform (DFT)
  • Mel-filtering
  • Logarithm
    • filterbanks -> Conformer input
    • log-Mel spectrogram -> Whisper input
  • Discrete Cosine Transform (DCT)
  • MFCCs
    • Gaussian mixture model (GMM) input
The next sub-sections provide code and visualisation of each step using the audio file of the JSUT corpus: BASIC5000_0001.
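
Before going step by step, here is a minimal end-to-end sketch using torchaudio; the file path and parameter values are assumptions for illustration, not the exact settings of my notebook.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# load the JSUT utterance (path is an assumption; JSUT may need resampling to 16,000 Hz)
waveform, sample_rate = torchaudio.load("BASIC5000_0001.wav")

# Kaldi-style log-Mel filterbank features: 25 ms frames, 10 ms shift, 80 Mel bins
fbank = kaldi.fbank(waveform, sample_frequency=sample_rate, num_mel_bins=80)

# 13-dimensional MFCCs from the same waveform
mfcc = kaldi.mfcc(waveform, sample_frequency=sample_rate, num_ceps=13)

print(fbank.shape, mfcc.shape)  # (num_frames, 80) and (num_frames, 13)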

Pre-emphasis

Speech sounds naturally have higher energy at lower frequencies. This is because the vibration of our vocal folds (F0) generates most of the energy there. The amount of energy generally decreases from vowels to voiced consonants to voiceless consonants.

Imagine asking for help in an emergency: we shout "hEEEEEElp", not "HHHHHHHHelp". Our "h" sound will never be as loud as our "e" sound.

Pre-emphasis boosts the energy of a signal at higher frequencies and "flattens" the overall shape of the waveform.
The figure illustrates the effect of pre-emphasis. On the top sub-plot, the orange pre-emphasised waveform is flatter than the original waveform in blue.

The bottom sub-plot shows the power spectrum of 400 samples (25 milliseconds at 16,000 Hz) sliced from the signal. The spectrum of the pre-emphasised orange signal has higher energy at higher frequencies than the original signal without pre-emphasis.

This step helps a machine learning model treat the higher-frequency content as part of the human speech sound rather than as random noise captured in the audio file.
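
As a minimal sketch (not the exact code from my notebook), pre-emphasis is essentially a one-line filter; the coefficient 0.97 is a commonly used value.

import numpy as np

def pre_emphasis(signal: np.ndarray, coeff: float = 0.97) -> np.ndarray:
    # y[n] = x[n] - coeff * x[n-1]; the first sample is kept as-is
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

# emphasised = pre_emphasis(samples)  # samples: 1-D array of raw speech samples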

Windowing

The second step of acoustic feature generation is windowing. This process segments a signal into a sequence of fixed-length short-time units, called frames. The assumption behind this process is that a speech signal is stationary (stable) over a short time period. The Discrete Fourier Transform produces a cleaner spectrum when it analyses a short stretch of speech whose periodic cycles come from a single phone.

Let's assume the spoken word "help" lasts about 1 second. The frequency components of this single word alone derive from three consonants (fricative "h", liquid "l" and plosive "p") and one vowel "e". One can imagine that a spectrum computed over the whole word would be blurry, with high energy smeared across many frequency bins.

Windowing uses a tapering window function (one with less weight towards both edges). Tapering avoids cutting the signal off abruptly, which would otherwise produce noisy output of the Discrete Fourier Transform referred to as spectral leakage. An example of a Hamming window is illustrated in the top sub-plot.
The bottom sub-plot shows a short slice of the signal after the window above is applied. The amplitudes are lower towards the edges, demonstrating the effect of the tapering window.

A typical speech frame is 25 milliseconds long (400 samples for a 16,000 Hz signal) with a frame shift of 10 milliseconds (160 samples), i.e., consecutive frames overlap by 15 milliseconds.
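
A sketch of framing with a Hamming window, assuming the pre-emphasised 16,000 Hz samples from the previous step; 400 and 160 samples correspond to the 25 ms frame length and 10 ms shift mentioned above.

import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 400, frame_shift: int = 160) -> np.ndarray:
    # slice the signal into overlapping frames and apply a tapering Hamming window
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    return np.stack([
        signal[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(num_frames)
    ])  # shape: (num_frames, frame_len)

# frames = frame_signal(emphasised)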

Discrete Fourier Transform (DFT) and Spectrogram

The Discrete Fourier Transform (DFT) analyses the frequency components of each speech frame, resulting in a spectrum. When plotting the spectrum, the x-axis shows frequency bins, e.g., 0-100 Hz, 100-200 Hz, and so on.

The DFT is a complex process and can be a stand-alone topic by itself. I summarised more details of the DFT in this post: https://yasufumimoriya.blogspot.com/2026/01/discrete-fourier-transform.html.
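
A minimal sketch of this step, assuming the frames array from the windowing sketch above; the FFT size of 512 (the next power of two above 400) is an assumption.

import numpy as np

n_fft = 512
# one complex spectrum per frame; rfft keeps the 1 + n_fft // 2 = 257 non-negative frequency bins
spectrum = np.fft.rfft(frames, n=n_fft)
power_spec = np.abs(spectrum) ** 2   # power per frequency bin and frame
# stacking the per-frame spectra with frequency on the y-axis gives the spectrogram
spectrogram = power_spec.T           # shape: (257, num_frames)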

We can now create a spectrogram of the signal from the output of the DFT for each speech frame. A spectrogram consists of a sequence of spectra, one per speech frame, and visualises the strength of each frequency component per frame as a colour gradation. As mentioned in the Introduction, a (log-Mel) spectrogram is the input of the popular Whisper model.
The figure looks dark and wrong compared to the spectrograms we see on the librosa page or the PyTorch page.

There are two reasons:
  • Humans are more sensitive to changes at lower pitches (frequencies) than at higher pitches.
    • Equal spacing of frequency bins in the spectrogram does not reflect actual human perception of pitch.
  • The intensity of each frequency is expressed as the amplitude (the absolute value of the complex number produced by the DFT).
    • Raw amplitude values span a very large range, and most of the spectrogram appears dark because of this gap.
The next two sections apply the Mel-filter and the logarithm to this representation and finally produce acoustic features: log Mel-spectrogram and Mel-filter banks.

Mel-filter

The Mel-scale attempts to mimic human perception of pitch. The easiest way to understand this is to experience it.

We can hear the pitch difference between the 100 Hz and 105 Hz audio files below.



The audio files below are 2,000 Hz and 2,005 Hz. Probably, you won't hear the pitch difference between these audio files, despite the same 5 Hz difference.



Back to the Mel-scale: the top sub-plot compares the Mel-scale to the linear frequency scale. Up to around 1,000 Hz, the Mel-scale is approximately linear. Above 1,000 Hz it grows logarithmically, which mimics human pitch perception.
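
The exact conversion is not shown in the figure, but one widely used definition (the HTK convention) maps a frequency \(f\) in Hz to Mel as:
\[ \mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \]
librosa also implements a slightly different Slaney-style variant; both are (approximately) linear below 1,000 Hz and logarithmic above.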

The bottom sub-plot shows an example of 10 Mel-filters (a filter bank). Ten filters is just for illustration; usually 40 or 80 filters are used for generating the Mel-filter bank feature. As can be seen in the figure, the filters are narrow and densely spaced at lower frequencies and gradually become wider towards higher frequencies.

Now, we are able to convert the linear frequency scale to the Mel scale that is more aligned to human perception of the pitch changes. 

The top spectrogram is in the linear frequency scale and the bottom one in the Mel frequency scale, both representing the same signal. This comparison demonstrates that the Mel-scale has higher resolution at lower frequencies: the bright bands around 512 Hz and 1,000 Hz at 3 seconds appear thicker than in the linear-scale spectrogram.
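
A sketch of this conversion with librosa, assuming the power spectrogram from the DFT sketch earlier; 10 filters matches the figure above, while 40 or 80 are more common in practice.

import librosa

sr, n_fft = 16000, 512
# triangular Mel filters, shape: (n_mels, 1 + n_fft // 2)
mel_filters = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=10)
# project the linear-frequency power spectrogram onto the Mel scale
mel_spectrogram = mel_filters @ spectrogram   # shape: (n_mels, num_frames)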

Logarithm

The final missing element for approximating human hearing is the logarithm. Similar to pitch perception, human sensitivity to the loudness of sounds is non-linear.

The output of the DFT is complex-valued, and its amplitude (the absolute value of the complex number) is expressed as follows:
\[ |X[k]| = \sqrt{\mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2} \]

The power spectrum is a squared value of the amplitude: \( |X[k]|^2  \).

The Mel-spectrogram example on the librosa documentation page uses the decibel scale as the logarithm: https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html

The decibel expresses the power spectrum \( |X[k]|^2 \) (or the amplitude \( |X[k]| \)) on a log scale as a ratio relative to a reference value, commonly the maximum value of the spectrogram (more details of the decibel in this post).
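
In formula form (my notation; \(P_{ref}\) is the reference value):
\[ P_{\mathrm{dB}}[k] = 10 \log_{10}\!\left(\frac{|X[k]|^2}{P_{ref}}\right) \]
For the amplitude \( |X[k]| \) the factor is 20 instead of 10, because squaring doubles the exponent inside the logarithm.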

Now, we are ready to produce the log (decibel) Mel-spectrogram.
The top and middle plots are what we have seen so far. The bottom one shows a drastic change in the brightness of each frequency bin, because the values have been converted from raw amplitude to decibels.
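
A sketch of the full Mel-plus-decibel conversion with librosa; the load call, n_fft, hop_length and n_mels values are assumptions for illustration (the JSUT audio may need resampling to 16,000 Hz).

import numpy as np
import librosa

# load the utterance as a 1-D float array (path and target rate are assumptions)
y, sr = librosa.load("BASIC5000_0001.wav", sr=16000)

# Mel-scale power spectrogram, then conversion to decibels relative to the maximum
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=80)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)  # shape: (80, num_frames)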

Terminology around acoustic features is wobbly, and the same representation appears under many names:
  • filterbanks (Conformer)
  • fbank / mel-filterbanks (Kaldi)
  • log-Mel spectrogram (Whisper)
Generally, "spectrogram" is used in the context of audio visualisation, but mathematically all of the above are the same representation of the signal, i.e., the output of the DFT mapped to the Mel frequency scale with log-scaled values.

Implementation differences of log values

This section looks into how major libraries including Kaldi, librosa, PyTorch and Whisper implement their feature generation.

Kaldi

Kaldi is written in C++ but its functions are ported to torchaudio.
Generation of "fbank" features has a use_log_fbank option, and the values are converted with the natural log.
https://github.com/pytorch/audio/blob/main/src/torchaudio/compliance/kaldi.py#L630-L633 
mel_energies = torch.mm(spectrum, mel_energies.T)
if use_log_fbank:
    # avoid log of zero (which should be prevented anyway by dithering)
    mel_energies = torch.max(mel_energies, _get_epsilon(device, dtype)).log()

librosa

librosa converts the Mel-spectrogram to the decibel scale; the snippet below, taken from its MFCC function, applies power_to_db before the DCT.
https://github.com/librosa/librosa/blob/main/librosa/feature/spectral.py#L1991-L1998
if S is None:
    # multichannel behavior may be different due to relative noise floor differences between channels
    S = power_to_db(melspectrogram(y=y, sr=sr, norm = mel_norm, **kwargs))

fft = get_fftlib()
M: np.ndarray = fft.dct(S, axis=-2, type=dct_type, norm=norm)[
    ..., :n_mfcc, :
]

torchaudio

Native torchaudio applies either the natural log or the decibel conversion to the Mel-spectrogram, depending on the log_mels flag.
https://github.com/faroit/torchaudio/blob/master/torchaudio/transforms.py#L501-L506 
mel_specgram = self.MelSpectrogram(waveform)
if self.log_mels:
    log_offset = 1e-6
    mel_specgram = torch.log(mel_specgram + log_offset)
else:
    mel_specgram = self.amplitude_to_DB(mel_specgram)

Whisper

Whisper takes the log base 10 of the power Mel-scale spectrum and applies additional normalisation to input features. 
https://github.com/openai/whisper/blob/main/whisper/audio.py#L110-L157
log_spec = torch.clamp(mel_spec, min=1e-10).log10()
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0

Discrete Cosine Transform (DCT)

The Discrete Cosine Transform (DCT) is the final step to create MFCC features. The DCT decorrelates the log Mel-spectrogram / filterbank features and this is helpful for classical machine learning models such as the Gaussian mixture models (GMMs). 

Suppose a single frame of the log Mel spectrogram consists of 80 Mel frequency bands: [m1, m2, m3, ..., m80]. Neighbouring bins like m1 and m2 commonly have similar values. In other words, the frequency bins of 100-110 Hz and 110-120 Hz have more similar values than the bins of 100-110 Hz and 2,000-2,010 Hz. This leads to correlation, which classical ML models struggle with.
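
A quick way to check this correlation numerically, assuming the 80-band log_mel_spec array from the librosa sketch earlier:

import numpy as np

# correlation of band 0 with its direct neighbour vs. with a distant band, across frames
near = np.corrcoef(log_mel_spec[0], log_mel_spec[1])[0, 1]
far = np.corrcoef(log_mel_spec[0], log_mel_spec[40])[0, 1]
print(near, far)  # the neighbouring band is typically much more strongly correlated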

While the DFT is applied to digital signal samples within a speech frame, the DCT takes the log power or amplitude values of Mel-filters per frame. The input size of the DCT therefore corresponds to the number of Mel-filters.

The DCT is similar to the DFT but correlates the signal with cosine functions only (real values) instead of both cosine and sine functions (complex values). The formula of the DCT (the DCT-II variant) is as follows: \[ c_k = \sum_{n=0}^{N-1} m_n \cos\!\left(\frac{\pi k (2n+1)}{2N}\right) \]
where \( N \) is the total number of Mel-filter bins, \(m_n\) is the log power value of the nth Mel-filter and \(c_k\) is the value of the kth DCT coefficient.

For the first coefficient, \(k = 0\), so \(\cos(0)=1\) and: \[c_0 = \sum_{n=0}^{N-1} m_n \]
So the 0th coefficient is the sum of the log Mel energies, i.e., the overall energy of the speech frame.

The first 13 coefficients typically form the MFCC feature of each frame. This is because the lower-order coefficients carry information useful for identifying phones, while the higher-order coefficients can contain more speaker-specific information and are often discarded for ASR.

At this point, however, no information between speech frames is included, because the DCT takes the log Mel-filter values of each frame independently. It is common to compute differences across neighbouring frames (delta) and differences of the delta values (delta-delta), which together with the original coefficients form MFCC features of 39 dimensions per speech frame, as sketched below.
The figure shows MFCC features consisting of 13 coefficients per frame. This is quite uninterpretable to the human eye.
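
As a sketch of these last two steps, assuming the 80-band log_mel_spec array from the librosa sketch earlier (librosa.feature.mfcc wraps the same DCT internally, as the snippet in the librosa section shows):

import numpy as np
import librosa
import scipy.fftpack

# DCT-II along the Mel-band axis (orthonormal scaling); keep the first 13 coefficients per frame
mfcc = scipy.fftpack.dct(log_mel_spec, axis=0, type=2, norm="ortho")[:13]

# delta and delta-delta describe how coefficients change across neighbouring frames
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.concatenate([mfcc, delta, delta2], axis=0)  # shape: (39, num_frames)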

Summary

This post turned out to be a long one! After all, it's always best to play with code that actually modifies signals when learning and understanding these concepts. The Python notebook I used to create this post is available at https://github.com/yasumori/blog/blob/main/2026/2026_01_acoustic_features.ipynb.

To summarise, a speech signal goes through several steps to be transformed into acoustic features. The choice of acoustic features depends on the type of machine learning model used to learn from human speech.

Briefly, the steps to create Mel-spectrogram, Mel-filter bank and MFCC features are:
  • Pre-emphasis and windowing: pre-processing and segmentation of a speech signal into short time overlapping frames
  • DFT: conversion from the time domain to the frequency domain
  • Mel-filtering and logarithms: apply human perception of pitch changes and sound intensity to output of the DFT
  • DCT: conversion from the Mel-frequency domain to the cepstrum domain for de-correlation (a necessary step for ML models including GMMs).
