Acoustic features for speech processing
Introduction
This post summarises acoustic features for various tasks of speech processing.
Automatic speech recognition (ASR) is one of the most studied speech
processing tasks. Acoustic features for ASR include Mel-frequency cepstral
coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms
and Mel-filter banks. The choice of acoustic features depends on the choice of ASR model:
- Traditional machine learning (ML) models such as Gaussian mixture models (GMMs) have difficulty handling correlated features, so the de-correlated coefficients of MFCCs are favoured.
- More recent deep learning based models, e.g. the Conformer, use acoustic feature vectors whose neighbouring dimensions are correlated: filterbanks.
- A popular model from OpenAI, Whisper, directly takes as input a log-Mel spectrogram, which is technically the same representation as filterbanks (explained later).
- General purpose speech models including wav2vec2.0 process raw digital signal samples with convolutional layers.
MFCCs, filterbanks and log-Mel spectrograms are all derived from raw digital
signal samples. The choice of feature depends on the model architecture, but
understanding the characteristics of each is important.
My previous post describes the raw speech samples for anyone interested: https://yasufumimoriya.blogspot.com/2025/11/digital-signal-processing-basics.html
This post focuses on the process to generate Mel-spectrograms, Mel-filter banks and MFCCs from raw speech samples.
The Whole Picture
The steps to create MFCCs from raw speech:
- Raw speech samples -> wav2vec2.0 input
- Pre-emphasis
- Windowing
- Discrete Fourier Transform (DFT)
- Mel-filtering
- Logarithm -> filterbanks (Conformer input) / Mel-spectrogram (Whisper input)
- Discrete Cosine Transform (DCT)
- MFCCs -> Gaussian mixture model (GMM) input
The next sub-sections provide code and visualisation of each step using an
audio file from the JSUT corpus: BASIC5000_0001.
Pre-emphasis
Speech sounds naturally have higher energy at lower frequencies. This is
because the vibration of our vocal folds (F0) generates more energy there. The amount of
energy is generally in descending order of vowels, voiced consonants and
voiceless consonants.
Imagine asking for help in an emergency: we shout "hEEEEEElp", not
"HHHHHHHHelp". Our "h" sound will never be as loud as our "e" sound.
Pre-emphasis boosts the energy of a signal at higher frequencies and "flattens" the
overall shape of the waveform.
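A minimal sketch of pre-emphasis, assuming the JSUT file is available locally as a 16 kHz mono WAV (the file path and the coefficient 0.97 are just common choices, not the only possible ones):
import numpy as np
import soundfile as sf  # any loader that returns float samples works

signal, sr = sf.read("BASIC5000_0001.wav")  # hypothetical local path

# y[n] = x[n] - alpha * x[n-1], with alpha typically around 0.97
alpha = 0.97
pre_emphasised = np.append(signal[0], signal[1:] - alpha * signal[:-1])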
The figure illustrates the effect of pre-emphasis. On the top sub-plot, the
orange pre-emphasised waveform is flatter than the original waveform in blue.
The bottom sub-plot shows the power spectrum of 400 samples (25 milliseconds
at 16,000 Hz) of a sliced signal. The spectrum of the pre-emphasised orange
signal has higher energy at higher frequencies than the original signal
without pre-emphasis.
This step helps a machine learning model treat energy at higher
frequencies as human speech sounds rather than as random noise captured in the
audio file.
Windowing
The second step of acoustic feature generation is windowing. This process
segments a signal into a sequence of fixed-length short-time units,
called frames. The assumption behind this process is that a speech
signal is stationary (stable) over a short time period. The resolution of the
Discrete Fourier Transform improves by focusing on a short segment of speech
with periodic cycles from a single phone.
Let's assume the duration of the spoken word "help" is about 1 second. The
frequency components of this single word alone could derive from three
consonants (fricative "h", liquid "l" and plosive "p") and one vowel "e". One
can imagine the resulting spectrum would be blurry, with high energy spread
across many frequency bins.
Windowing uses a tapering window function (less weight towards both edges).
The tapering window avoids abruptly cutting off the signal, which would
result in noisy output of the Discrete Fourier Transform, referred to as
spectral leakage. An example of a Hamming window is illustrated in the top
sub-plot.
The bottom sub-plot shows a short window of the signal to which the above
windowing is applied. This short signal has lower amplitudes towards the edges,
demonstrating the impact of the tapering window.
A typical length of a speech frame is 25 milliseconds (400 samples for 16,000
Hz signal) with overlap of 10 milliseconds (160 samples for 16,000 Hz
signal).
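A minimal framing and windowing sketch under those settings, assuming the pre_emphasised signal and sampling rate sr from the previous snippet:
frame_len = int(0.025 * sr)  # 400 samples at 16,000 Hz
hop_len = int(0.010 * sr)    # 160 samples at 16,000 Hz

num_frames = 1 + (len(pre_emphasised) - frame_len) // hop_len
indices = np.arange(frame_len)[None, :] + hop_len * np.arange(num_frames)[:, None]
frames = pre_emphasised[indices]  # shape: (num_frames, frame_len)

# taper both edges of each frame to reduce spectral leakage
frames = frames * np.hamming(frame_len)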
Discrete Fourier Transform (DFT) and Spectrogram
The Discrete Fourier Transform (DFT) analyses frequency components of each
speech frame, resulting in a spectrum. When plotting the spectrum, the
x-axis shows frequency bins, e.g. 0-100 Hz, 100-200 Hz, and so on.
The DFT is a complex process and can be a stand-alone topic by itself. I
summarised more details of the DFT in this post: https://yasufumimoriya.blogspot.com/2026/01/discrete-fourier-transform.html.
We can now create a spectrogram of a signal from the output of the DFT per speech
frame. A spectrogram consists of a sequence of spectra, one per
speech frame, and visualises the strength of each frequency component per frame
in colour gradation. As mentioned in the Introduction, a spectrogram is the input
to the popular Whisper model.
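A minimal sketch of the per-frame DFT and the resulting power spectrogram, reusing the frames array from the windowing snippet (the FFT size of 512 is an assumption):
n_fft = 512
spectrum = np.fft.rfft(frames, n=n_fft)    # complex values, shape: (num_frames, n_fft // 2 + 1)
power_spectrogram = np.abs(spectrum) ** 2  # squared amplitude per frequency bin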
The figure looks dark and wrong compared to the spectrograms we see on
the librosa page
or on
the PyTorch page.
Two things explain this:
- Frequency axis: humans are more sensitive to changes in lower pitches (frequencies) than in higher ones, so equal spacing of frequency bins in the spectrogram does not represent the actual human perception of pitch.
- Intensity values: the intensity of frequencies is the amplitude (the absolute values of the complex numbers derived from the DFT), and the raw amplitude has such a large gap between values that most of the spectrogram looks dark.
The next two sections apply the Mel-filter and the logarithm to this
representation and finally produce acoustic features: log Mel-spectrogram
and Mel-filter banks.
Mel-filter
The Mel-scale attempts to mimic human perception of pitch.
Experiencing it is the easiest way to understand this.
We can hear the pitch difference between the 100 Hz and 105 Hz audio files
below.
The next audio files are 2,000 Hz and 2,005 Hz. You probably won't
hear the pitch difference between them, despite the same 5
Hz difference.
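One common formulation of the Hz-to-Mel conversion is the HTK-style formula below; as a sketch, it also shows numerically that the same 5 Hz gap is larger in Mel terms at 100 Hz than at 2,000 Hz (librosa additionally offers a slightly different Slaney-style variant):
def hz_to_mel(f_hz):
    # approximately linear below ~1,000 Hz, logarithmic above
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# the same 5 Hz gap maps to a larger Mel gap at 100 Hz than at 2,000 Hz
print(hz_to_mel(105) - hz_to_mel(100))
print(hz_to_mel(2005) - hz_to_mel(2000))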
Back to the Mel-scale, the top sub-plot compares the Mel-scale to the
linear frequency scale. Up to around 1,000 Hz, the Mel-scale is
approximately linear. The Mel-scale starts to grow logarithmically above
1,000 Hz, and this mimics human pitch perception.
The bottom sub-plot shows an example of 10 Mel-filters (filter banks). Having
10 filters is just for visual convenience; usually 40 filters are used for
generating the Mel-filter bank feature. As can be seen in the figure,
more filters are packed into the lower frequencies and the filters gradually
get wider towards higher frequencies.
Now, we are able to convert the linear frequency scale to the Mel scale,
which is more aligned with human perception of pitch changes.
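A minimal sketch of building triangular Mel filters with librosa and applying them to the power spectrogram from the earlier snippet (40 filters here, matching the usual setting mentioned above):
import librosa

n_mels = 40
mel_filters = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # shape: (n_mels, n_fft // 2 + 1)

# each Mel band is a weighted sum of the linear-frequency power bins
mel_power = power_spectrogram @ mel_filters.T  # shape: (num_frames, n_mels)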
The top spectrogram is in the linear frequency scale and the bottom one in the Mel frequency scale, both representing the same signal. The comparison demonstrates that the Mel-scale has higher resolution at lower frequencies: the bright bands at approximately 512 Hz and 1,000 Hz around 3 seconds appear thicker than the corresponding bands in the linear-scale spectrogram.
Logarithm
The output of the DFT is complex-valued and its amplitude (the absolute value of
the complex values) is expressed as follows:
\[ |X[k]| = \sqrt{\mathrm{Re}(X[k])^2 + \mathrm{Im}(X[k])^2} \]
The power spectrum is a squared value of the amplitude: \( |X[k]|^2
\).
The Mel-spectrogram generation on the
librosa page uses the decibel
scale as the logarithm: https://librosa.org/doc/main/generated/librosa.feature.melspectrogram.html.
The decibel expresses the power spectrum \( |X[k]|^2 \) or
amplitude \( |X[k]| \) as a ratio over a reference, here the maximum value of the normalised
digital signal, in the log-scale (more details of the decibel in this post).
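As a sketch, a decibel conversion that uses the maximum value as the reference looks roughly like this; librosa.power_to_db(S, ref=np.max) behaves similarly, with additional options such as a noise floor:
def power_to_db_sketch(power, eps=1e-10):
    # 10 * log10(power / reference), using the maximum as the reference
    return 10.0 * np.log10(np.maximum(power, eps) / power.max())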
Now, we are ready to produce the log (decibel) Mel-spectrogram.
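One way to do this end to end is with librosa; the parameter values below simply mirror the earlier snippets (25 ms window, 10 ms hop, 40 Mel bands) and are only one possible configuration:
mel_spec = librosa.feature.melspectrogram(
    y=signal, sr=sr, n_fft=512, win_length=400, hop_length=160, n_mels=40
)
log_mel_spec = librosa.power_to_db(mel_spec, ref=np.max)  # shape: (n_mels, num_frames)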
Terminology of acoustic features is wobbly and there appear to be many types
of acoustic features:
- filterbanks (Conformer)
- fbank / mel-filterbanks (Kaldi)
- log-Mel spectrogram (Whisper)
Generally, "spectrogram" is used in the context of audio visualisation, but
mathematically all of the above are the same representation of the signal,
i.e. the output of the DFT mapped to the Mel frequency scale with log-scaled values.
Implementation differences of log values
This section looks into how major libraries including Kaldi, librosa,
PyTorch and Whisper implement their feature generation.
Kaldi
Kaldi is written in C++ but its functions are ported to torchaudio.
Generation of "fbank" features has a
use_log_fbank option, and values
are converted with the natural log.
https://github.com/pytorch/audio/blob/main/src/torchaudio/compliance/kaldi.py#L630-L633
mel_energies = torch.mm(spectrum, mel_energies.T)
if use_log_fbank:
    # avoid log of zero (which should be prevented anyway by dithering)
    mel_energies = torch.max(mel_energies, _get_epsilon(device, dtype)).log()
librosa
When computing MFCCs, librosa converts the Mel-spectrogram to the decibel scale.
https://github.com/librosa/librosa/blob/main/librosa/feature/spectral.py#L1991-L1998
if S is None:
    # multichannel behavior may be different due to relative noise floor differences between channels
    S = power_to_db(melspectrogram(y=y, sr=sr, norm=mel_norm, **kwargs))

fft = get_fftlib()
M: np.ndarray = fft.dct(S, axis=-2, type=dct_type, norm=norm)[..., :n_mfcc, :]
torchaudio
The native torchaudio takes either the natural log or the decibel of the
Mel-spectrogram, depending on the log_mels flag.
https://github.com/faroit/torchaudio/blob/master/torchaudio/transforms.py#L501-L506
mel_specgram = self.MelSpectrogram(waveform)
if self.log_mels:
    log_offset = 1e-6
    mel_specgram = torch.log(mel_specgram + log_offset)
else:
    mel_specgram = self.amplitude_to_DB(mel_specgram)
Whisper
Whisper takes the log base 10 of the power Mel-scale spectrum and
applies additional normalisation to input features.
https://github.com/openai/whisper/blob/main/whisper/audio.py#L110-L157
log_spec = torch.clamp(mel_spec, min=1e-10).log10()
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)
log_spec = (log_spec + 4.0) / 4.0
Discrete Cosine Transform (DCT)
The Discrete Cosine Transform (DCT) is the final step to create MFCC features. The DCT de-correlates the log Mel-spectrogram / filterbank features, which is helpful for classical machine learning models such as Gaussian mixture models (GMMs).
Suppose a single frame of a log Mel-spectrogram consists of 80 Mel frequency bands: [m1, m2, m3, ..., m80]. Commonly, neighbouring bands like m1 and m2 have similar values. In other words, the frequency bins of 100-110 Hz and 110-120 Hz have more similar values than the frequency bins of 100-110 Hz and 2,000-2,010 Hz. This leads to correlation, which classical ML models struggle with.
While the DFT is applied to digital signal samples within a speech frame, the DCT takes the log power or amplitude values of Mel-filters per frame. The input size of the DCT therefore corresponds to the number of Mel-filters.
The DCT is similar to the DFT but uses only cosine functions (real-valued) as basis functions, instead of both cosine and sine (complex-valued) functions. The formula of the DCT is as follows: \[ c_k = \sum_{n=0}^{N-1} m_n \cos\!\left(\frac{\pi k (2n+1)}{2N}\right) \]
where \( N \) is the total number of Mel-filter bins, \(m_n\) the log power value of the nth Mel-filter and \(c_k\) the value of the kth DCT coefficient.
For the first coefficient, \(k = 0\), so \(\cos(0)=1\) and: \[c_0 = \sum_{n=0}^{N-1} m_n \]
So the 0th coefficient represents the total amount of energy in a speech frame.
The first 13 coefficients typically form the MFCC feature for each frame. This is because the lower-order coefficients carry information related to the identification of phones, while the higher-order coefficients can contain more speaker-specific information and are often discarded for the task of ASR.
At this point, however, no information between speech frames is included, because the DCT takes the log Mel-filter values of each frame independently. It is common to append the differences between neighbouring frames (delta) and the differences of those differences (delta-delta), which forms the MFCC features of 39 dimensions per speech frame (13 static + 13 delta + 13 delta-delta).
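A minimal sketch of the DCT and delta computation, using scipy's DCT and librosa's delta helper on the Mel energies from the earlier snippet (the natural log here is just one of the choices discussed above):
from scipy.fftpack import dct

log_mel = np.log(np.maximum(mel_power, 1e-10))              # shape: (num_frames, n_mels)
mfcc = dct(log_mel, type=2, axis=-1, norm="ortho")[:, :13]  # keep the first 13 coefficients

# local estimates of how the coefficients change across frames
delta = librosa.feature.delta(mfcc, axis=0)
delta2 = librosa.feature.delta(mfcc, order=2, axis=0)
mfcc_39 = np.concatenate([mfcc, delta, delta2], axis=-1)    # shape: (num_frames, 39)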
The figure shows MFCC features consisting of 13 coefficients per frame. This is quite uninterpretable.
Summary
This post turned out to be a long one! After all, it's always best
to play with actual signal-modifying code to learn and understand these
concepts. The Python notebook I used to create this post is available at
https://github.com/yasumori/blog/blob/main/2026/2026_01_acoustic_features.ipynb.
To summarise, a speech signal goes through several steps to be transformed
into acoustic features, and the choice of acoustic features depends on the type of
machine learning model used to learn human speech.
Briefly the steps to create Mel-spectrogram and Mel-filter bank features are:
- Pre-emphasis and windowing: pre-processing and segmentation of a speech signal into short time overlapping frames
- DFT: conversion from the time domain to the frequency domain
- Mel-filtering and logarithms: apply human perception of pitch changes and sound intensity to output of the DFT
- DCT: conversion from the Mel-frequency domain to the cepstral domain for de-correlation (a necessary step for classical ML models such as GMMs).