Visualising a speech signal
This post covers visualising a speech signal: plotting a waveform,
annotating it with phone labels and showing speech spectra.
I am using the first speech file (BASIC5000_0001) of
the JSUT corpus, which consists of 10 hours of recordings of a Japanese female speaker.
JSUT ver 1.1 BASIC5000_0001
All the code in this post is available in this Python notebook:
https://github.com/yasumori/blog/blob/main/2026/2026_01_visualisation.ipynb. You should be able to run it after
installing the required libraries: librosa, matplotlib, and numpy. The first
speech file is also uploaded to my GitHub repository, in line with the corpus terms of use:
"Re-distribution is not permitted, but you can upload a part of this corpus
(e.g., ~100 audio files) in your website or blog".
import librosa
import subprocess
# load audio in 16kHz
signal, sr = librosa.load("./data/BASIC5000_0001.wav", sr=16000)
print(f"number of samples: {len(signal)}")
print(f"duration {len(signal)/sr} seconds")
# run Soxi to verify duration
soxi_out = subprocess.run(
    ["soxi", "-D", "./data/BASIC5000_0001.wav"],
    capture_output=True,
    text=True,
    encoding="utf-8",
)
soxi_duration = soxi_out.stdout.rstrip("\n")
print(f"duration from soxi {soxi_duration} seconds")
print(f"min value {min(signal)}")
print(f"max value {max(signal)}")
Output
number of samples: 51040
duration 3.19 seconds
duration from soxi 3.190000 seconds
min value -0.36245694756507874
max value 0.30467501282691956
The number of samples is 51,040 and the sampling rate of this signal is
16,000 Hz, which means there are 16,000 samples per second.
51040 samples / 16000 samples per second = 3.19 seconds
This sounds about right.
The code block also runs soxi in a subprocess to compute the duration, and the
soxi duration agrees with the Python one.
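As one more cross-check (my own addition, not in the original notebook), librosa can also derive the duration directly from the loaded samples:
# duration = number of samples / sampling rate, computed by librosa
print(librosa.get_duration(y=signal, sr=sr))  # expected: 3.19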
The minimum value of the signal is -0.36 and the maximum value is 0.30. Audio
signals are often normalised to the range -1.0 to 1.0, so this looks good too.
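librosa.load typically returns floating-point samples already in this range, but if a signal is not, a common trick is peak normalisation, i.e. dividing by the largest absolute sample. A minimal sketch, purely for illustration:
import numpy as np

def peak_normalise(signal):
    # scale the signal so that its largest absolute sample becomes 1.0
    peak = np.max(np.abs(signal))
    return signal / peak if peak > 0 else signal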
Waveform and annotation
librosa and matplotlib make it very easy to visualise a waveform.
import librosa
import librosa.display
import matplotlib.pyplot as plt
sr = 16000
# Load the audio file
signal, sr = librosa.load("./data/BASIC5000_0001.wav", sr=sr)
plt.figure(figsize=(12, 3))
librosa.display.waveshow(signal, sr=sr, color="steelblue", alpha=0.8)
plt.title("waveform BASIC5000_0001.wav")
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
The visualised waveform confirms that the information about the signal in the
previous section is correct. The signal is about 3.2 seconds long, and the
maximum and minimum values of the signal amplitude are about 0.3 and -0.3,
respectively.
The BASIC5000 set also has corresponding phone labels in the HTK (Hidden
Markov Model Toolkit) format: https://github.com/sarulab-speech/jsut-label.
lab = []
with open("./data/BASIC5000_0001.lab") as ifile:
    for line in ifile:
        line = line.rstrip("\n")
        lab.append(line)

for l in lab[:5]:
    print(l)
Output
0 3125000 xx^xx-sil+m=i/A:xx+xx+xx/B:xx-xx_xx/C:xx_xx+xx/D:02+xx_xx/E:xx_xx!xx_xx-xx/F:xx_xx#xx_xx@xx_xx|xx_xx/G:3_3%0_xx_xx/H:xx_xx/I:xx-xx@xx+xx&xx-xx|xx+xx/J:5_23/K:1+5-23
3125000 3525000 xx^sil-m+i=z/A:-2+1+3/B:xx-xx_xx/C:02_xx+xx/D:13+xx_xx/E:xx_xx!xx_xx-xx/F:3_3#0_xx@1_5|1_23/G:7_2%0_xx_1/H:xx_xx/I:5-23@1+1&1-5|1+23/J:xx_xx/K:1+5-23
3525000 4325000 sil^m-i+z=u/A:-2+1+3/B:xx-xx_xx/C:02_xx+xx/D:13+xx_xx/E:xx_xx!xx_xx-xx/F:3_3#0_xx@1_5|1_23/G:7_2%0_xx_1/H:xx_xx/I:5-23@1+1&1-5|1+23/J:xx_xx/K:1+5-23
4325000 5225000 m^i-z+u=o/A:-1+2+2/B:xx-xx_xx/C:02_xx+xx/D:13+xx_xx/E:xx_xx!xx_xx-xx/F:3_3#0_xx@1_5|1_23/G:7_2%0_xx_1/H:xx_xx/I:5-23@1+1&1-5|1+23/J:xx_xx/K:1+5-23
5225000 5525000 i^z-u+o=m/A:-1+2+2/B:xx-xx_xx/C:02_xx+xx/D:13+xx_xx/E:xx_xx!xx_xx-xx/F:3_3#0_xx@1_5|1_23/G:7_2%0_xx_1/H:xx_xx/I:5-23@1+1&1-5|1+23/J:xx_xx/K:1+5-23
I am interested in two things: the start and end timestamps of each label, and
the phone itself.
The following lines of Python code parse the labels and extract the information
I need.
def extract_phone_labels(lab_data):
    extracted = []
    for lab in lab_data:
        stime, etime, ph_label = lab.split()
        ph_ctxt = ph_label.split("/", 1)[0]
        ph = ph_ctxt.split("-")[1].split("+")[0]
        # 100ns -> second
        start = int(stime) * 1e-7
        end = int(etime) * 1e-7
        extracted.append((start, end, ph))
    return extracted

extracted = extract_phone_labels(lab)
for s, e, ph in extracted[:5]:
    print(s, e, ph)
Output
0.0 0.3125 sil
0.3125 0.3525 m
0.3525 0.4325 i
0.4325 0.5225 z
0.5225 0.5525 u
These labels look ready to use for waveform annotation.
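Before annotating, a quick sanity check that I added myself: the end time of the last label should be close to the 3.19-second duration measured earlier.
# compare the end time of the final phone label with the signal duration
print(f"last label ends at {extracted[-1][1]} seconds")
print(f"signal duration is {len(signal) / sr} seconds")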
The Python code to annotate the waveform is as follows:
plt.figure(figsize=(12, 3))
librosa.display.waveshow(signal, sr=sr, color="steelblue", alpha=0.8)
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.title("Waveform with All Phoneme Labels")
# Draw faint boundaries instead of full shaded spans
for start, end, label in extracted:
    plt.axvline(start, color="orange", alpha=0.2, linewidth=0.8)
    plt.axvline(end, color="orange", alpha=0.1, linewidth=0.5)
    plt.text(
        (start + end) / 2,
        0.9 * max(signal),
        label,
        ha="center",
        va="bottom",
        fontsize=9,
        color="black",
        rotation=90,  # vertical text saves space
    )
plt.tight_layout()
plt.show()
Does the audio actually sound like the annotated phone labels? Let's listen:
from IPython.display import Audio, display
display(Audio(signal, rate=sr))
Slicing the signal and visualising the spectrum
This post focuses on visualisation of the waveform, so I don't go into the
details of phonetics and phonology (I might write a separate post on that in
the future).
Slicing a waveform into short signals enables us to analyse particular
phones (e.g., consonants and vowels). This is possible thanks to the presence
of the phone labels and timestamps, which locate where particular phones start
and end.
Example code to perform the slicing is as follows:
# slice the signal into periods of each label
def slice_signal(signal, extracted):
    sliced = []
    for start, end, label in extracted:
        start_idx = int(start * sr)
        end_idx = int(end * sr)
        sliced.append((label, start, end, signal[start_idx:end_idx]))
    return sliced

sliced = slice_signal(signal, extracted)
The variable sliced holds, for each label, a tuple of four items: the label,
start time, end time and the sliced signal. The following lines of code then
display the spectrum of the slice at a specific index, computed with the
Discrete Fourier Transform (DFT).
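It can also be handy to listen to an individual slice with the same IPython Audio widget as before; a small sketch (very short consonant slices may be hard to hear):
from IPython.display import Audio, display

label, start, end, seg = sliced[2]  # the vowel [i] in this recording
print(label, start, end)
display(Audio(seg, rate=sr))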
The first label is "sil" (silence), which is not interesting to analyse.
The second label is "m" (a bilabial nasal consonant), and the code to display the spectrum of a sliced waveform is as follows:
# Compute the spectrum (time -> frequency dimension)
import numpy as np

def wav_to_spectrum(signal, sr):
    spectrum = np.fft.rfft(signal)
    magnitude = np.abs(spectrum)
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    spectrum = librosa.amplitude_to_db(magnitude, ref=np.max)
    return freqs, spectrum

def display_spectrum(sliced, idx):
    label, start, end, signal = sliced[idx]
    freqs, spectrum = wav_to_spectrum(signal, sr)
    plt.plot(freqs, spectrum)
    plt.title(f"Frequency Spectrum of Slice {label} {start} - {end}")
    plt.xlabel("Frequency (Hz)")
    plt.ylabel("Decibel")
    plt.grid(True)
    plt.tight_layout()
    plt.show()
display_spectrum(sliced, 1)
The DFT output of the first non-silence phone [m] is shown in the figure. The x-axis shows the frequency components of this sliced signal and the y-axis shows the energy level of each frequency component in decibels.
For interested readers, a separate post explains how the DFT analyses the frequency components contained in a waveform.
Another post explains logarithms and the decibel as a measurement of sound loudness. The decibel value of an amplitude \( A_1 \) relative to a reference amplitude \( A_0 \) is \( \mathrm{dB} = 20 \times \log_{10}(A_1/A_0) \). Here the reference \( A_0 \) is the maximum amplitude (that is what ref=np.max does in amplitude_to_db), so \( A_1/A_0 \le 1.0 \) and \( \log_{10}(A_1/A_0) \le \log_{10}(1.0) = 0 \). In short, values on this decibel scale are almost always negative, and the lower the decibel value of a frequency, the less energy that frequency carries.
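For example, a frequency component whose amplitude is half of the reference comes out at \( 20 \times \log_{10}(0.5) \approx -6 \) dB, and one at a hundredth of the reference at \( 20 \times \log_{10}(0.01) = -40 \) dB.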
Coming back to the example of the phone [m], the highest amplitude peak appears around 250 Hz and the second highest around 500 Hz.
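Reading peak positions off a figure is imprecise, so here is a small helper I added (not from the original post) that reports the strongest frequency component of a slice, reusing the wav_to_spectrum function defined above:
import numpy as np

def peak_frequency(sliced, idx):
    # return the label and the frequency bin with the highest energy
    label, start, end, seg = sliced[idx]
    freqs, spectrum = wav_to_spectrum(seg, sr)
    return label, freqs[np.argmax(spectrum)]

print(peak_frequency(sliced, 1))  # for [m], this should land near the ~250 Hz peak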
The second non-silence phone of this waveform is the close front vowel [i].
Its figure shows the first formant (a large energy peak) around 250 Hz and the second formant around 3,000 Hz. This page shows a figure of Canadian English vowels and their corresponding first and second formant (F1 and F2) frequencies: https://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/formants.html. Despite the vowels being Canadian English, the F1 and F2 frequencies of [i] are more or less in line with this example of Japanese [i].
Moving on, the 8th and 14th phones (indices 7 and 13 in the Python list) are both the vowel [a].
The second figure shows a lower energy level between 1,000 and 2,000 Hz than the first figure. Outside this frequency range, the energy levels of the two [a] slices look quite similar.
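To make the comparison easier, the two [a] spectra can also be overlaid in one figure. A sketch, assuming as above that indices 7 and 13 of sliced are the two [a] tokens:
plt.figure(figsize=(8, 4))
for idx in (7, 13):
    label, start, end, seg = sliced[idx]
    freqs, spectrum = wav_to_spectrum(seg, sr)
    # plot each [a] slice with its time span in the legend
    plt.plot(freqs, spectrum, alpha=0.8, label=f"{label} {start:.3f}-{end:.3f} s")
plt.title("Spectra of the two [a] slices")
plt.xlabel("Frequency (Hz)")
plt.ylabel("Decibel")
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()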
Summary
This post walked through the analysis of a recorded speech waveform from a Japanese female speaker. The code in this post can plot a waveform, annotate it with phone labels (assuming timestamps and labels are available) and analyse shorter segments of the waveform corresponding to individual phone labels.
Reference
Ryosuke Sonobe, Shinnosuke Takamichi and Hiroshi Saruwatari, "JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv preprint arXiv:1711.00354, 2017.
The list of Japanese vowels and consonants on Wikipedia.