Acoustic features for speech processing
Introduction
This post summarises acoustic features for various speech processing tasks. Automatic speech recognition (ASR) is one of the most studied of these tasks. Acoustic features for ASR include Mel-frequency cepstral coefficients (MFCCs) and spectrogram-based features such as Mel-spectrograms and Mel-filterbanks. The choice of acoustic features depends on the choice of ASR model. Traditional machine learning (ML) models such as Gaussian mixture models (GMMs), which are typically used with diagonal covariance matrices, have difficulty handling correlated features, so MFCCs are favoured for their decorrelated coefficients. More recent deep learning based models, e.g., Conformer, use filterbanks, acoustic feature vectors whose neighbouring dimensions are correlated. Whisper, a popular model from OpenAI, directly takes as input a log-Mel spectrogram, which is technically the same representation as filterbanks (this will be explained later). ...
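The relationship between the two feature families can be sketched in a few lines of plain Python. This is a minimal illustration, not the pipeline of any particular toolkit or model: the HTK-style Mel formula, the 16 kHz sample rate, the 512-point FFT, the 40 Mel bands, the 13 cepstral coefficients, and all function names are assumptions chosen for the example. It shows that filterbank features are log energies in triangular Mel bands, and that MFCCs are obtained by applying a DCT to those same log energies, which is the step that decorrelates them.

```python
import math


def hz_to_mel(hz):
    # HTK-style Mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters, one row per Mel band, over n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    lo, hi = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_mels + 2 band edges, equally spaced on the Mel scale.
    edges = [mel_to_hz(lo + (hi - lo) * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        left, centre, right = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - left) / (centre - left)
            falling = (right - f) / (right - centre)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel(power_spectrum, bank, floor=1e-10):
    """Log energy per Mel band: the filterbank (log-Mel) feature vector."""
    return [
        math.log(max(floor, sum(w * p for w, p in zip(row, power_spectrum))))
        for row in bank
    ]


def dct2(x, n_coeffs):
    """Unnormalised DCT-II, keeping the first n_coeffs coefficients."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * k * (i + 0.5) / n) for i in range(n))
        for k in range(n_coeffs)
    ]


# Toy input: a flat power spectrum standing in for one analysis frame.
bank = mel_filterbank(n_mels=40, n_fft=512, sample_rate=16000)
frame_power = [1.0] * (512 // 2 + 1)

fbank_features = log_mel(frame_power, bank)   # 40 correlated dimensions
mfcc_features = dct2(fbank_features, 13)      # 13 decorrelated coefficients
```

Stacking `fbank_features` over successive frames gives a log-Mel spectrogram, which is why the two names refer to the same representation. A real front end would also include framing, windowing and the FFT that produces `frame_power`, all of which are omitted here.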