Robust speech parametrization based on pitch synchronized cepstral solutions

Authors

  • Stanisław Gmyrek, Wroclaw University of Science and Technology, Department of Acoustics, Multimedia and Signal Processing
  • Robert Hossa, Wroclaw University of Science and Technology, Department of Acoustics, Multimedia and Signal Processing

Abstract

In general, the speech signal can be described by the excitation signal, the impulse response of the vocal tract, and a system that models the radiation of speech through the lips. The characteristics of the vocal tract primarily shape the semantic content of speech. However, the irregular periodicity of glottal excitation is a significant source of distortions (ripples) in the amplitude spectrum of voiced speech. In this study, a PS-STFT (Pitch-Synchronized Short-Time Fourier Transform) method is proposed to obtain a reliable amplitude spectrum of the vocal tract. Subsequently, a set of cepstral coefficient vectors, PS-HFCC (Pitch Synchronized Human Factor Cepstral Coefficients), chosen as a representative of the commonly used classical cepstral parameterization methods, was analyzed to investigate their statistical properties after correction. The GMM (Gaussian Mixture Model), widely accepted in speech recognition applications, was chosen as the statistical acoustic model of individual Polish speech phonemes. To evaluate the quality of the proposed method, the distances between the multivariate probability distributions in GMM form were calculated. Modifying classical cepstral methods to analyze variable-length signal frames synchronized to the fundamental period reduced the variance of the cepstral coefficient estimators, which increased the distances between the probability distributions and, consequently, improved classification results.
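
The idea of analyzing variable-length frames synchronized to the fundamental period can be illustrated with a minimal sketch. It is not the authors' PS-STFT implementation: the function name `ps_stft`, the two-period frame span, the Hann window, the zero-padded FFT size, and the synthetic pitch marks are all assumptions made for illustration, and the pitch marks are taken as given rather than estimated.

```python
import numpy as np

def ps_stft(signal, pitch_marks, n_fft=1024, periods_per_frame=2):
    """Amplitude spectra of variable-length frames bounded by pitch marks.

    pitch_marks: increasing sample indices of glottal cycle boundaries
    (assumed known here; estimating them is outside this sketch).
    """
    spectra = []
    for i in range(len(pitch_marks) - periods_per_frame):
        start, end = pitch_marks[i], pitch_marks[i + periods_per_frame]
        # Frame length follows the local pitch period instead of being fixed.
        frame = signal[start:end] * np.hanning(end - start)
        # Zero-pad to n_fft so every frame yields the same number of bins
        # (frames longer than n_fft would be truncated by rfft).
        spectra.append(np.abs(np.fft.rfft(frame, n=n_fft)))
    return np.array(spectra)

# Hypothetical voiced-like excitation with a slightly irregular period.
periods = [160, 162, 158, 161, 159, 160]      # samples per glottal cycle
pitch_marks = np.concatenate(([0], np.cumsum(periods)))
signal = np.zeros(pitch_marks[-1])
signal[pitch_marks[:-1]] = 1.0                # one impulse per cycle

spectra = ps_stft(signal, pitch_marks)
print(spectra.shape)  # (5, 513)
```

Each row of `spectra` is the amplitude spectrum of one pitch-synchronized frame; a cepstral front end would then apply its filter bank and cepstral transform to these rows in place of fixed-length STFT frames.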

Published

2025-07-09

Section

Acoustics