Non-intrusive method for audio quality assessment of lossy-compressed music recordings using convolutional neural networks

Authors

  • Aleksandra Kasperuk, Faculty of Computer Science, Białystok University of Technology
  • Sławomir Krzysztof Zieliński, Faculty of Computer Science, Białystok University of Technology

Abstract

Most existing algorithms for objective audio quality assessment are intrusive, as they require access both to an unimpaired reference recording and to the evaluated signal. This requirement excludes them from many practical applications. In this paper, we introduce a non-intrusive audio quality assessment method. The proposed method is intended to account for audio artefacts arising from the lossy compression of music signals. During its development, 250 high-quality uncompressed music recordings were collated. They were subsequently processed using a selection of five popular audio codecs, yielding a repository of 13,000 audio excerpts representing various levels of audio quality. The proposed non-intrusive method was trained on quality scores obtained with a well-established intrusive model (ViSQOL v3). The performance of the trained model was then evaluated against quality scores obtained in subjective listening tests conducted remotely over the Internet. The listening tests were carried out in compliance with the MUSHRA recommendation (ITU-R BS.1534-3). In this study, the following three convolutional neural networks were compared: (1) a model employing 1D convolutional filters, (2) an Inception-based model, and (3) a VGG-based model. The VGG-based model outperformed the model employing 1D convolutional filters in terms of predicting the listening-test scores, reaching a correlation of 0.893. The performance of the Inception-based model was similar to that of the VGG-based model. Moreover, the VGG-based model outperformed the method based on a stacked gated-recurrent-unit deep learning framework recently introduced by Mumtaz et al. (2022).
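For illustration, the sketch below outlines how a VGG-style network of the kind compared in the paper could map a music excerpt to a single predicted quality score. This is not the authors' published code (that is available in the repository cited in the references); the log-mel front-end, input duration, layer widths, and pooling head are all illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): a VGG-style CNN that
# regresses one quality score from a log-mel spectrogram of a music excerpt.
import torch
import torch.nn as nn
import torchaudio

class VGGQualityNet(nn.Module):
    def __init__(self):
        super().__init__()
        def block(c_in, c_out):
            # Two 3x3 convolutions followed by 2x2 max pooling, as in VGG
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(1, 32), block(32, 64), block(64, 128))
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse frequency/time axes
        self.head = nn.Linear(128, 1)         # single predicted quality score

    def forward(self, x):                     # x: (batch, 1, n_mels, frames)
        z = self.pool(self.features(x)).flatten(1)
        return self.head(z).squeeze(1)

# Assumed front-end: 48 kHz mono audio -> log-mel spectrogram
melspec = torchaudio.transforms.MelSpectrogram(
    sample_rate=48000, n_fft=1024, hop_length=512, n_mels=96)

model = VGGQualityNet()
waveform = torch.randn(1, 48000 * 5)          # placeholder 5-second excerpt
features = torch.log(melspec(waveform) + 1e-6).unsqueeze(1)
score = model(features)                       # trained against ViSQOL v3 targets
```

In the workflow described in the abstract, such a network would be fitted to ViSQOL v3 scores (e.g., with a mean-squared-error loss) and subsequently validated against the MUSHRA listening-test scores.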

References

ITU-R BS.1116-3 Recommendation, “Methods for the subjective assessment of small impairments in audio systems,” International Telecommunication Union, Geneva, 2015.

ITU-R BS.1534-3 Recommendation, “Method for the subjective assessment of intermediate quality level of audio systems,” International Telecommunication Union, Geneva, 2015.

C. Sloan, N. Harte, D. Kelly, A.C. Kokaram, and A. Hines, “Objective Assessment of Perceptual Audio Quality Using ViSQOLAudio,” IEEE Transactions on Broadcasting, vol. 63, pp. 693–705, Dec. 2017. https://doi.org/10.1109/TBC.2017.2704421

S. Kiranyaz, O. Avci, O. Abdeljaber, T. Ince, M. Gabbouj, and D.J. Inman, “1D convolutional neural networks and applications: A survey,” Mechanical Systems and Signal Processing, vol. 151, 107398, 2021. https://doi.org/10.1016/j.ymssp.2020.107398

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed et al., “Going deeper with convolutions,” in Proc. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, pp. 1−9, 2015. https://doi.org/10.1109/CVPR.2015.7298594

K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in Proc. International Conference on Learning Representations (ICLR), arXiv:1409.1556, 2015. https://doi.org/10.48550/arXiv.1409.1556

M. Chinen, F. S. C. Lim, J. Skoglund, N. Gureev, F. O'Gorman, and A. Hines, “ViSQOL v3: An Open Source Production Ready Objective Speech and Audio Metric,” in Proc. 2020 Twelfth International Conference on Quality of Multimedia Experience (QoMEX), Athlone, Ireland, 2020. https://doi.org/10.1109/QoMEX48832.2020.9123150

M. Karjalainen, “A new auditory model for the evaluation of sound quality of audio systems,” in Proc. ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing, Tampa, FL, USA, 1985. https://doi.org/10.1109/ICASSP.1985.1168376

T. Thiede, W. Treurniet, R. Bitto, C. Schmidmer, T. Sporer, J. Beerends, C. Colomes, M. Keyhl, G. Stoll, K. Brandenburg, and B. Feiten, “PEAQ—the ITU standard for objective measurement of perceived audio quality,” J. Audio Eng. Soc., vol. 48, pp. 3−29, 2000. http://www.aes.org/e-lib/browse.cfm?elib=12078

ITU-R BS.1387-2 Recommendation, “Method for objective measurements of perceived audio quality,” International Telecommunication Union, Geneva, 2023.

R. Huber and B. Kollmeier, “PEMO-Q—A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, pp. 1902−1911, 2006. https://doi.org/10.1109/TASL.2006.883259

J. M. Kates and K. H. Arehart, “The Hearing-Aid Audio Quality Index (HAAQI),” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, pp. 354−365, 2016. https://doi.org/10.1109/TASLP.2015.2507858

G. Jiang, A. Biswas, C. Bergler, and A. Maier, “InSE-NET: A Perceptually Coded Audio Quality Model based on CNN,” in Proc. 151st Audio Engineering Society Convention, Online, 2021. http://www.aes.org/e-lib/browse.cfm?elib=21478

P. M. Delgado and J. Herre, “Can We Still Use PEAQ? A Performance Analysis of the ITU Standard for the Objective Assessment of Perceived Audio Quality,” in Proc. Twelfth International Conference on Quality of Multimedia Experience (QoMEX), Athlone, Ireland, 2020. https://doi.org/10.1109/QoMEX48832.2020.9123105

R. E. Zezario, S.-W. Fu, F. Chen, C.-S. Fuh, H.-M. Wang, and Y. Tsao, “Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 54−70, 2023. https://doi.org/10.1109/TASLP.2022.3205757

C. K. A. Reddy, V. Gopal, and R. Cutler, “DNSMOS P.835: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors,” in Proc. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 2022. https://doi.org/10.1109/ICASSP43922.2022.9746108

A. A. Catellier and S. D. Voran, “WAWEnets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality,” in Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020. https://doi.org/10.1109/ICASSP40776.2020.9054204

S.-W. Fu, Y. Tsao, H.-T. Hwang, and H.-M. Wang, “Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM,” in Proc. Interspeech, Hyderabad, India, pp. 1873−1877, 2018. https://doi.org/10.48550/arXiv.1808.05344

G. Mittag, B. Naderi, A. Chehadi, and S. Möller, “NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets,” in Proc. Interspeech, Brno, Czechia, pp. 2127−2131, 2021. https://doi.org/10.21437/Interspeech.2021-299

C. Sørensen, J. B. Boldt, and M. G. Christensen, “Validation of the Non-Intrusive Codebook-based Short Time Objective Intelligibility Metric for Processed Speech,” in Proc. Interspeech, Graz, Austria, pp. 4270−4274, 2019. https://doi.org/10.21437/Interspeech.2019-1625

D. Mumtaz, V. Jakhetiya, K. Nathwani, B. N. Subudhi, and S. C. Guntuku, “Nonintrusive Perceptual Audio Quality Assessment for User-Generated Content Using Deep Learning,” IEEE Transactions on Industrial Informatics, vol. 18, pp. 7780−7789, 2022. https://doi.org/10.1109/TII.2021.3139010

K. Organiściak and J. Borkowski, “Single-ended quality measurement of a music content via convolutional recurrent neural networks,” Metrology and Measurement Systems, vol. 27, pp. 721−733, 2020. https://doi.org/10.24425/mms.2020.134849

EBU R.128 Recommendation, “Loudness normalization and permitted maximum level of audio signals,” European Broadcasting Union, Geneva, 2020.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision, vol. 115, pp. 211–252, 2015. https://doi.org/10.1007/s11263-015-0816-y

D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in Proc. 3rd International Conference on Learning Representations (ICLR 2015), San Diego, CA, USA, pp. 1–15, 2015. https://doi.org/10.48550/arXiv.1412.6980

A. Kasperuk, “Software repository. Nonintrusive audio quality assessment ISSET2023,” GitHub, https://github.com/WaitWhatSon/nonintrusive_audio_quality_assessment_isset2023 (accessed on August 18, 2023).

M. Schoeffler, F. Stöter, B. Edler, and J. Herre, “Towards the Next Generation of Web-based Experiments: A Case Study Assessing Basic Audio Quality Following the ITU-R Recommendation BS.1534 (MUSHRA),” in Proc. 1st Web Audio Conference, Paris, France, 2015.

“The 'Mixing Secrets' Free Multitrack Download Library,” Cambridge Music Technology, https://cambridge-mt.com/ms/mtk/ (accessed on June 10, 2023).

R. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. P. Bello, “MedleyDB: A Multitrack Dataset for Annotation-Intensive MIR Research,” in Proc. 15th International Society for Music Information Retrieval Conference, Taipei, Taiwan, 2014.

S. K. Zieliński, “On Some Biases Encountered in Modern Audio Quality Listening Tests (Part 2): Selected Graphical Examples and Discussion,” J. Audio Eng. Soc., vol. 64, pp. 55−74, 2016. http://www.aes.org/e-lib/browse.cfm?elib=18105

Published

2024-06-20

Section

ARTICLES / PAPERS / General