A Replication Dataset for Fundamental Frequency Estimation

Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved. Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech...


Bibliographic Details
Main Author: Bechtold, Bastian
Format: Dataset
Language: English
Published: Zenodo 2020
Subjects:
Online Access: https://dx.doi.org/10.5281/zenodo.3904389
https://zenodo.org/record/3904389
id ftdatacite:10.5281/zenodo.3904389
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language English
topic signal processing
audio
speech
pitch
fundamental frequency
spellingShingle signal processing
audio
speech
pitch
fundamental frequency
Bechtold, Bastian
A Replication Dataset for Fundamental Frequency Estimation
topic_facet signal processing
audio
speech
pitch
fundamental frequency
description Part of the dissertation Pitch of Voiced Speech in the Short-Time Fourier Transform: Algorithms, Ground Truths, and Evaluation Methods. © 2020, Bastian Bechtold. All rights reserved. Estimating the fundamental frequency of speech remains an active area of research, with varied applications in speech recognition, speaker identification, and speech compression. A vast number of algorithms for estimating this quantity have been proposed over the years, and a number of speech and noise corpora have been developed for evaluating their performance. The present dataset contains estimated fundamental frequency tracks of 25 algorithms, six speech corpora, and two noise corpora, at nine signal-to-noise ratios between -20 and 20 dB SNR, as well as an additional evaluation of synthetic harmonic tone complexes in white noise. The dataset also contains pre-calculated performance measures, both novel and traditional, in reference to each speech corpus’ ground truth, the algorithms’ own clean-speech estimate, and our own consensus truth. It can thus serve as the basis for a comparison study, as a larger dataset from which to replicate existing studies, or as a reference for developing new fundamental frequency estimation algorithms. All source code and data are available for download and are entirely reproducible, albeit requiring about one year of processor time.
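The record does not say how speech and noise were combined, so the following is only a minimal sketch of one common way to mix a speech signal with noise at a target SNR. The function name `mix_at_snr` is hypothetical, the power-based SNR definition is an assumption, and the even 5 dB spacing of the nine SNR values is a guess that merely fits "nine signal-to-noise ratios between -20 and 20 dB".

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the power ratio equals `snr_db`.

    Hypothetical helper, not taken from the dataset's code. Assumes `noise`
    is at least as long as `speech` and that neither signal is all zeros.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # SNR in dB is 10 * log10(speech_power / noise_power), so solve for the
    # noise power that yields the requested ratio and rescale accordingly.
    target_noise_power = speech_power / 10 ** (snr_db / 10)
    gain = np.sqrt(target_noise_power / noise_power)
    return speech + gain * noise


# Nine SNR values from -20 to 20 dB; the even spacing is an assumption.
snrs_db = np.linspace(-20, 20, 9)
```

Scaling the noise rather than the speech keeps the speech level constant across conditions; whether the dataset does the same is not stated in this record.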
Included Code and Data

ground truth data.zip is a JBOF dataset of fundamental frequency estimates and ground truths of all speech files in the following corpora:
  CMU-ARCTIC (consensus truth) [1]
  FDA (corpus truth and consensus truth) [2]
  KEELE (corpus truth and consensus truth) [3]
  MOCHA-TIMIT (consensus truth) [4]
  PTDB-TUG (corpus truth and consensus truth) [5]
  TIMIT (consensus truth) [6]

noisy speech data.zip is a JBOF dataset of fundamental frequency estimates of speech files mixed with noise from the following corpora:
  NOISEX [7]
  QUT-NOISE [8]

synthetic speech data.zip is a JBOF dataset of fundamental frequency estimates of synthetic harmonic tone complexes in white noise.

noisy_speech.pkl and synthetic_speech.pkl are pickled Pandas dataframes of performance metrics derived from the above data for the following list of fundamental frequency estimation algorithms:
  AUTOC [9]
  AMDF [10]
  BANA [11]
  CEP [12]
  CREPE [13]
  DIO [14]
  DNN [15]
  KALDI [16]
  MAPS
  MBSC [17]
  NLS [18]
  PEFAC [19]
  PRAAT [20]
  RAPT [21]
  SACC [22]
  SAFE [23]
  SHR [24]
  SIFT [25]
  SRH [26]
  STRAIGHT [27]
  SWIPE [28]
  YAAPT [29]
  YIN [30]

noisy speech evaluation.py and synthetic speech evaluation.py are Python programs to calculate the above Pandas dataframes from the above JBOF datasets. They calculate the following performance measures:
  Gross Pitch Error (GPE), the percentage of pitches where the estimated pitch deviates from the true pitch by more than 20%.
  Fine Pitch Error (FPE), the mean error of grossly correct estimates.
  High/Low Octave Pitch Error (OPE), the percentage of pitches that are GPEs and happen to be at an integer multiple of the true pitch.
  Gross Remaining Error (GRE), the percentage of pitches that are GPEs but not OPEs.
  Fine Remaining Bias (FRB), the median error of GREs.
  True Positive Rate (TPR), the percentage of true positive voicing estimates.
  False Positive Rate (FPR), the percentage of false positive voicing estimates.
  False Negative Rate (FNR), the percentage of false negative voicing estimates.
  F₁, the harmonic mean of precision and recall of the voicing decision.

Pipfile is a pipenv-compatible pipfile for installing all prerequisites necessary for running the above Python programs.

The Python programs take about an hour to compute on a fast 2019 computer, and require at least 32 GB of memory.

References:
  [1] John Kominek and Alan W Black. CMU ARCTIC database for speech synthesis, 2003.
  [2] Paul C Bagshaw, Steven Hiller, and Mervyn A Jack. Enhanced Pitch Tracking and the Processing of F0 Contours for Computer Aided Intonation Teaching. In EUROSPEECH, 1993.
  [3] F Plante, Georg F Meyer, and William A Ainsworth. A Pitch Extraction Reference Database. In Fourth European Conference on Speech Communication and Technology, pages 837–840, Madrid, Spain, 1995.
  [4] Alan Wrench. MOCHA MultiCHannel Articulatory database: English, November 1999.
  [5] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf. A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario. page 4, 2011.
  [6] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. TIMIT Acoustic-Phonetic Continuous Speech Corpus, 1993.
  [7] Andrew Varga and Herman J.M. Steeneken. Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Communication, 12(3):247–251, July 1993.
  [8] David B. Dean, Sridha Sridharan, Robert J. Vogt, and Michael W. Mason. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. Proceedings of Interspeech 2010, 2010.
  [9] Man Mohan Sondhi. New methods of pitch extraction. Audio and Electroacoustics, IEEE Transactions on, 16(2):262–266, 1968.
  [10] Myron J. Ross, Harry L. Shaffer, Asaf Cohen, Richard Freudberg, and Harold J. Manley. Average magnitude difference function pitch extractor. Acoustics, Speech and Signal Processing, IEEE Transactions on, 22(5):353–362, 1974.
  [11] Na Yang, He Ba, Weiyang Cai, Ilker Demirkol, and Wendi Heinzelman. BaNa: A Noise Resilient Fundamental Frequency Detection Algorithm for Speech and Music. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):1833–1848, December 2014.
  [12] Michael Noll. Cepstrum Pitch Determination. The Journal of the Acoustical Society of America, 41(2):293–309, 1967.
  [13] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello. CREPE: A Convolutional Representation for Pitch Estimation. arXiv:1802.06182 [cs, eess, stat], February 2018.
  [14] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications. IEICE Transactions on Information and Systems, E99.D(7):1877–1884, 2016.
  [15] Kun Han and DeLiang Wang. Neural Network Based Pitch Tracking in Very Noisy Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(12):2158–2168, December 2014.
  [16] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur. A pitch extraction algorithm tuned for automatic speech recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 2494–2498. IEEE, 2014.
  [17] Lee Ngee Tan and Abeer Alwan. Multi-band summary correlogram-based pitch detection for noisy speech. Speech Communication, 55(7-8):841–856, September 2013.
  [18] Jesper Kjær Nielsen, Tobias Lindstrøm Jensen, Jesper Rindom Jensen, Mads Græsbøll Christensen, and Søren Holdt Jensen. Fast fundamental frequency estimation: Making a statistically efficient estimator computationally efficient. Signal Processing, 135:188–197, June 2017.
  [19] Sira Gonzalez and Mike Brookes. PEFAC - A Pitch Estimation Algorithm Robust to High Levels of Noise. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(2):518–530, February 2014.
  [20] Paul Boersma. Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. In Proceedings of the institute of phonetic sciences, volume 17, pages 97–110. Amsterdam, 1993.
  [21] David Talkin. A robust algorithm for pitch tracking (RAPT). Speech coding and synthesis, 495:518, 1995.
  [22] Byung Suk Lee and Daniel PW Ellis. Noise robust pitch tracking by subband autocorrelation classification. In Interspeech, pages 707–710, 2012.
  [23] Wei Chu and Abeer Alwan. SAFE: a statistical algorithm for F0 estimation for both clean and noisy speech. In INTERSPEECH, pages 2590–2593, 2010.
  [24] Xuejing Sun. Pitch determination and voice quality analysis using subharmonic-to-harmonic ratio. In Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, volume 1, page I–333. IEEE, 2002.
  [25] Markel. The SIFT algorithm for fundamental frequency estimation. IEEE Transactions on Audio and Electroacoustics, 20(5):367–377, December 1972.
  [26] Thomas Drugman and Abeer Alwan. Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics. In Interspeech, pages 1973–1976, 2011.
  [27] Hideki Kawahara, Masanori Morise, Toru Takahashi, Ryuichi Nisimura, Toshio Irino, and Hideki Banno. TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3933–3936. IEEE, 2008.
  [28] Arturo Camacho. SWIPE: A sawtooth waveform inspired pitch estimator for speech and music. PhD thesis, University of Florida, 2007.
  [29] Kavita Kasi and Stephen A. Zahorian. Yet Another Algorithm for Pitch Tracking. In IEEE International Conference on Acoustics Speech and Signal Processing, pages I–361–I–364, Orlando, FL, USA, May 2002. IEEE.
  [30] Alain de Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917, 2002.
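To make the performance measures in the description concrete, here is a minimal sketch of GPE, FPE, and the voicing scores. It is not the code from the dataset's evaluation programs: the function names `pitch_errors` and `voicing_scores` are hypothetical, FPE is taken here as the mean absolute relative error (the record does not state the unit or sign convention), and the voicing rates use the conventional confusion-matrix denominators, which is also an assumption.

```python
import numpy as np


def pitch_errors(true_f0, est_f0, threshold=0.2):
    """Return (GPE, FPE) in percent for frames with a known true pitch.

    Hypothetical helper. A frame is a gross error when the estimate deviates
    from the true pitch by more than `threshold` (20% in the description).
    """
    true_f0 = np.asarray(true_f0, dtype=float)
    est_f0 = np.asarray(est_f0, dtype=float)

    rel_error = np.abs(est_f0 - true_f0) / true_f0
    gross = rel_error > threshold

    gpe = 100.0 * np.mean(gross)
    # FPE only looks at the grossly correct frames (assumed: relative error).
    fpe = 100.0 * np.mean(rel_error[~gross]) if np.any(~gross) else float("nan")
    return gpe, fpe


def voicing_scores(true_voiced, est_voiced):
    """Return TPR, FPR, FNR, and F1 of a binary voicing decision.

    Hypothetical helper. Assumes both voiced and unvoiced frames occur in
    `true_voiced` and at least one frame is estimated as voiced, so that no
    denominator is zero.
    """
    true_voiced = np.asarray(true_voiced, dtype=bool)
    est_voiced = np.asarray(est_voiced, dtype=bool)

    tp = np.sum(true_voiced & est_voiced)
    fp = np.sum(~true_voiced & est_voiced)
    fn = np.sum(true_voiced & ~est_voiced)
    tn = np.sum(~true_voiced & ~est_voiced)

    recall = tp / (tp + fn)          # identical to the true positive rate
    precision = tp / (tp + fp)
    return {
        "TPR": recall,
        "FPR": fp / (fp + tn),
        "FNR": fn / (tp + fn),
        "F1": 2 * precision * recall / (precision + recall),
    }
```

The pre-calculated values in noisy_speech.pkl and synthetic_speech.pkl remain the reference; this sketch only illustrates the definitions and may differ from them in detail.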
format Dataset
author Bechtold, Bastian
author_facet Bechtold, Bastian
author_sort Bechtold, Bastian
title A Replication Dataset for Fundamental Frequency Estimation
title_short A Replication Dataset for Fundamental Frequency Estimation
title_full A Replication Dataset for Fundamental Frequency Estimation
title_fullStr A Replication Dataset for Fundamental Frequency Estimation
title_full_unstemmed A Replication Dataset for Fundamental Frequency Estimation
title_sort replication dataset for fundamental frequency estimation
publisher Zenodo
publishDate 2020
url https://dx.doi.org/10.5281/zenodo.3904389
https://zenodo.org/record/3904389
long_lat ENVELOPE(47.867,47.867,-67.967,-67.967)
ENVELOPE(-58.250,-58.250,-63.917,-63.917)
ENVELOPE(-63.717,-63.717,-64.283,-64.283)
ENVELOPE(8.107,8.107,62.667,62.667)
ENVELOPE(-56.933,-56.933,-64.333,-64.333)
ENVELOPE(65.307,65.307,-70.509,-70.509)
geographic Arctic
Christensen
Gonzalez
Pablo
Sira
Bello
Mervyn
geographic_facet Arctic
Christensen
Gonzalez
Pablo
Sira
Bello
Mervyn
genre Arctic
genre_facet Arctic
op_relation https://dx.doi.org/10.5281/zenodo.3904388
op_rights Open Access
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
CC-BY-4.0
info:eu-repo/semantics/openAccess
op_rightsnorm CC-BY
op_doi https://doi.org/10.5281/zenodo.3904389
https://doi.org/10.5281/zenodo.3904388
_version_ 1766350428699623424