SUPERSEDED - The Edinburgh International Accents of English Corpus
## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a...
Main Authors: | Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter |
---|---|
Other Authors: | |
Format: | Dataset |
Language: | English |
Published: | University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research, 2022 |
Subjects: | |
Online Access: | https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 |
id |
ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
Edinburgh DataShare (University of Edinburgh) |
op_collection_id |
ftuedinburgheds |
language |
English |
topic |
conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language |
spellingShingle |
conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter SUPERSEDED - The Edinburgh International Accents of English Corpus |
topic_facet |
conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language |
description |
## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the great many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets which fail to represent the diversity of English as spoken today around the globe. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best-performing model, trained on 680 thousand hours of transcribed data, obtains an average of 19.7% WER -- in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA. README.txt |
author2 |
University of Edinburgh Sanabria, Ramon |
format |
Dataset |
author |
Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter |
author_facet |
Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter |
author_sort |
Sanabria, Ramon |
title |
SUPERSEDED - The Edinburgh International Accents of English Corpus |
title_short |
SUPERSEDED - The Edinburgh International Accents of English Corpus |
title_full |
SUPERSEDED - The Edinburgh International Accents of English Corpus |
title_fullStr |
SUPERSEDED - The Edinburgh International Accents of English Corpus |
title_full_unstemmed |
SUPERSEDED - The Edinburgh International Accents of English Corpus |
title_sort |
superseded - the edinburgh international accents of english corpus |
publisher |
University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research |
publishDate |
2022 |
url |
https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 |
op_coverage |
Global UK UNITED KINGDOM AF AFGHANISTAN AX ÅLAND ISLANDS AL ALBANIA DZ ALGERIA AS AMERICAN SAMOA AD ANDORRA AO ANGOLA AI ANGUILLA AQ ANTARCTICA AG ANTIGUA AND BARBUDA AR ARGENTINA AM ARMENIA AW ARUBA AU AUSTRALIA AT AUSTRIA AZ AZERBAIJAN BS BAHAMAS BH BAHRAIN BD BANGLADESH BB BARBADOS BY BELARUS BE BELGIUM BZ BELIZE BJ BENIN BM BERMUDA BT BHUTAN BO BOLIVIA, PLURINATIONAL STATE OF BQ BONAIRE, SINT EUSTATIUS AND SABA BA BOSNIA AND HERZEGOVINA BW BOTSWANA BV BOUVET ISLAND BR BRAZIL IO BRITISH INDIAN OCEAN TERRITORY BN BRUNEI DARUSSALAM BG BULGARIA BF BURKINA FASO BI BURUNDI KH CAMBODIA CM CAMEROON CA CANADA CV CAPE VERDE KY CAYMAN ISLANDS CF CENTRAL AFRICAN REPUBLIC TD CHAD CL CHILE CN CHINA CX CHRISTMAS ISLAND CC COCOS (KEELING) ISLANDS CO COLOMBIA KM COMOROS CG CONGO CD CONGO, THE DEMOCRATIC REPUBLIC OF THE CK COOK ISLANDS CR COSTA RICA CI CÔTE D'IVOIRE HR CROATIA CU CUBA CW CURAÇAO CY CYPRUS CZ CZECHIA DK DENMARK DJ DJIBOUTI DM DOMINICA DO DOMINICAN REPUBLIC EC ECUADOR EG EGYPT SV EL SALVADOR GQ EQUATORIAL GUINEA ER ERITREA EE ESTONIA SZ ESWATINI ET ETHIOPIA FK FALKLAND ISLANDS (MALVINAS) FO FAROE ISLANDS FJ FIJI FI FINLAND FR FRANCE GF FRENCH GUIANA PF FRENCH POLYNESIA TF FRENCH SOUTHERN TERRITORIES GA GABON GM GAMBIA GE GEORGIA DE GERMANY GH GHANA GI GIBRALTAR GR GREECE GL GREENLAND GD GRENADA GP GUADELOUPE GU GUAM GT GUATEMALA GG GUERNSEY GN GUINEA GW GUINEA-BISSAU GY GUYANA HT HAITI HM HEARD ISLAND AND MCDONALD ISLANDS VA HOLY SEE HN HONDURAS HK HONG KONG HU HUNGARY IS ICELAND IN INDIA ID INDONESIA IR IRAN, ISLAMIC REPUBLIC OF IQ IRAQ IE IRELAND IM ISLE OF MAN IL ISRAEL IT ITALY JM JAMAICA JP JAPAN JE JERSEY JO JORDAN KZ KAZAKHSTAN KE KENYA KI KIRIBATI KP KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF KR KOREA, REPUBLIC OF KW KUWAIT KG KYRGYZSTAN LA LAO PEOPLE'S DEMOCRATIC REPUBLIC LV LATVIA LB LEBANON LS LESOTHO LR LIBERIA LY LIBYA LI LIECHTENSTEIN LT LITHUANIA LU LUXEMBOURG MO MACAO MG MADAGASCAR MW MALAWI MY MALAYSIA MV MALDIVES ML MALI MT MALTA MH MARSHALL ISLANDS 
MQ MARTINIQUE MR MAURITANIA MU MAURITIUS YT MAYOTTE MX MEXICO FM MICRONESIA, FEDERATED STATES OF MD MOLDOVA, REPUBLIC OF MC MONACO MN MONGOLIA ME MONTENEGRO MS MONTSERRAT MA MOROCCO MZ MOZAMBIQUE MM MYANMAR NA NAMIBIA NR NAURU NP NEPAL NL NETHERLANDS NC NEW CALEDONIA NZ NEW ZEALAND NI NICARAGUA NE NIGER NG NIGERIA NU NIUE NF NORFOLK ISLAND MK NORTH MACEDONIA MP NORTHERN MARIANA ISLANDS NO NORWAY OM OMAN PK PAKISTAN PW PALAU PS PALESTINE, STATE OF PA PANAMA PG PAPUA NEW GUINEA PY PARAGUAY PE PERU PH PHILIPPINES PN PITCAIRN PL POLAND PT PORTUGAL PR PUERTO RICO QA QATAR RE RÉUNION RO ROMANIA RU RUSSIAN FEDERATION RW RWANDA BL SAINT BARTHÉLEMY SH SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA KN SAINT KITTS AND NEVIS LC SAINT LUCIA MF SAINT MARTIN (FRENCH PART) PM SAINT PIERRE AND MIQUELON VC SAINT VINCENT AND THE GRENADINES WS SAMOA SM SAN MARINO ST SAO TOME AND PRINCIPE SA SAUDI ARABIA SN SENEGAL RS SERBIA SC SEYCHELLES SL SIERRA LEONE SG SINGAPORE SX SINT MAARTEN (DUTCH PART) SK SLOVAKIA SI SLOVENIA SB SOLOMON ISLANDS SO SOMALIA ZA SOUTH AFRICA GS SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS SS SOUTH SUDAN ES SPAIN LK SRI LANKA SD SUDAN SR SURINAME SJ SVALBARD AND JAN MAYEN SE SWEDEN CH SWITZERLAND SY SYRIAN ARAB REPUBLIC TW TAIWAN, PROVINCE OF CHINA TJ TAJIKISTAN TZ TANZANIA, UNITED REPUBLIC OF TH THAILAND TL TIMOR-LESTE TG TOGO TK TOKELAU TO TONGA TT TRINIDAD AND TOBAGO TN TUNISIA TR TURKEY TM TURKMENISTAN TC TURKS AND CAICOS ISLANDS TV TUVALU UG UGANDA UA UKRAINE AE UNITED ARAB EMIRATES US UNITED STATES UM UNITED STATES MINOR OUTLYING ISLANDS UY URUGUAY UZ UZBEKISTAN VU VANUATU VE VENEZUELA, BOLIVARIAN REPUBLIC OF VN VIET NAM VG VIRGIN ISLANDS, BRITISH VI VIRGIN ISLANDS, U.S. WF WALLIS AND FUTUNA EH WESTERN SAHARA YE YEMEN ZM ZAMBIA ZW ZIMBABWE |
long_lat |
ENVELOPE(3.358,3.358,-54.422,-54.422) ENVELOPE(3.358,3.358,-54.422,-54.422) ENVELOPE(-68.267,-68.267,-69.317,-69.317) ENVELOPE(73.510,73.510,-53.117,-53.117) ENVELOPE(73.510,73.510,-53.117,-53.117) ENVELOPE(72.600,72.600,-53.033,-53.033) ENVELOPE(149.417,149.417,66.617,66.617) ENVELOPE(-59.515,-59.515,50.600,50.600) ENVELOPE(-33.000,-33.000,-56.000,-56.000) ENVELOPE(20.000,20.000,78.000,78.000) ENVELOPE(7.990,7.990,63.065,63.065) ENVELOPE(-60.734,-60.734,-63.816,-63.816) ENVELOPE(140.900,140.900,-66.735,-66.735) |
geographic |
Argentina Bouvet Bouvet Island Canada Faroe Islands Greenland Guernsey Heard Heard Island Heard Island Indian Jan Mayen McDonald Islands New Zealand Norway Saba Saint-Vincent Sandwich Islands South Georgia South Sandwich Islands Svalbard Svalbard Tonga Trinidad Tristan Uruguay |
geographic_facet |
Argentina Bouvet Bouvet Island Canada Faroe Islands Greenland Guernsey Heard Heard Island Heard Island Indian Jan Mayen McDonald Islands New Zealand Norway Saba Saint-Vincent Sandwich Islands South Georgia South Sandwich Islands Svalbard Svalbard Tonga Trinidad Tristan Uruguay |
genre |
Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard |
genre_facet |
Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard |
op_relation |
Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter. (2022). The Edinburgh International Accents of English Corpus, [dataset]. University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research. https://doi.org/10.7488/ds/3780. https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 |
op_rights |
CC-BY-SA |
op_doi |
https://doi.org/10.7488/ds/3780 |
_version_ |
1772810197847244800 |
spelling |
ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766 2023-07-30T03:59:24+02:00 SUPERSEDED - The Edinburgh International Accents of English Corpus Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter University of Edinburgh Sanabria, Ramon Global UK UNITED KINGDOM AF AFGHANISTAN AX ÅLAND ISLANDS AL ALBANIA DZ ALGERIA AS AMERICAN SAMOA AD ANDORRA AO ANGOLA AI ANGUILLA AQ ANTARCTICA AG ANTIGUA AND BARBUDA AR ARGENTINA AM ARMENIA AW ARUBA AU AUSTRALIA AT AUSTRIA AZ AZERBAIJAN BS BAHAMAS BH BAHRAIN BD BANGLADESH BB BARBADOS BY BELARUS BE BELGIUM BZ BELIZE BJ BENIN BM BERMUDA BT BHUTAN BO BOLIVIA, PLURINATIONAL STATE OF BQ BONAIRE, SINT EUSTATIUS AND SABA BA BOSNIA AND HERZEGOVINA BW BOTSWANA BV BOUVET ISLAND BR BRAZIL IO BRITISH INDIAN OCEAN TERRITORY BN BRUNEI DARUSSALAM BG BULGARIA BF BURKINA FASO BI BURUNDI KH CAMBODIA CM CAMEROON CA CANADA CV CAPE VERDE KY CAYMAN ISLANDS CF CENTRAL AFRICAN REPUBLIC TD CHAD CL CHILE CN CHINA CX CHRISTMAS ISLAND CC COCOS (KEELING) ISLANDS CO COLOMBIA KM COMOROS CG CONGO CD CONGO, THE DEMOCRATIC REPUBLIC OF THE CK COOK ISLANDS CR COSTA RICA CI CÔTE D'IVOIRE HR CROATIA CU CUBA CW CURAÇAO CY CYPRUS CZ CZECHIA DK DENMARK DJ DJIBOUTI DM DOMINICA DO DOMINICAN REPUBLIC EC ECUADOR EG EGYPT SV EL SALVADOR GQ EQUATORIAL GUINEA ER ERITREA EE ESTONIA SZ ESWATINI ET ETHIOPIA FK FALKLAND ISLANDS (MALVINAS) FO FAROE ISLANDS FJ FIJI FI FINLAND FR FRANCE GF FRENCH GUIANA PF FRENCH POLYNESIA TF FRENCH SOUTHERN TERRITORIES GA GABON GM GAMBIA GE GEORGIA DE GERMANY GH GHANA GI GIBRALTAR GR GREECE GL GREENLAND GD GRENADA GP GUADELOUPE GU GUAM GT GUATEMALA GG GUERNSEY GN GUINEA GW GUINEA-BISSAU GY GUYANA HT HAITI HM HEARD ISLAND AND MCDONALD ISLANDS VA HOLY SEE HN HONDURAS HK HONG KONG HU HUNGARY IS ICELAND IN INDIA ID INDONESIA IR IRAN, ISLAMIC REPUBLIC OF IQ IRAQ IE IRELAND IM ISLE OF MAN IL ISRAEL IT ITALY JM JAMAICA JP JAPAN JE JERSEY JO JORDAN KZ KAZAKHSTAN KE KENYA KI KIRIBATI KP KOREA, DEMOCRATIC PEOPLE'S REPUBLIC 
OF KR KOREA, REPUBLIC OF KW KUWAIT KG KYRGYZSTAN LA LAO PEOPLE'S DEMOCRATIC REPUBLIC LV LATVIA LB LEBANON LS LESOTHO LR LIBERIA LY LIBYA LI LIECHTENSTEIN LT LITHUANIA LU LUXEMBOURG MO MACAO MG MADAGASCAR MW MALAWI MY MALAYSIA MV MALDIVES ML MALI MT MALTA MH MARSHALL ISLANDS MQ MARTINIQUE MR MAURITANIA MU MAURITIUS YT MAYOTTE MX MEXICO FM MICRONESIA, FEDERATED STATES OF MD MOLDOVA, REPUBLIC OF MC MONACO MN MONGOLIA ME MONTENEGRO MS MONTSERRAT MA MOROCCO MZ MOZAMBIQUE MM MYANMAR NA NAMIBIA NR NAURU NP NEPAL NL NETHERLANDS NC NEW CALEDONIA NZ NEW ZEALAND NI NICARAGUA NE NIGER NG NIGERIA NU NIUE NF NORFOLK ISLAND MK NORTH MACEDONIA MP NORTHERN MARIANA ISLANDS NO NORWAY OM OMAN PK PAKISTAN PW PALAU PS PALESTINE, STATE OF PA PANAMA PG PAPUA NEW GUINEA PY PARAGUAY PE PERU PH PHILIPPINES PN PITCAIRN PL POLAND PT PORTUGAL PR PUERTO RICO QA QATAR RE RÉUNION RO ROMANIA RU RUSSIAN FEDERATION RW RWANDA BL SAINT BARTHÉLEMY SH SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA KN SAINT KITTS AND NEVIS LC SAINT LUCIA MF SAINT MARTIN (FRENCH PART) PM SAINT PIERRE AND MIQUELON VC SAINT VINCENT AND THE GRENADINES WS SAMOA SM SAN MARINO ST SAO TOME AND PRINCIPE SA SAUDI ARABIA SN SENEGAL RS SERBIA SC SEYCHELLES SL SIERRA LEONE SG SINGAPORE SX SINT MAARTEN (DUTCH PART) SK SLOVAKIA SI SLOVENIA SB SOLOMON ISLANDS SO SOMALIA ZA SOUTH AFRICA GS SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS SS SOUTH SUDAN ES SPAIN LK SRI LANKA SD SUDAN SR SURINAME SJ SVALBARD AND JAN MAYEN SE SWEDEN CH SWITZERLAND SY SYRIAN ARAB REPUBLIC TW TAIWAN, PROVINCE OF CHINA TJ TAJIKISTAN TZ TANZANIA, UNITED REPUBLIC OF TH THAILAND TL TIMOR-LESTE TG TOGO TK TOKELAU TO TONGA TT TRINIDAD AND TOBAGO TN TUNISIA TR TURKEY TM TURKMENISTAN TC TURKS AND CAICOS ISLANDS TV TUVALU UG UGANDA UA UKRAINE AE UNITED ARAB EMIRATES US UNITED STATES UM UNITED STATES MINOR OUTLYING ISLANDS UY URUGUAY UZ UZBEKISTAN VU VANUATU VE VENEZUELA, BOLIVARIAN REPUBLIC OF VN VIET NAM VG VIRGIN ISLANDS, BRITISH VI VIRGIN ISLANDS, U.S. 
WF WALLIS AND FUTUNA EH WESTERN SAHARA YE YEMEN ZM ZAMBIA ZW ZIMBABWE 2022-11-07T14:43:56Z application/zip text/csv https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 eng eng University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter. (2022). The Edinburgh International Accents of English Corpus, [dataset]. University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research. https://doi.org/10.7488/ds/3780. https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 CC-BY-SA conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language dataset 2022 ftuedinburgheds https://doi.org/10.7488/ds/3780 2023-07-09T20:29:26Z ## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the great many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets which fail to represent the diversity of English as spoken today around the globe. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first and second-language varieties of English and a linguistic background profile of each speaker. 
Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best-performing model, trained on 680 thousand hours of transcribed data, obtains an average of 19.7% WER -- in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA. README.txt Dataset Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard Edinburgh DataShare (University of Edinburgh) Argentina Bouvet ENVELOPE(3.358,3.358,-54.422,-54.422) Bouvet Island ENVELOPE(3.358,3.358,-54.422,-54.422) Canada Faroe Islands Greenland Guernsey ENVELOPE(-68.267,-68.267,-69.317,-69.317) Heard ENVELOPE(73.510,73.510,-53.117,-53.117) Heard Island Heard Island ENVELOPE(73.510,73.510,-53.117,-53.117) Indian Jan Mayen McDonald Islands ENVELOPE(72.600,72.600,-53.033,-53.033) New Zealand Norway Saba ENVELOPE(149.417,149.417,66.617,66.617) Saint-Vincent ENVELOPE(-59.515,-59.515,50.600,50.600) Sandwich Islands South Georgia ENVELOPE(-33.000,-33.000,-56.000,-56.000) South Sandwich Islands Svalbard Svalbard ENVELOPE(20.000,20.000,78.000,78.000) Tonga ENVELOPE(7.990,7.990,63.065,63.065) Trinidad ENVELOPE(-60.734,-60.734,-63.816,-63.816) Tristan ENVELOPE(140.900,140.900,-66.735,-66.735) Uruguay |