SUPERSEDED - The Edinburgh International Accents of English Corpus

## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a...

Bibliographic Details
Main Authors:	Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter
Other Authors:	University of Edinburgh
Format:	Dataset
Language:	English
Published:	University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research 2022
Subjects:	conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language Argentina Bouvet Bouvet Island Canada Faroe Islands Greenland Guernsey Heard Heard Island Indian Jan Mayen McDonald Islands New Zealand Norway Saba Saint-Vincent Sandwich Islands South Georgia South Sandwich Islands Svalbard Tonga Trinidad Tristan Uruguay Antarc* Antarctica Iceland
Online Access:	https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780

id	ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766
record_format	openpolar
institution	Open Polar
collection	Edinburgh DataShare (University of Edinburgh)
op_collection_id	ftuedinburgheds
language	English
topic	conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language
spellingShingle	conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter SUPERSEDED - The Edinburgh International Accents of English Corpus
topic_facet	conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language
description	## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken around the globe today. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best performing model, trained on 680 thousand hours of transcribed data, obtains an average word error rate (WER) of 19.7%, in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA. README.txt
author2	University of Edinburgh Sanabria, Ramon
format	Dataset
author	Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter
author_facet	Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter
author_sort	Sanabria, Ramon
title	SUPERSEDED - The Edinburgh International Accents of English Corpus
title_short	SUPERSEDED - The Edinburgh International Accents of English Corpus
title_full	SUPERSEDED - The Edinburgh International Accents of English Corpus
title_fullStr	SUPERSEDED - The Edinburgh International Accents of English Corpus
title_full_unstemmed	SUPERSEDED - The Edinburgh International Accents of English Corpus
title_sort	superseded - the edinburgh international accents of english corpus
publisher	University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research
publishDate	2022
url	https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780
op_coverage	Global UK UNITED KINGDOM AF AFGHANISTAN AX ÅLAND ISLANDS AL ALBANIA DZ ALGERIA AS AMERICAN SAMOA AD ANDORRA AO ANGOLA AI ANGUILLA AQ ANTARCTICA AG ANTIGUA AND BARBUDA AR ARGENTINA AM ARMENIA AW ARUBA AU AUSTRALIA AT AUSTRIA AZ AZERBAIJAN BS BAHAMAS BH BAHRAIN BD BANGLADESH BB BARBADOS BY BELARUS BE BELGIUM BZ BELIZE BJ BENIN BM BERMUDA BT BHUTAN BO BOLIVIA, PLURINATIONAL STATE OF BQ BONAIRE, SINT EUSTATIUS AND SABA BA BOSNIA AND HERZEGOVINA BW BOTSWANA BV BOUVET ISLAND BR BRAZIL IO BRITISH INDIAN OCEAN TERRITORY BN BRUNEI DARUSSALAM BG BULGARIA BF BURKINA FASO BI BURUNDI KH CAMBODIA CM CAMEROON CA CANADA CV CAPE VERDE KY CAYMAN ISLANDS CF CENTRAL AFRICAN REPUBLIC TD CHAD CL CHILE CN CHINA CX CHRISTMAS ISLAND CC COCOS (KEELING) ISLANDS CO COLOMBIA KM COMOROS CG CONGO CD CONGO, THE DEMOCRATIC REPUBLIC OF THE CK COOK ISLANDS CR COSTA RICA CI CÔTE D'IVOIRE HR CROATIA CU CUBA CW CURAÇAO CY CYPRUS CZ CZECHIA DK DENMARK DJ DJIBOUTI DM DOMINICA DO DOMINICAN REPUBLIC EC ECUADOR EG EGYPT SV EL SALVADOR GQ EQUATORIAL GUINEA ER ERITREA EE ESTONIA SZ ESWATINI ET ETHIOPIA FK FALKLAND ISLANDS (MALVINAS) FO FAROE ISLANDS FJ FIJI FI FINLAND FR FRANCE GF FRENCH GUIANA PF FRENCH POLYNESIA TF FRENCH SOUTHERN TERRITORIES GA GABON GM GAMBIA GE GEORGIA DE GERMANY GH GHANA GI GIBRALTAR GR GREECE GL GREENLAND GD GRENADA GP GUADELOUPE GU GUAM GT GUATEMALA GG GUERNSEY GN GUINEA GW GUINEA-BISSAU GY GUYANA HT HAITI HM HEARD ISLAND AND MCDONALD ISLANDS VA HOLY SEE HN HONDURAS HK HONG KONG HU HUNGARY IS ICELAND IN INDIA ID INDONESIA IR IRAN, ISLAMIC REPUBLIC OF IQ IRAQ IE IRELAND IM ISLE OF MAN IL ISRAEL IT ITALY JM JAMAICA JP JAPAN JE JERSEY JO JORDAN KZ KAZAKHSTAN KE KENYA KI KIRIBATI KP KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF KR KOREA, REPUBLIC OF KW KUWAIT KG KYRGYZSTAN LA LAO PEOPLE'S DEMOCRATIC REPUBLIC LV LATVIA LB LEBANON LS LESOTHO LR LIBERIA LY LIBYA LI LIECHTENSTEIN LT LITHUANIA LU LUXEMBOURG MO MACAO MG MADAGASCAR MW MALAWI MY MALAYSIA MV MALDIVES ML MALI MT MALTA MH 
MARSHALL ISLANDS MQ MARTINIQUE MR MAURITANIA MU MAURITIUS YT MAYOTTE MX MEXICO FM MICRONESIA, FEDERATED STATES OF MD MOLDOVA, REPUBLIC OF MC MONACO MN MONGOLIA ME MONTENEGRO MS MONTSERRAT MA MOROCCO MZ MOZAMBIQUE MM MYANMAR NA NAMIBIA NR NAURU NP NEPAL NL NETHERLANDS NC NEW CALEDONIA NZ NEW ZEALAND NI NICARAGUA NE NIGER NG NIGERIA NU NIUE NF NORFOLK ISLAND MK NORTH MACEDONIA MP NORTHERN MARIANA ISLANDS NO NORWAY OM OMAN PK PAKISTAN PW PALAU PS PALESTINE, STATE OF PA PANAMA PG PAPUA NEW GUINEA PY PARAGUAY PE PERU PH PHILIPPINES PN PITCAIRN PL POLAND PT PORTUGAL PR PUERTO RICO QA QATAR RE RÉUNION RO ROMANIA RU RUSSIAN FEDERATION RW RWANDA BL SAINT BARTHÉLEMY SH SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA KN SAINT KITTS AND NEVIS LC SAINT LUCIA MF SAINT MARTIN (FRENCH PART) PM SAINT PIERRE AND MIQUELON VC SAINT VINCENT AND THE GRENADINES WS SAMOA SM SAN MARINO ST SAO TOME AND PRINCIPE SA SAUDI ARABIA SN SENEGAL RS SERBIA SC SEYCHELLES SL SIERRA LEONE SG SINGAPORE SX SINT MAARTEN (DUTCH PART) SK SLOVAKIA SI SLOVENIA SB SOLOMON ISLANDS SO SOMALIA ZA SOUTH AFRICA GS SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS SS SOUTH SUDAN ES SPAIN LK SRI LANKA SD SUDAN SR SURINAME SJ SVALBARD AND JAN MAYEN SE SWEDEN CH SWITZERLAND SY SYRIAN ARAB REPUBLIC TW TAIWAN, PROVINCE OF CHINA TJ TAJIKISTAN TZ TANZANIA, UNITED REPUBLIC OF TH THAILAND TL TIMOR-LESTE TG TOGO TK TOKELAU TO TONGA TT TRINIDAD AND TOBAGO TN TUNISIA TR TURKEY TM TURKMENISTAN TC TURKS AND CAICOS ISLANDS TV TUVALU UG UGANDA UA UKRAINE AE UNITED ARAB EMIRATES US UNITED STATES UM UNITED STATES MINOR OUTLYING ISLANDS UY URUGUAY UZ UZBEKISTAN VU VANUATU VE VENEZUELA, BOLIVARIAN REPUBLIC OF VN VIET NAM VG VIRGIN ISLANDS, BRITISH VI VIRGIN ISLANDS, U.S. WF WALLIS AND FUTUNA EH WESTERN SAHARA YE YEMEN ZM ZAMBIA ZW ZIMBABWE
long_lat	ENVELOPE(3.358,3.358,-54.422,-54.422) ENVELOPE(3.358,3.358,-54.422,-54.422) ENVELOPE(-68.267,-68.267,-69.317,-69.317) ENVELOPE(73.510,73.510,-53.117,-53.117) ENVELOPE(73.510,73.510,-53.117,-53.117) ENVELOPE(72.600,72.600,-53.033,-53.033) ENVELOPE(149.417,149.417,66.617,66.617) ENVELOPE(-59.515,-59.515,50.600,50.600) ENVELOPE(-33.000,-33.000,-56.000,-56.000) ENVELOPE(20.000,20.000,78.000,78.000) ENVELOPE(7.990,7.990,63.065,63.065) ENVELOPE(-60.734,-60.734,-63.816,-63.816) ENVELOPE(140.900,140.900,-66.735,-66.735)
geographic	Argentina Bouvet Bouvet Island Canada Faroe Islands Greenland Guernsey Heard Heard Island Heard Island Indian Jan Mayen McDonald Islands New Zealand Norway Saba Saint-Vincent Sandwich Islands South Georgia South Sandwich Islands Svalbard Svalbard Tonga Trinidad Tristan Uruguay
geographic_facet	Argentina Bouvet Bouvet Island Canada Faroe Islands Greenland Guernsey Heard Heard Island Heard Island Indian Jan Mayen McDonald Islands New Zealand Norway Saba Saint-Vincent Sandwich Islands South Georgia South Sandwich Islands Svalbard Svalbard Tonga Trinidad Tristan Uruguay
genre	Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard
genre_facet	Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard
op_relation	Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter. (2022). The Edinburgh International Accents of English Corpus, [dataset]. University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research. https://doi.org/10.7488/ds/3780. https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780
op_rights	CC-BY-SA
op_doi	https://doi.org/10.7488/ds/3780
_version_	1772810197847244800
spelling	ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766 2023-07-30T03:59:24+02:00 SUPERSEDED - The Edinburgh International Accents of English Corpus Sanabria, Ramon Nikolay, Bogoychev Nina, Markl Carmantini, Andrea Klejch, Ondrej Bell, Peter University of Edinburgh Sanabria, Ramon Global UK UNITED KINGDOM AF AFGHANISTAN AX ÅLAND ISLANDS AL ALBANIA DZ ALGERIA AS AMERICAN SAMOA AD ANDORRA AO ANGOLA AI ANGUILLA AQ ANTARCTICA AG ANTIGUA AND BARBUDA AR ARGENTINA AM ARMENIA AW ARUBA AU AUSTRALIA AT AUSTRIA AZ AZERBAIJAN BS BAHAMAS BH BAHRAIN BD BANGLADESH BB BARBADOS BY BELARUS BE BELGIUM BZ BELIZE BJ BENIN BM BERMUDA BT BHUTAN BO BOLIVIA, PLURINATIONAL STATE OF BQ BONAIRE, SINT EUSTATIUS AND SABA BA BOSNIA AND HERZEGOVINA BW BOTSWANA BV BOUVET ISLAND BR BRAZIL IO BRITISH INDIAN OCEAN TERRITORY BN BRUNEI DARUSSALAM BG BULGARIA BF BURKINA FASO BI BURUNDI KH CAMBODIA CM CAMEROON CA CANADA CV CAPE VERDE KY CAYMAN ISLANDS CF CENTRAL AFRICAN REPUBLIC TD CHAD CL CHILE CN CHINA CX CHRISTMAS ISLAND CC COCOS (KEELING) ISLANDS CO COLOMBIA KM COMOROS CG CONGO CD CONGO, THE DEMOCRATIC REPUBLIC OF THE CK COOK ISLANDS CR COSTA RICA CI CÔTE D'IVOIRE HR CROATIA CU CUBA CW CURAÇAO CY CYPRUS CZ CZECHIA DK DENMARK DJ DJIBOUTI DM DOMINICA DO DOMINICAN REPUBLIC EC ECUADOR EG EGYPT SV EL SALVADOR GQ EQUATORIAL GUINEA ER ERITREA EE ESTONIA SZ ESWATINI ET ETHIOPIA FK FALKLAND ISLANDS (MALVINAS) FO FAROE ISLANDS FJ FIJI FI FINLAND FR FRANCE GF FRENCH GUIANA PF FRENCH POLYNESIA TF FRENCH SOUTHERN TERRITORIES GA GABON GM GAMBIA GE GEORGIA DE GERMANY GH GHANA GI GIBRALTAR GR GREECE GL GREENLAND GD GRENADA GP GUADELOUPE GU GUAM GT GUATEMALA GG GUERNSEY GN GUINEA GW GUINEA-BISSAU GY GUYANA HT HAITI HM HEARD ISLAND AND MCDONALD ISLANDS VA HOLY SEE HN HONDURAS HK HONG KONG HU HUNGARY IS ICELAND IN INDIA ID INDONESIA IR IRAN, ISLAMIC REPUBLIC OF IQ IRAQ IE IRELAND IM ISLE OF MAN IL ISRAEL IT ITALY JM JAMAICA JP JAPAN JE JERSEY JO JORDAN KZ KAZAKHSTAN KE KENYA KI KIRIBATI KP KOREA, DEMOCRATIC PEOPLE'S 
REPUBLIC OF KR KOREA, REPUBLIC OF KW KUWAIT KG KYRGYZSTAN LA LAO PEOPLE'S DEMOCRATIC REPUBLIC LV LATVIA LB LEBANON LS LESOTHO LR LIBERIA LY LIBYA LI LIECHTENSTEIN LT LITHUANIA LU LUXEMBOURG MO MACAO MG MADAGASCAR MW MALAWI MY MALAYSIA MV MALDIVES ML MALI MT MALTA MH MARSHALL ISLANDS MQ MARTINIQUE MR MAURITANIA MU MAURITIUS YT MAYOTTE MX MEXICO FM MICRONESIA, FEDERATED STATES OF MD MOLDOVA, REPUBLIC OF MC MONACO MN MONGOLIA ME MONTENEGRO MS MONTSERRAT MA MOROCCO MZ MOZAMBIQUE MM MYANMAR NA NAMIBIA NR NAURU NP NEPAL NL NETHERLANDS NC NEW CALEDONIA NZ NEW ZEALAND NI NICARAGUA NE NIGER NG NIGERIA NU NIUE NF NORFOLK ISLAND MK NORTH MACEDONIA MP NORTHERN MARIANA ISLANDS NO NORWAY OM OMAN PK PAKISTAN PW PALAU PS PALESTINE, STATE OF PA PANAMA PG PAPUA NEW GUINEA PY PARAGUAY PE PERU PH PHILIPPINES PN PITCAIRN PL POLAND PT PORTUGAL PR PUERTO RICO QA QATAR RE RÉUNION RO ROMANIA RU RUSSIAN FEDERATION RW RWANDA BL SAINT BARTHÉLEMY SH SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA KN SAINT KITTS AND NEVIS LC SAINT LUCIA MF SAINT MARTIN (FRENCH PART) PM SAINT PIERRE AND MIQUELON VC SAINT VINCENT AND THE GRENADINES WS SAMOA SM SAN MARINO ST SAO TOME AND PRINCIPE SA SAUDI ARABIA SN SENEGAL RS SERBIA SC SEYCHELLES SL SIERRA LEONE SG SINGAPORE SX SINT MAARTEN (DUTCH PART) SK SLOVAKIA SI SLOVENIA SB SOLOMON ISLANDS SO SOMALIA ZA SOUTH AFRICA GS SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS SS SOUTH SUDAN ES SPAIN LK SRI LANKA SD SUDAN SR SURINAME SJ SVALBARD AND JAN MAYEN SE SWEDEN CH SWITZERLAND SY SYRIAN ARAB REPUBLIC TW TAIWAN, PROVINCE OF CHINA TJ TAJIKISTAN TZ TANZANIA, UNITED REPUBLIC OF TH THAILAND TL TIMOR-LESTE TG TOGO TK TOKELAU TO TONGA TT TRINIDAD AND TOBAGO TN TUNISIA TR TURKEY TM TURKMENISTAN TC TURKS AND CAICOS ISLANDS TV TUVALU UG UGANDA UA UKRAINE AE UNITED ARAB EMIRATES US UNITED STATES UM UNITED STATES MINOR OUTLYING ISLANDS UY URUGUAY UZ UZBEKISTAN VU VANUATU VE VENEZUELA, BOLIVARIAN REPUBLIC OF VN VIET NAM VG VIRGIN ISLANDS, BRITISH VI VIRGIN ISLANDS, U.S. 
WF WALLIS AND FUTUNA EH WESTERN SAHARA YE YEMEN ZM ZAMBIA ZW ZIMBABWE 2022-11-07T14:43:56Z application/zip text/csv https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 eng eng University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter. (2022). The Edinburgh International Accents of English Corpus, [dataset]. University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research. https://doi.org/10.7488/ds/3780. https://hdl.handle.net/10283/4766 https://doi.org/10.7488/ds/3780 CC-BY-SA conversational speech bias in speech recognition English accents Linguistics Linguistics Classics and related subjects::English as a second language dataset 2022 ftuedinburgheds https://doi.org/10.7488/ds/3780 2023-07-09T20:29:26Z ## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken around the globe today. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first and second-language varieties of English and a linguistic background profile of each speaker. 
Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best performing model, trained on 680 thousand hours of transcribed data, obtains an average word error rate (WER) of 19.7%, in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA. README.txt Dataset Antarc* Antarctica Bouvet Island Faroe Islands Greenland Heard Island Iceland Jan Mayen McDonald Islands South Sandwich Islands Svalbard Edinburgh DataShare (University of Edinburgh) Argentina Bouvet ENVELOPE(3.358,3.358,-54.422,-54.422) Bouvet Island ENVELOPE(3.358,3.358,-54.422,-54.422) Canada Faroe Islands Greenland Guernsey ENVELOPE(-68.267,-68.267,-69.317,-69.317) Heard ENVELOPE(73.510,73.510,-53.117,-53.117) Heard Island Heard Island ENVELOPE(73.510,73.510,-53.117,-53.117) Indian Jan Mayen McDonald Islands ENVELOPE(72.600,72.600,-53.033,-53.033) New Zealand Norway Saba ENVELOPE(149.417,149.417,66.617,66.617) Saint-Vincent ENVELOPE(-59.515,-59.515,50.600,50.600) Sandwich Islands South Georgia ENVELOPE(-33.000,-33.000,-56.000,-56.000) South Sandwich Islands Svalbard Svalbard ENVELOPE(20.000,20.000,78.000,78.000) Tonga ENVELOPE(7.990,7.990,63.065,63.065) Trinidad ENVELOPE(-60.734,-60.734,-63.816,-63.816) Tristan ENVELOPE(140.900,140.900,-66.735,-66.735) Uruguay

Similar Items