SUPERSEDED - The Edinburgh International Accents of English Corpus

## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a...


Bibliographic Details
Main Authors: Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter
Other Authors: University of Edinburgh
Format: Dataset
Language: English
Published: University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research 2022
Subjects:
Online Access: https://hdl.handle.net/10283/4766
https://doi.org/10.7488/ds/3780
id ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766
record_format openpolar
institution Open Polar
collection Edinburgh DataShare (University of Edinburgh)
op_collection_id ftuedinburgheds
language English
topic conversational speech
bias in speech recognition
English accents
Linguistics
Linguistics Classics and related subjects::English as a second language
description ## This item has been replaced by the one which can be found at https://datashare.ed.ac.uk/handle/10283/4836 - https://doi.org/10.7488/ds/3832 ##. English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken around the globe today. We present the first release of The Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best-performing model, trained on 680 thousand hours of transcribed data, obtains an average WER of 19.7%, in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Jamaican, Indonesian, Nigerian, and Kenyan English speakers. Recordings, linguistic backgrounds, data statement, and evaluation scripts are released on our website under CC-BY-SA. README.txt
author2 University of Edinburgh
Sanabria, Ramon
format Dataset
author Sanabria, Ramon
Nikolay, Bogoychev
Nina, Markl
Carmantini, Andrea
Klejch, Ondrej
Bell, Peter
author_sort Sanabria, Ramon
title SUPERSEDED - The Edinburgh International Accents of English Corpus
publisher University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research
publishDate 2022
url https://hdl.handle.net/10283/4766
https://doi.org/10.7488/ds/3780
op_coverage Global
UK
UNITED KINGDOM
AF
AFGHANISTAN
AX
ÅLAND ISLANDS
AL
ALBANIA
DZ
ALGERIA
AS
AMERICAN SAMOA
AD
ANDORRA
AO
ANGOLA
AI
ANGUILLA
AQ
ANTARCTICA
AG
ANTIGUA AND BARBUDA
AR
ARGENTINA
AM
ARMENIA
AW
ARUBA
AU
AUSTRALIA
AT
AUSTRIA
AZ
AZERBAIJAN
BS
BAHAMAS
BH
BAHRAIN
BD
BANGLADESH
BB
BARBADOS
BY
BELARUS
BE
BELGIUM
BZ
BELIZE
BJ
BENIN
BM
BERMUDA
BT
BHUTAN
BO
BOLIVIA, PLURINATIONAL STATE OF
BQ
BONAIRE, SINT EUSTATIUS AND SABA
BA
BOSNIA AND HERZEGOVINA
BW
BOTSWANA
BV
BOUVET ISLAND
BR
BRAZIL
IO
BRITISH INDIAN OCEAN TERRITORY
BN
BRUNEI DARUSSALAM
BG
BULGARIA
BF
BURKINA FASO
BI
BURUNDI
KH
CAMBODIA
CM
CAMEROON
CA
CANADA
CV
CAPE VERDE
KY
CAYMAN ISLANDS
CF
CENTRAL AFRICAN REPUBLIC
TD
CHAD
CL
CHILE
CN
CHINA
CX
CHRISTMAS ISLAND
CC
COCOS (KEELING) ISLANDS
CO
COLOMBIA
KM
COMOROS
CG
CONGO
CD
CONGO, THE DEMOCRATIC REPUBLIC OF THE
CK
COOK ISLANDS
CR
COSTA RICA
CI
CÔTE D'IVOIRE
HR
CROATIA
CU
CUBA
CW
CURAÇAO
CY
CYPRUS
CZ
CZECHIA
DK
DENMARK
DJ
DJIBOUTI
DM
DOMINICA
DO
DOMINICAN REPUBLIC
EC
ECUADOR
EG
EGYPT
SV
EL SALVADOR
GQ
EQUATORIAL GUINEA
ER
ERITREA
EE
ESTONIA
SZ
ESWATINI
ET
ETHIOPIA
FK
FALKLAND ISLANDS (MALVINAS)
FO
FAROE ISLANDS
FJ
FIJI
FI
FINLAND
FR
FRANCE
GF
FRENCH GUIANA
PF
FRENCH POLYNESIA
TF
FRENCH SOUTHERN TERRITORIES
GA
GABON
GM
GAMBIA
GE
GEORGIA
DE
GERMANY
GH
GHANA
GI
GIBRALTAR
GR
GREECE
GL
GREENLAND
GD
GRENADA
GP
GUADELOUPE
GU
GUAM
GT
GUATEMALA
GG
GUERNSEY
GN
GUINEA
GW
GUINEA-BISSAU
GY
GUYANA
HT
HAITI
HM
HEARD ISLAND AND MCDONALD ISLANDS
VA
HOLY SEE
HN
HONDURAS
HK
HONG KONG
HU
HUNGARY
IS
ICELAND
IN
INDIA
ID
INDONESIA
IR
IRAN, ISLAMIC REPUBLIC OF
IQ
IRAQ
IE
IRELAND
IM
ISLE OF MAN
IL
ISRAEL
IT
ITALY
JM
JAMAICA
JP
JAPAN
JE
JERSEY
JO
JORDAN
KZ
KAZAKHSTAN
KE
KENYA
KI
KIRIBATI
KP
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF
KR
KOREA, REPUBLIC OF
KW
KUWAIT
KG
KYRGYZSTAN
LA
LAO PEOPLE'S DEMOCRATIC REPUBLIC
LV
LATVIA
LB
LEBANON
LS
LESOTHO
LR
LIBERIA
LY
LIBYA
LI
LIECHTENSTEIN
LT
LITHUANIA
LU
LUXEMBOURG
MO
MACAO
MG
MADAGASCAR
MW
MALAWI
MY
MALAYSIA
MV
MALDIVES
ML
MALI
MT
MALTA
MH
MARSHALL ISLANDS
MQ
MARTINIQUE
MR
MAURITANIA
MU
MAURITIUS
YT
MAYOTTE
MX
MEXICO
FM
MICRONESIA, FEDERATED STATES OF
MD
MOLDOVA, REPUBLIC OF
MC
MONACO
MN
MONGOLIA
ME
MONTENEGRO
MS
MONTSERRAT
MA
MOROCCO
MZ
MOZAMBIQUE
MM
MYANMAR
NA
NAMIBIA
NR
NAURU
NP
NEPAL
NL
NETHERLANDS
NC
NEW CALEDONIA
NZ
NEW ZEALAND
NI
NICARAGUA
NE
NIGER
NG
NIGERIA
NU
NIUE
NF
NORFOLK ISLAND
MK
NORTH MACEDONIA
MP
NORTHERN MARIANA ISLANDS
NO
NORWAY
OM
OMAN
PK
PAKISTAN
PW
PALAU
PS
PALESTINE, STATE OF
PA
PANAMA
PG
PAPUA NEW GUINEA
PY
PARAGUAY
PE
PERU
PH
PHILIPPINES
PN
PITCAIRN
PL
POLAND
PT
PORTUGAL
PR
PUERTO RICO
QA
QATAR
RE
RÉUNION
RO
ROMANIA
RU
RUSSIAN FEDERATION
RW
RWANDA
BL
SAINT BARTHÉLEMY
SH
SAINT HELENA, ASCENSION AND TRISTAN DA CUNHA
KN
SAINT KITTS AND NEVIS
LC
SAINT LUCIA
MF
SAINT MARTIN (FRENCH PART)
PM
SAINT PIERRE AND MIQUELON
VC
SAINT VINCENT AND THE GRENADINES
WS
SAMOA
SM
SAN MARINO
ST
SAO TOME AND PRINCIPE
SA
SAUDI ARABIA
SN
SENEGAL
RS
SERBIA
SC
SEYCHELLES
SL
SIERRA LEONE
SG
SINGAPORE
SX
SINT MAARTEN (DUTCH PART)
SK
SLOVAKIA
SI
SLOVENIA
SB
SOLOMON ISLANDS
SO
SOMALIA
ZA
SOUTH AFRICA
GS
SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS
SS
SOUTH SUDAN
ES
SPAIN
LK
SRI LANKA
SD
SUDAN
SR
SURINAME
SJ
SVALBARD AND JAN MAYEN
SE
SWEDEN
CH
SWITZERLAND
SY
SYRIAN ARAB REPUBLIC
TW
TAIWAN, PROVINCE OF CHINA
TJ
TAJIKISTAN
TZ
TANZANIA, UNITED REPUBLIC OF
TH
THAILAND
TL
TIMOR-LESTE
TG
TOGO
TK
TOKELAU
TO
TONGA
TT
TRINIDAD AND TOBAGO
TN
TUNISIA
TR
TURKEY
TM
TURKMENISTAN
TC
TURKS AND CAICOS ISLANDS
TV
TUVALU
UG
UGANDA
UA
UKRAINE
AE
UNITED ARAB EMIRATES
US
UNITED STATES
UM
UNITED STATES MINOR OUTLYING ISLANDS
UY
URUGUAY
UZ
UZBEKISTAN
VU
VANUATU
VE
VENEZUELA, BOLIVARIAN REPUBLIC OF
VN
VIET NAM
VG
VIRGIN ISLANDS, BRITISH
VI
VIRGIN ISLANDS, U.S.
WF
WALLIS AND FUTUNA
EH
WESTERN SAHARA
YE
YEMEN
ZM
ZAMBIA
ZW
ZIMBABWE
long_lat ENVELOPE(3.358,3.358,-54.422,-54.422)
ENVELOPE(3.358,3.358,-54.422,-54.422)
ENVELOPE(-68.267,-68.267,-69.317,-69.317)
ENVELOPE(73.510,73.510,-53.117,-53.117)
ENVELOPE(73.510,73.510,-53.117,-53.117)
ENVELOPE(72.600,72.600,-53.033,-53.033)
ENVELOPE(149.417,149.417,66.617,66.617)
ENVELOPE(-59.515,-59.515,50.600,50.600)
ENVELOPE(-33.000,-33.000,-56.000,-56.000)
ENVELOPE(20.000,20.000,78.000,78.000)
ENVELOPE(7.990,7.990,63.065,63.065)
ENVELOPE(-60.734,-60.734,-63.816,-63.816)
ENVELOPE(140.900,140.900,-66.735,-66.735)
geographic Argentina
Bouvet
Bouvet Island
Canada
Faroe Islands
Greenland
Guernsey
Heard
Heard Island
Heard Island
Indian
Jan Mayen
McDonald Islands
New Zealand
Norway
Saba
Saint-Vincent
Sandwich Islands
South Georgia
South Sandwich Islands
Svalbard
Svalbard
Tonga
Trinidad
Tristan
Uruguay
genre Antarc*
Antarctica
Bouvet Island
Faroe Islands
Greenland
Heard Island
Iceland
Jan Mayen
McDonald Islands
South Sandwich Islands
Svalbard
op_relation Sanabria, Ramon; Nikolay, Bogoychev; Nina, Markl; Carmantini, Andrea; Klejch, Ondrej; Bell, Peter. (2022). The Edinburgh International Accents of English Corpus, [dataset]. University of Edinburgh. School of Informatics. The Institute for Language, Cognition and Computation. The Centre for Speech Technology Research. https://doi.org/10.7488/ds/3780.
https://hdl.handle.net/10283/4766
https://doi.org/10.7488/ds/3780
op_rights CC-BY-SA
op_doi https://doi.org/10.7488/ds/3780
_version_ 1772810197847244800
spelling ftuedinburgheds:oai:datashare.ed.ac.uk:10283/4766 2023-07-30T03:59:24+02:00 2022-11-07T14:43:56Z application/zip text/csv eng 2023-07-09T20:29:26Z