HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa

The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems fo...


Bibliographic Details
Main Authors: Hadgu, Asmelash Teka, Gebremeskel, Gebrekirstos G., Aregawi, Abel
Format: Dataset
Language: English
Published: Zenodo 2022
Subjects:
Online Access: https://dx.doi.org/10.5281/zenodo.6369442
https://zenodo.org/record/6369442
id ftdatacite:10.5281/zenodo.6369442
record_format openpolar
institution Open Polar
collection DataCite Metadata Store (German National Library of Science and Technology)
op_collection_id ftdatacite
language English
topic Machine Translation
Parallel Corpus
Horn of Africa
Ethiopia
Natural Language Processing
Afar
Amharic
Oromo
Somali
Tigrinya
spellingShingle Machine Translation
Parallel Corpus
Horn of Africa
Ethiopia
Natural Language Processing
Afar
Amharic
Oromo
Somali
Tigrinya
Hadgu, Asmelash Teka
Gebremeskel, Gebrekirstos G.
Aregawi, Abel
HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
topic_facet Machine Translation
Parallel Corpus
Horn of Africa
Ethiopia
Natural Language Processing
Afar
Amharic
Oromo
Somali
Tigrinya
description The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus that will serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa.

Supported Languages

Language   ISO 639-3 code
Afar       aar
Amharic    amh
English    eng
Oromo      orm
Somali     som
Tigrinya   tir

data/ contains one text file per language, and each file contains the news snippets in the same order for every language.

data
├── aar.txt
├── amh.txt
├── eng.txt
├── orm.txt
├── som.txt
└── tir.txt

metadata.tsv contains tab-separated data describing each news snippet. The metadata contains the following fields.

Scope - describes whether the news is global or local. It takes two values: Global news and Local news.
Category - news category, covering the following 12 topics:
    Art and Culture
    Business and Economy
    Conflicts and Attacks
    Disaster and Accidents
    Entertainment
    Environment
    Health
    International Relations
    Law and Crime
    Politics
    Science and Technology
    Sport
Source - list of one or more URLs from which the news content is extracted or on which it is based.
Domain - domain (host name) corresponding to the URL(s) in Source.
Date - the publication date of the source article. The format is yyyy-mm-dd.

Other formats

The data and associated metadata are also available together in a single file in other formats.

HornMT.xlsx - data and associated metadata in xlsx format.
HornMT.json - data and associated metadata in json format. Below is an example row.

<code class="language-javascript">
{
  "data": {
    "eng": "The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.",
    "aaf": "Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.",
    "amh": "የአለም የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡",
    "orm": "Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa haga ammaatti akka miidhame gabaase.",
    "som": "Ururka Saadaasha Hawada Adduunka ayaa ku warramaya in lakabka ozoneka ee Ka koreeya dhulka baraflayda uu waxyeelladii abid ugu darnaa soo gaadhay.",
    "tir": "ውድብ ሜትሮሎጂ ዓለም ኣብ ኣርክቲክ ዝርከብ ናሕሲ ኦዞን ኣዝዩ ብዝኸፍአ ደረጃ ከምዝተጎድአ ሓቢሩ፡፡"
  },
  "metadata": {
    "scope": "Global",
    "category": "Science and Technology",
    "source": "https://www.independent.co.uk/environment/climate-change/ozone-layer-damaged-by-unusually-harsh-winter-2263653.html",
    "domain": "www.independent.co.uk",
    "date": "2011-04-05"
  }
}
</code>

Team

Afar: Mohammed Deresa Yasin Nur
Amharic: Tigist Taye Selamawit Hailemariam Wako Tilahun
Oromo: Gemechis Melkamu Galata Girmaye
Somali: Abdiselam Mohamed Beshir Abdi
Tigrinya: Berhanu Abadi Weldegiorgis Michael Minassie Nureddin Mohammedshiek

Project Leaders

Asmelash Teka Hadgu asme@lesan.ai
Gebrekirstos G. Gebremeskel gebrekirstos.gebremeskel@ru.nl
Abel Aregawi abel@lesan.ai

License

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

This work was carried out with support from Lacuna Fund, an initiative co‐founded by The Rockefeller Foundation, Google.org, and Canada's International Development Research Centre.
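The layout described above (one text file per language under data/, snippets in the same order in every file, plus a tab-separated metadata.tsv) can be read with a short Python sketch such as the one below. It rests on assumptions the description does not confirm: that each snippet occupies exactly one line, that every file has the same number of lines, and that metadata.tsv begins with a header row. The function names load_parallel, load_metadata and pairs are hypothetical helpers, not part of the dataset.

```python
import csv
from pathlib import Path

# File stems as listed in the data/ directory tree of the description.
LANGS = ["aar", "amh", "eng", "orm", "som", "tir"]


def read_lines(path):
    """Return the lines of a UTF-8 text file without trailing newlines."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


def load_parallel(data_dir, langs=LANGS):
    """Zip the per-language files into one dict per snippet.

    Assumes one snippet per line and identical ordering across files.
    """
    columns = {lang: read_lines(Path(data_dir) / f"{lang}.txt") for lang in langs}
    lengths = {lang: len(lines) for lang, lines in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"files are not line-aligned: {lengths}")
    count = next(iter(lengths.values()))
    return [{lang: columns[lang][i] for lang in langs} for i in range(count)]


def load_metadata(tsv_path):
    """Read metadata.tsv into a list of dicts (assumes a header row)."""
    with open(tsv_path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f, delimiter="\t"))


def pairs(rows, src, tgt):
    """Extract (source, target) sentence pairs for one translation direction."""
    return [(row[src], row[tgt]) for row in rows]
```

With the archive unpacked locally, `pairs(load_parallel("data"), "eng", "tir")` would give English to Tigrinya sentence pairs. The length check is there because a multi-way corpus is only usable if line i refers to the same snippet in every file.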
format Dataset
author Hadgu, Asmelash Teka
Gebremeskel, Gebrekirstos G.
Aregawi, Abel
author_facet Hadgu, Asmelash Teka
Gebremeskel, Gebrekirstos G.
Aregawi, Abel
author_sort Hadgu, Asmelash Teka
title HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
title_short HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
title_full HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
title_fullStr HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
title_full_unstemmed HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa
title_sort hornmt – machine translation benchmark dataset for languages in the horn of africa
publisher Zenodo
publishDate 2022
url https://dx.doi.org/10.5281/zenodo.6369442
https://zenodo.org/record/6369442
geographic Arctic
geographic_facet Arctic
genre Arctic
Climate change
genre_facet Arctic
Climate change
op_relation https://dx.doi.org/10.5281/zenodo.6369441
op_rights Open Access
Creative Commons Attribution 4.0 International
https://creativecommons.org/licenses/by/4.0/legalcode
cc-by-4.0
info:eu-repo/semantics/openAccess
op_rightsnorm CC-BY
op_doi https://doi.org/10.5281/zenodo.6369442
https://doi.org/10.5281/zenodo.6369441
_version_ 1766349923186376704
spelling ftdatacite:10.5281/zenodo.6369442 2023-05-15T15:19:43+02:00 2022-04-01T17:40:28Z