HornMT – Machine Translation Benchmark Dataset for Languages in the Horn of Africa

Bibliographic Details
Main Authors: Asmelash Teka Hadgu, Gebrekirstos G. Gebremeskel, Abel Aregawi
Format: Other/Unknown Material
Language: English
Published: Zenodo 2022
Online Access: https://doi.org/10.5281/zenodo.6369442
Description
Summary: The HornMT repository contains data and the associated metadata for the project Machine Translation Benchmark Dataset for Languages in the Horn of Africa. It is a multi-way parallel corpus intended to serve as a benchmark to accelerate progress in machine translation research and production systems for languages in the Horn of Africa.

Supported Languages

Language    ISO 639-3 code
Afar        aar
Amharic     amh
English     eng
Oromo       orm
Somali      som
Tigrinya    tir

The JSON example row uses the key aaf for Afar, whereas the file name and the ISO 639-3 code are aar.

data/ contains one text file per language, and each file contains news snippets in the same order for each language.

data
├── aar.txt
├── amh.txt
├── eng.txt
├── orm.txt
├── som.txt
└── tir.txt

metadata.tsv contains tab-separated data describing each news snippet. The metadata contains the following fields:

Scope - Describes whether the news is global or local. It takes two values: Global news and Local news.
Category - News category, covering the following 12 topics: Art and Culture, Business and Economy, Conflicts and Attacks, Disaster and Accidents, Entertainment, Environment, Health, International Relations, Law and Crime, Politics, Science and Technology, Sport.
Source - List of one or more URLs from which the news content is extracted or on which it is based.
Domain - TLD corresponding to the URL(s) in Source.
Date - The publication date of the source article. The format is yyyy-mm-dd.

Other formats

All the data and associated metadata together in one file are also available in other file formats.

HornMT.xlsx - data and associated metadata in xlsx format.
HornMT.json - data and associated metadata in json format.

Below is an example row.
<code class="language-javascript">{ "data":{ "eng":"The World Meteorological Organisation reports that the ozone layer is damaged to its worst extent ever in the Arctic.", "aaf":"Baad Metrolojih Eglali Areketekeh Addal Ozonih qelu faxe waktik lafetle calat biyakisem xayose.", "amh":"የአለም የአየር ንብረት ድርጅት በአርክቲክ አካባቢ ያለው የኦዞን ምንጣፍ ከፍተኛ ጉዳት እንደደረሰበት አስታወቀ፡፡", "orm":"Dhaabbanni Meetiroolojii Addunyaa baqqaanni oozonii Arkiitik keessatti gara sadarkaa isa hamaa ...