A citizen science mediated Optical Character Recognition (OCR) module for large-scale data rescue


Bibliographic Details
Main Author: Mahoney, Andy
Format: Dataset
Language: English
Published: Axiom Data Science 2020
Subjects:
Online Access: https://dx.doi.org/10.24431/rw1k479
https://search.dataone.org/#view/10.24431/rw1k479
Description
Summary: Access to retrospective data is essential for understanding environmental variability and change, for initializing and validating models of all kinds, and for illuminating the relationships between ecosystems and the societies that depend on them. A major barrier to the effective use of historical data in any discipline is the need to transform large quantities of manuscript or printed text, especially complex data tables, into formats that computers can collate and analyze. However, no current Optical Character Recognition (OCR) engine can convert scanned document images into digital text accurately enough to make human intervention unnecessary. This is especially true of scientific data presented in tables or other matrix formats. The goal of this project was to build an open-source, citizen-science-mediated OCR module to facilitate transcription of complex data tables and other typescript or printed material (e.g., Arctic and worldwide weather observations recorded in ships' logs) and to integrate it into the Zooniverse transcription software bundle. This module will be available to the public via Zooniverse: https://www.zooniverse.org/projects/zooniverse/oldweather-ocr.