Community-based corpus-building: Three case studies

Bibliographic Details
Main Authors: Rice, Sally, Thunder, Dorothy
Format: Text
Language: unknown
Published: 2017
Online Access: http://hdl.handle.net/10125/42052
Description
Summary: We describe three ongoing projects involving different First Peoples’ languages of Canada (Cree/nehiyawewin, Dene Sųłiné, and Nakoda/Stoney) that centre around the recording, transcription, compilation, and analysis of spontaneous oral language use (some narrative, some conversation) using freely available, Unicode-savvy corpus software (in this case, AntConc [Anthony 2014]) and little to no up-front annotation or translation into English. Because these languages are all polysynthetic, lemmatization and POS tagging are either unachievable or excessively time-draining and indeterminate activities. Nevertheless, corpus creation can still continue apace and reap huge benefits using the most basic of corpus tools. These projects are consonant with a growing ethos in language documentation circles that advocates for the value of corpus development alongside more traditional documentary activities (cf. McEnery & Ostler 2000, Woodbury 2003, Crowley 2007, Cox 2011, Mosel 2014, Vinogradov 2016). Each corpus is at a different stage of development, yet we hope to persuade community-based colleagues of the enormous benefits that ensue from the deliberate creation and use of a corpus of naturally occurring language data for language analysis and teaching. Direct benefits include ready-to-hand word lists; authentic sample utterances for exemplifying dictionaries, phrasebooks, and grammatical sketches; and a conscientious focus on recording many speakers across different demographic categories, discursive situations, and registers in order to achieve a broad range of usage conditions. A focus on wide and balanced sampling clearly strengthens the data pool from which analyses can follow.
But it also results in a closer connection by speakers/learners to important and recurring phenomena in their language rather than to descriptions of phenomena that may have emerged through bilingual situations with a handful of speakers under the direct control of linguists who do not speak the language (who may have been guided by theoretical concerns ...