Community-based corpus-building: Three case studies
We describe three ongoing projects involving different First Peoples’ languages of Canada (Cree/nehiyawewin, Dene Sųłiné, and Nakoda/Stoney) that centre around the recording, transcription, compilation, and analysis of spontaneous oral language use––some narrative, some conversation––us...
Main Authors: | Rice, Sally; Thunder, Dorothy |
---|---|
Format: | Text |
Language: | unknown |
Published: | 2017 |
Online Access: | http://hdl.handle.net/10125/42052 |
id |
ftolac:oai:scholarspace.manoa.hawaii.edu:10125/42052 |
---|---|
record_format |
openpolar |
institution |
Open Polar |
collection |
OLAC: Open Language Archives Community |
op_collection_id |
ftolac |
language |
unknown |
format |
Text |
author |
Rice, Sally; Thunder, Dorothy |
title |
Community-based corpus-building: Three case studies |
publishDate |
2017 |
url |
http://hdl.handle.net/10125/42052 |
genre |
Nakoda |
op_relation |
http://hdl.handle.net/10125/42052
Rice, Sally; Thunder, Dorothy; 2017-03-03
We describe three ongoing projects involving different First Peoples’ languages of Canada (Cree/nehiyawewin, Dene Sųłiné, and Nakoda/Stoney) that centre around the recording, transcription, compilation, and analysis of spontaneous oral language use (some narrative, some conversation) using freely available, Unicode-savvy corpus software (in this case, AntConc [Anthony 2014]) and little to no up-front annotation or translation into English. Because these languages are all polysynthetic, lemmatization and part-of-speech (POS) tagging are either unachievable or excessively time-consuming and indeterminate. Nevertheless, corpus creation can still continue apace and reap huge benefits using the most basic of corpus tools. These projects are consonant with a growing ethos in language documentation circles that advocates for the value of corpus development alongside more traditional documentary activities (cf. McEnery & Ostler 2000, Woodbury 2003, Crowley 2007, Cox 2011, Mosel 2014, Vinogradov 2016). Each corpus is at a different stage of development, yet we hope to persuade community-based colleagues of the enormous benefits that ensue from the deliberate creation and use of a corpus of naturally occurring language data for language analysis and teaching. Direct benefits include ready-to-hand word lists; authentic sample utterances for exemplifying dictionaries, phrasebooks, and grammatical sketches; and a conscientious focus on recording many speakers across different demographic categories, discursive situations, and registers in order to achieve a broad range of usage conditions. A focus on wide and balanced sampling clearly strengthens the data pool from which analyses can follow. It also gives speakers and learners a closer connection to important and recurring phenomena in their language, rather than to descriptions of phenomena that may have emerged through bilingual situations with a handful of speakers under the direct control of linguists who do not speak the language (and who may have been guided by theoretical concerns unrelated to actual language use). Our demonstration corpora vary in size and composition, but each is already useful in revealing frequency, collocational, and distributional information about lexical items and morphosyntactic devices that may have received scant prior attention. We discuss the basics of corpus creation from scratch and the role of strategic metadata and file-naming practices, and we illustrate the types of immediately interpretable analyses that standard corpus tools can provide with monolingual, untagged transcripts. Best of all, once the central principles and logistics of corpus creation are mastered, the corpus can grow in a natural and incremental way, involving an expanding group of participants. Ultimately, a broadly sampled corpus can provide a solid empirical basis for the study of lexico-syntactic phenomena, not to mention a lasting, reusable, and shareable record of actual language use.
References
Anthony, L. 2014. AntConc (Version 3.4.1m) [Computer Software]. Tokyo: Waseda University. Available from http://www.laurenceanthony.net/.
Cox, C. 2011. Corpus linguistics and language documentation: Challenges for collaboration. In Newman, J., R. H. Baayen, & S. Rice (eds.), Corpus-Based Studies in Language Use, Language Learning, and Language Documentation, 239-264. Amsterdam: Brill.
Crowley, T. 2007. Field Linguistics: A Beginner’s Guide. Oxford: Oxford University Press.
McEnery, T. & N. Ostler. 2000. A new agenda for corpus linguistics – working with all of the world’s languages. Literary and Linguistic Computing 15(4): 403-420.
Mosel, U. 2014. Corpus linguistic and documentary approaches in writing a grammar of a previously undescribed language. Language Documentation and Conservation 8: 135-157.
Vinogradov, I. 2016. Linguistic corpora of understudied languages: Do they make sense? Káñina 40(1): 127-141.
Woodbury, T. 2003. Defining documentary linguistics. Language Documentation and Description 1(1): 35-51.
Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/42052 |
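The abstract's point that word lists and concordances can be drawn from monolingual, untagged transcripts can be illustrated with a short sketch. This is not AntConc's implementation and not the authors' workflow; the function names `tokenize`, `word_list`, and `kwic` are hypothetical, the lowercasing and the treatment of apostrophes and hyphens as word-internal are assumptions that a given orthography may not share, and the sample strings are placeholders rather than real language data.

```python
import re
import unicodedata
from collections import Counter

# Assumption: a token is a run of word characters, combining diacritics
# (U+0300-U+036F), apostrophes, and hyphens.
TOKEN_RE = re.compile(r"[\w\u0300-\u036f'’-]+")


def tokenize(text):
    """Split a transcript into lowercase tokens without any annotation."""
    text = unicodedata.normalize("NFC", text).lower()
    # Drop matches that contain no letter or digit (e.g. a stray hyphen).
    return [t for t in TOKEN_RE.findall(text) if any(c.isalnum() for c in t)]


def word_list(transcripts):
    """Count token frequencies across a collection of transcript strings."""
    counts = Counter()
    for text in transcripts:
        counts.update(tokenize(text))
    return counts


def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    tokens = tokenize(text)
    target = unicodedata.normalize("NFC", keyword).lower()
    hits = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Placeholder strings, not real language data.
    sample = ["aba kata aba mino", "mino aba - kata"]
    print(word_list(sample).most_common(2))
    print(kwic(sample[0], "aba", width=1))
```

Normalizing to NFC and admitting combining marks into the token pattern matters for orthographies that stack diacritics (for example a vowel carrying both an ogonek and a tone mark), since Python's `\w` alone does not match a combining mark and would split such a word in two.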