Workshop on NLP Applications to Field Linguistics
This year, we offer two shared tasks on processing speech in field linguistic recordings. Linguistic data collection involves recording narratives, wordlists, and grammatical questionnaires (enquetes). Narratives are a priceless source of linguistic, anthropological, and socio-cultural information. Wordlists are basic building blocks for everyone who studies or learns a language. Enquetes provide a unique view of what is plausible and what is forbidden in a language, giving researchers negative examples with which to adjust their theoretical models.
Automatic speech processing can reduce the time spent on processing language data. In the shared tasks, we seek ways to cut down on this monotonous routine. We propose two tasks targeting two stages of annotating linguistic recordings: diarization and transcription (ASR).
Before transcribing the speech, we want to identify who speaks and when. Unlike common speech corpora, field recordings often contain multiple types of noise, such as howling wind and animal shrieks. In addition, two or more participants interact using two languages.
We are particularly interested in finding the native speakers’ segments. In the training data, only the target-language segments and the overall number of speakers are annotated, and only this information is taken into account in the automatic evaluation during the coding period. Evaluation on the held-out dataset, in which both under-resourced and high-resourced language chunks are annotated, will take place after the submission deadline.
You are expected to provide a transcription of a given recording of speech in an under-resourced language.
We don’t expect a fully accurate transcription at this stage. Linguistically and phonetically motivated errors receive a smaller penalty than uninterpretable ones. That is, predicting /s/ instead of /z/ is penalized less than predicting /s/ instead of /b/. We also do not take word boundary detection into account. Therefore, the predictions “the cat sat on the mat” and “theca tsaton them at” are considered equal.
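To illustrate the idea behind this kind of scoring, here is a minimal sketch of a feature-weighted edit distance that ignores whitespace. It is not the official evaluation metric: the `FEATURES` table, the cost formula, and the function names `substitution_cost` and `weighted_distance` are all assumptions made for this example, and the table covers only a handful of consonants.

```python
# Illustrative only: a toy phonetic feature table (voicing, place, manner).
# The features and costs are assumptions, not the organizers' scoring scheme.
FEATURES = {
    "s": ("voiceless", "alveolar", "fricative"),
    "z": ("voiced", "alveolar", "fricative"),
    "t": ("voiceless", "alveolar", "plosive"),
    "d": ("voiced", "alveolar", "plosive"),
    "p": ("voiceless", "bilabial", "plosive"),
    "b": ("voiced", "bilabial", "plosive"),
}


def substitution_cost(a: str, b: str) -> float:
    """Cost in [0, 1]: the share of phonetic features on which a and b differ."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        # Symbols missing from the toy table get the full penalty.
        return 1.0
    differing = sum(x != y for x, y in zip(fa, fb))
    return differing / len(fa)


def weighted_distance(reference: str, hypothesis: str) -> float:
    """Edit distance with feature-weighted substitutions, ignoring whitespace."""
    # Remove all whitespace so that word boundaries do not affect the score.
    ref = "".join(reference.split())
    hyp = "".join(hypothesis.split())

    # Standard dynamic programming over two rows; insertions and deletions cost 1.
    prev = [float(j) for j in range(len(hyp) + 1)]
    for i, r in enumerate(ref, start=1):
        curr = [float(i)]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1.0,                          # deletion
                curr[j - 1] + 1.0,                      # insertion
                prev[j - 1] + substitution_cost(r, h),  # substitution
            ))
        prev = curr
    return prev[-1]
```

Under these assumptions, `weighted_distance("s", "z")` is smaller than `weighted_distance("s", "b")`, because /s/ and /z/ differ only in voicing while /s/ and /b/ differ in all three features. Comparing “the cat sat on the mat” with “theca tsaton them at” gives a distance of zero, since only the spacing differs.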