Field matters

The first workshop on NLP applications to field linguistics

Call for papers
Shared tasks

Field Matters: Speech Processing Tasks

Important dates

What’s new?

July 17

July 10

June 3

May 17

Description

This year, we offer two shared tasks on processing speech in field linguistic recordings. Linguistic data collection involves recording narratives, wordlists, and grammatical elicitation sessions (enquetes). Narratives are a priceless source of linguistic, anthropological, and socio-cultural information. Wordlists are basic building blocks for everyone who studies or learns a language. Enquetes provide a unique view of what is plausible and what is forbidden in a language, supplying researchers with negative examples against which to adjust their theoretical models.

Automatic speech processing can greatly reduce the time spent on language data treatment. In these shared tasks, we seek ways to reduce this monotonous routine. We propose two tasks, targeting two stages of annotating linguistic recordings: diarization and transcription (ASR).

Diarization

Before transcribing the speech, we want to identify who speaks and when. Unlike common speech corpora, field recordings often contain multiple types of noise, such as howling wind, animal shrieks, and others. In addition, two or more participants interact using two languages.

We are particularly interested in finding the native speakers’ segments. In the training data, only the target language and the overall number of speakers are annotated, and only this information is taken into account in the automatic evaluation during the coding period. Evaluation on the held-out dataset, where both the under-resourced-language and high-resource-language chunks are annotated, will take place after the submission deadline.

Pilot data

  •   dia_data.csv — pilot dataset for the Diarization track
  •   sound.zip — an archive containing the files referenced in pilot dataset

Train 1 data

  •   dia_data.csv — Train 1 dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Train 2 data

  •   dia_data.csv — Train 2 dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Test data

  •   dia_data.csv — Test dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Baseline solution

  • As a baseline for the diarization task, we take pyannote-audio.
  • For the diarization task, we will measure a weighted Jaccard error rate. The weights differ for native speakers of under-resourced languages and for linguists. A missed segment will also weigh more than a falsely detected one.
  •   diarization_baseline.ipynb
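The exact weights are not published here; the sketch below illustrates a weighted Jaccard error of this kind for a single speaker, under hypothetical weights where a missed segment costs more than a falsely detected one:

```python
def overlap(a, b):
    """Total overlap duration between two segment lists [(start, end), ...]."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def duration(segments):
    return sum(end - start for start, end in segments)

def weighted_jaccard_error(reference, hypothesis, w_miss=1.0, w_fa=0.5):
    """Weighted Jaccard error rate for one speaker (a sketch).

    w_miss and w_fa are hypothetical weights: missed reference speech is
    penalised more heavily than falsely detected speech. Segments within
    each list are assumed not to overlap each other.
    """
    inter = overlap(reference, hypothesis)
    miss = duration(reference) - inter      # reference speech not covered
    fa = duration(hypothesis) - inter       # detected speech with no reference
    union = duration(reference) + duration(hypothesis) - inter
    if union == 0:
        return 0.0
    return (w_miss * miss + w_fa * fa) / union

# Example: one reference segment is cut short by one second.
# weighted_jaccard_error([(0, 2), (3, 5)], [(0, 2), (3, 4)])  ->  0.25
```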

ASR

You are expected to provide a transcription of a given recording of under-resourced-language speech.

We do not expect an accurate transcription at this stage. Linguistically and phonetically motivated errors receive a smaller penalty than uninterpretable ones: predicting /s/ instead of /z/ is penalised less than predicting /s/ instead of /b/. We also pay no attention to word boundary detection, so the predictions “the cat sat on the mat” and “theca tsaton them at” are considered equal.
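This scoring scheme can be sketched as a Levenshtein alignment over phone strings with spaces stripped, where the substitution cost depends on phonetic similarity. The similarity table below is a hypothetical stand-in, not the task's actual one:

```python
def sub_cost(a, b):
    """Hypothetical substitution cost: cheap for phonetically close pairs."""
    if a == b:
        return 0.0
    # Assumed similarity classes, e.g. voicing pairs like /s/~/z/.
    close_pairs = {("s", "z"), ("p", "b"), ("t", "d"), ("k", "g"), ("f", "v")}
    if (a, b) in close_pairs or (b, a) in close_pairs:
        return 0.5
    return 1.0

def weighted_per(reference, hypothesis):
    """Phone error rate with weighted substitutions (a sketch).

    Spaces are removed first, so word-boundary differences are not errors.
    """
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    n, m = len(ref), len(hyp)
    # Standard dynamic-programming edit distance with a custom sub cost.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                  # deletion
                d[i][j - 1] + 1.0,                                  # insertion
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]), # substitution
            )
    return d[n][m] / max(n, 1)

# weighted_per("the cat sat", "theca tsat")  ->  0.0  (only boundaries differ)
# weighted_per("za", "sa")                   ->  0.25 (voicing error, half cost)
```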

Pilot data

  •   asr_data.csv — pilot dataset for the ASR track
  •   sound.zip — an archive containing the files referenced in pilot dataset

Train 1 data

  •   asr_data.csv — Train 1 dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Train 2 data

  •   asr_data.csv — Train 2 dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Test data

  •   asr_data.csv — Test dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Baseline solution

  • Our baseline for ASR is based on the wav2vec2 model.
  • For the ASR task, we will measure a phonetic error rate with weights based on the phonetic similarity between a recognised phoneme and the reference.
  •   asr_baseline.ipynb

Data sample

start end transcription
00:00:00.019 00:00:02.321 ekmu unideji enikrɨn oram
00:00:02.640 00:00:03.900 (a linguist speaking)
00:00:04.036 00:00:07.330 ekmu uni unirin oram
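The timestamps in the sample follow an HH:MM:SS.mmm pattern. A small helper (a sketch, assuming that format throughout) converts them to seconds, e.g. for slicing segments out of the audio files:

```python
def to_seconds(timestamp):
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# e.g. the first sample row spans roughly 2.3 seconds:
# to_seconds("00:00:02.321") - to_seconds("00:00:00.019")
```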