Field matters

The first workshop on NLP applications to field linguistics

Call for papers
Shared tasks

Field Matters: Speech Processing Tasks

Important dates

What’s new?

July 17

July 10

June 3

May 17

Description

This year, we offer two shared tasks on processing speech in field linguistic recordings. Linguistic data collection involves recording narratives, wordlists, and grammatical elicitation sessions (enquetes). Narratives are a priceless source of linguistic, anthropological, and socio-cultural information. Wordlists are basic building blocks for everyone who studies or learns a language. Enquetes provide a unique view of what is plausible and what is forbidden in a language, supplying researchers with negative examples against which to adjust their theoretical models.

Automatic speech processing can greatly reduce the time spent on language data treatment. In these shared tasks, we seek ways to reduce this monotonous routine. We propose two tasks, targeting two stages of annotating linguistic recordings: diarization and transcription (ASR).

Diarization

Before transcribing the speech, we want to identify who speaks and when. Unlike common speech corpora, field recordings often contain multiple types of noise, such as howling wind, animal shrieks, and others. In addition, two or more participants interact using two languages.

We are particularly interested in finding the native speakers’ segments. In the training data, only the target language and the overall number of speakers are annotated, and only this information is taken into account in the automatic evaluation during the coding period. Evaluation on the held-out dataset, where both the under-resourced-language and high-resource-language chunks are annotated, will take place after the submission deadline.

Pilot data

  •   dia_data.csv — pilot dataset for the Diarization track
  •   sound.zip — an archive containing the files referenced in pilot dataset

Train 1 data

  •   dia_data.csv — Train 1 dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Train 2 data

  •   dia_data.csv — Train 2 dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Test data

  •   dia_data.csv — Test dataset for the Diarization track
  •   dia_sound.zip — an archive containing the files referenced in the dataset

Baseline solution

  • As a baseline for the diarization task, we take pyannote-audio.
  • For the diarization task, we will measure a weighted Jaccard error rate. The weights differ for native speakers of under-resourced languages and for linguists. A missed segment will also weigh more than a falsely detected one.
  •   diarization_baseline.ipynb
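The exact weights are not published here; the sketch below illustrates a weighted Jaccard error of this kind for a single speaker, under hypothetical weights where a missed segment costs more than a falsely detected one:

```python
def overlap(a, b):
    """Total overlap duration between two segment lists [(start, end), ...]."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def duration(segments):
    return sum(end - start for start, end in segments)

def weighted_jaccard_error(reference, hypothesis, w_miss=1.0, w_fa=0.5):
    """Weighted Jaccard error rate for one speaker (a sketch).

    w_miss and w_fa are hypothetical weights: missed reference speech is
    penalised more heavily than falsely detected speech. Segments within
    each list are assumed not to overlap each other.
    """
    inter = overlap(reference, hypothesis)
    miss = duration(reference) - inter      # reference speech not covered
    fa = duration(hypothesis) - inter       # detected speech with no reference
    union = duration(reference) + duration(hypothesis) - inter
    if union == 0:
        return 0.0
    return (w_miss * miss + w_fa * fa) / union

# Example: one reference segment is cut short by one second.
# weighted_jaccard_error([(0, 2), (3, 5)], [(0, 2), (3, 4)])  ->  0.25
```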

ASR

You are expected to provide a transcription of a given recording of under-resourced-language speech.

We do not expect an accurate transcription at this stage. Linguistically and phonetically motivated errors receive a smaller penalty than uninterpretable ones: predicting /s/ instead of /z/ is penalised less than predicting /s/ instead of /b/. We also pay no attention to word boundary detection, so the predictions “the cat sat on the mat” and “theca tsaton them at” are considered equal.
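This scoring scheme can be sketched as a Levenshtein alignment over phone strings with spaces stripped, where the substitution cost depends on phonetic similarity. The similarity table below is a hypothetical stand-in, not the task's actual one:

```python
def sub_cost(a, b):
    """Hypothetical substitution cost: cheap for phonetically close pairs."""
    if a == b:
        return 0.0
    # Assumed similarity classes, e.g. voicing pairs like /s/~/z/.
    close_pairs = {("s", "z"), ("p", "b"), ("t", "d"), ("k", "g"), ("f", "v")}
    if (a, b) in close_pairs or (b, a) in close_pairs:
        return 0.5
    return 1.0

def weighted_per(reference, hypothesis):
    """Phone error rate with weighted substitutions (a sketch).

    Spaces are removed first, so word-boundary differences are not errors.
    """
    ref = reference.replace(" ", "")
    hyp = hypothesis.replace(" ", "")
    n, m = len(ref), len(hyp)
    # Standard dynamic-programming edit distance with a custom sub cost.
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)
    for j in range(1, m + 1):
        d[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(
                d[i - 1][j] + 1.0,                                  # deletion
                d[i][j - 1] + 1.0,                                  # insertion
                d[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1]), # substitution
            )
    return d[n][m] / max(n, 1)

# weighted_per("the cat sat", "theca tsat")  ->  0.0  (only boundaries differ)
# weighted_per("za", "sa")                   ->  0.25 (voicing error, half cost)
```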

Pilot data

  •   asr_data.csv — pilot dataset for the ASR track
  •   sound.zip — an archive containing the files referenced in pilot dataset

Train 1 data

  •   asr_data.csv — Train 1 dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Train 2 data

  •   asr_data.csv — Train 2 dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Test data

  •   asr_data.csv — Test dataset for the ASR track
  •   asr_sound.zip — an archive containing the files referenced in the dataset

Baseline solution

  • Our baseline for ASR is based on the wav2vec2 model.
  • For the ASR task, we will measure a phonetic error rate with weights based on the phonetic similarity between a recognised phoneme and the reference.
  •   asr_baseline.ipynb

Data sample

start end transcription
00:00:00.019 00:00:02.321 ekmu unideji enikrɨn oram
00:00:02.640 00:00:03.900 (a linguist speaking)
00:00:04.036 00:00:07.330 ekmu uni unirin oram
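The timestamps in the sample follow an HH:MM:SS.mmm pattern. A small helper (a sketch, assuming that format throughout) converts them to seconds, e.g. for slicing segments out of the audio files:

```python
def to_seconds(timestamp):
    """Convert an 'HH:MM:SS.mmm' timestamp to seconds as a float."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# e.g. the first sample row spans roughly 2.3 seconds:
# to_seconds("00:00:02.321") - to_seconds("00:00:00.019")
```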