Scientific publications

Кипяткова И.С., Родионова А.П., Кагиров И.А., Крижановский А.А.

Подготовка речевых и текстовых данных для создания системы автоматического распознавания карельской речи

// Ученые записки Петрозаводского государственного университета. Т. 45. № 5. 2023. C. 89–98

Kipyatkova, I.S., Rodionova, A.P., Kagirov, I.A., Krizhanovsky, A.A. Speech and text data preparation for developing an automatic speech recognition system for the Karelian language // Proceedings of Petrozavodsk State University. 45(5). 2023. P. 89–98

Keywords: Karelian language, Livvi-Karelian dialect, automatic natural language processing, speech recognition system training, datasets, corpus linguistics

This paper addresses several aspects of collecting and preparing language data for the Livvi dialect of the Karelian language, which are needed to train a system for automatic speech-to-text conversion. Such technologies are important for Karelian because of its status as a low-resource language, which is a serious obstacle to its study and preservation. The main tasks at the current stage of the research are to collect and annotate speech and text corpora and to create a transcription dictionary. The speech corpus includes audio recordings of 15 speakers (6 men and 9 women). All recordings were transcribed and segmented into single utterances. After the removal of “junk” fragments, the recordings totaled 3.5 hours. After the removal of repeated sentences, the text corpus contained over 5 million word tokens. A dictionary was built from the collected text corpus and will subsequently be used as part of the Karelian speech recognition system. All words in the dictionary were automatically given phonemic transcriptions. In further research, the collected text and speech data will be used to train and test the Livvi-Karelian speech recognition system.

DOI: 10.15393/uchz.art.2023.924

Indexed in RSCI

Last modified: July 20, 2023