Быков Ф.Ю., Крижановский А.А.
Поиск почти похожих текстов в лингвистическом корпусе ВепКар
// Труды КарНЦ РАН. No 4. Сер. Математическое моделирование и информационные технологии. 2023. C. 16-23
Bykov F.Yu., Krizhanovsky A.A. Search for near-duplicate texts in the linguistic corpus VepKar // Transactions of the Karelian Research Centre of the Russian Academy of Sciences. No. 4. Mathematical Modeling and Information Technologies. 2023. Pp. 16-23
Keywords: corpus linguistics; near-duplicate texts; Kendall rank correlation
Developers of linguistic corpora need to detect and eliminate duplicate texts. This article reviews approaches to finding near-duplicate texts in various corpora. An algorithm and a program for finding near-duplicate texts, based on the number of bigrams two texts have in common, have been developed. Experiments were carried out on texts from the Open Corpus of Veps and Karelian Languages (VepKar). The program found the 100 most similar pairs of texts and presented them to an expert, who confirmed 42 of them as duplicates. Three text similarity metrics were considered. Kendall's rank distance was used to identify the metric whose ordering of text pairs was closest to the expert's assessment. The program will be a useful tool for editors of the VepKar corpus.
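
The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' program: the abstract does not say whether bigrams are built from words or characters, how texts are tokenized, or which three metrics were compared, so the sketch assumes word bigrams over lowercased, whitespace-split text and uses the raw count of shared bigrams as the score. All function names are hypothetical.

```python
from itertools import combinations


def word_bigrams(text):
    """Return the set of adjacent word pairs in a text.

    Assumption: tokens are lowercased and split on whitespace.
    """
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def common_bigrams(text_a, text_b):
    """Count the distinct bigrams that occur in both texts."""
    return len(word_bigrams(text_a) & word_bigrams(text_b))


def most_similar_pairs(texts, top_n=100):
    """Score every pair of texts and return the top_n highest-scoring pairs.

    texts maps a text identifier to its content; the result is a list of
    (id_a, id_b, score) tuples sorted by descending score.
    """
    scored = [
        (a, b, common_bigrams(texts[a], texts[b]))
        for a, b in combinations(sorted(texts), 2)
    ]
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_n]


def kendall_tau_distance(ranking_a, ranking_b):
    """Count the item pairs that the two rankings place in opposite order.

    Both rankings must contain the same items, each exactly once.
    """
    position_b = {item: index for index, item in enumerate(ranking_b)}
    return sum(
        1
        for x, y in combinations(ranking_a, 2)
        if position_b[x] > position_b[y]
    )
```

Under these assumptions, candidate pairs from `most_similar_pairs` would go to an expert for review, and `kendall_tau_distance` would compare the order a metric assigns to those pairs with the order the expert assigns: the smaller the distance, the closer the metric is to the expert. Comparing all pairs is quadratic in the number of texts, so a large corpus would likely need an index of bigrams rather than this brute-force loop.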
