Seq2Seq architecture for low-resource languages: methods for overcoming data scarcity

Doklady Bashkirskogo Universiteta. 2026. Volume 11. No. 2. pp. 188-194.

DOI: https://doi.org/10.33184/dokbsu-2026.2.21

Authors

Miftakhova R. G.*

Ufa University of Science and Technology

*E-mail: miftahovar@yandex.ru

Mylnikov N. M.

Ufa University of Science and Technology

Abstract

This study reviews approaches to adapting Seq2Seq models to low-resource language processing. It considers the key methods for overcoming the “curse of dimensionality” and data sparsity: subword tokenization algorithms (BPE, SentencePiece); cross-lingual transfer (multilingual transfer learning) from high-resource donor languages; and synthetic data augmentation (back-translation). Special attention is paid to modern parameter-efficient fine-tuning (PEFT) methods. The paper presents a comparative analysis of the performance of the described approaches, structured by the size of the available parallel data.

Keywords

machine translation
natural language processing
Sequence-to-Sequence
low-resource languages
transfer learning
back-translation
LoRA
