CORPUS LINGUISTICS IN UZBEKISTAN: CURRENT STATE, CHALLENGES, AND PROSPECTS FOR UZBEK LANGUAGE RESEARCH
Keywords:
corpus linguistics, Uzbek language, language corpus, NLP, under-resourced languages, Uzbekistan
Abstract
This article examines the development and current status of corpus linguistics in Uzbekistan, focusing on the construction, availability, and application of Uzbek-language corpora in linguistic research and language education. Using a mixed-methods approach that combines a systematic review of existing corpora and published research with semi-structured interviews with ten Uzbek corpus linguists, the study identifies critical infrastructure gaps, methodological challenges, and institutional barriers that constrain the field. The results show that although several Uzbek corpora exist, including the Uzbek National Corpus and domain-specific sub-corpora, they remain considerably smaller and less richly annotated than comparable corpora for major world languages. Furthermore, corpus-based methods are underutilised in Uzbek linguistics pedagogy and applied language research. The article concludes with evidence-based recommendations for scaling corpus infrastructure, fostering interdisciplinary collaboration, and integrating corpus tools into university curricula. The findings contribute to the growing literature on corpus linguistics in under-resourced language contexts and provide a roadmap for Uzbek language technology development.