BUILDING A BALANCED CORPUS: PRINCIPLES AND CHALLENGES
Keywords:
corpus linguistics, balanced corpus, representativeness, linguistic data, annotation, sampling, computational linguistics
Abstract
Corpus linguistics has become one of the most dynamic areas of contemporary linguistic research, supporting the empirical study of language through large, structured collections of authentic texts. Constructing a balanced corpus is a fundamental yet complex process that directly determines the quality and representativeness of the linguistic analyses derived from it. This paper examines the key theoretical and practical principles of corpus design, balance, and representativeness. Drawing on international experience and methodological frameworks, it analyzes the challenges of building corpora for underrepresented and low-resource languages. The study discusses sampling strategies, metadata design, annotation standards, and the integration of multimodal and digital texts. It also outlines how corpus balance influences research outcomes and how computational tools can help maintain balance across genres, registers, and domains. The paper concludes by emphasizing the need for adaptive corpus design principles that reflect evolving communicative realities and digital language use.