STATISTICAL MEASURES IN CORPUS LINGUISTICS
Keywords:
Corpus linguistics, statistical measures, frequency, dispersion, log-likelihood, chi-square, Mutual Information, t-score, keyness, collocation, concordance, frequency distribution, data analysis, quantitative linguistics, significance testing, lexical association, corpus comparison, word frequency list, empirical research, computational linguistics
Abstract
This article provides an overview of the most important statistical measures used in corpus linguistics for quantitative analysis of language data. Modern corpus linguistics combines linguistics with statistics to study how often words, structures, and patterns occur in real communication. Statistical measures such as frequency, dispersion, log-likelihood, chi-square, Mutual Information (MI), and t-score help researchers identify meaningful linguistic patterns, collocations, and register variation.
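As an illustration of how three of the association measures named above are commonly computed for a word pair, the following is a minimal sketch using their standard textbook definitions. It is not taken from this article. The function names and the counts in the usage lines are hypothetical, and the sketch assumes the inputs are the observed co-occurrence frequency `o`, the individual word frequencies `f1` and `f2`, and the corpus size `n` in tokens.

```python
import math


def expected(f1, f2, n):
    """Expected co-occurrence frequency if the two words were independent."""
    return f1 * f2 / n


def mutual_information(o, f1, f2, n):
    """MI score: log2 of observed over expected co-occurrence frequency."""
    return math.log2(o / expected(f1, f2, n))


def t_score(o, f1, f2, n):
    """t-score: (observed - expected) divided by the square root of observed."""
    return (o - expected(f1, f2, n)) / math.sqrt(o)


def log_likelihood(o, f1, f2, n):
    """Log-likelihood (G2) over the 2x2 contingency table of the word pair."""
    # Cells: (w1, w2), (w1, not w2), (not w1, w2), (not w1, not w2)
    observed = [o, f1 - o, f2 - o, n - f1 - f2 + o]
    row_totals = [f1, f1, n - f1, n - f1]
    col_totals = [f2, n - f2, f2, n - f2]
    g2 = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        exp = row * col / n
        if obs > 0:  # an empty cell contributes nothing to the sum
            g2 += obs * math.log(obs / exp)
    return 2 * g2


# Hypothetical counts, for illustration only: the pair co-occurs 10 times,
# each word occurs 100 times, and the corpus contains 10,000 tokens.
mi = mutual_information(10, 100, 100, 10_000)
t = t_score(10, 100, 100, 10_000)
g2 = log_likelihood(10, 100, 100, 10_000)
print(f"MI = {mi:.2f}, t = {t:.2f}, G2 = {g2:.2f}")
```

In this formulation MI tends to reward rare, exclusive pairings, whereas the t-score and log-likelihood give more weight to the amount of evidence, which is one reason the measures are usually reported side by side rather than used alone.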