Examining Lexical Diversity with Entropy and Other Mathematical Measures
Abstract
Lexical diversity, a primary indicator of language proficiency and textual complexity, quantifies the variety of words produced in a given text or speech sample. Conventional measures such as the type-token ratio (TTR) are sensitive to text length, which has motivated more sophisticated mathematical metrics such as entropy, the Shannon diversity index, and advanced statistical models. This article examines existing entropy-based lexical diversity measures, covering their theoretical basis, their computation, and their applications in linguistics, psycholinguistics, and natural language processing (NLP). Comparative analysis shows that entropy-based measures give a more accurate indication of lexical richness than conventional methods, especially for texts of varying lengths. The article closes with suggestions for further research to improve the measurement of lexical diversity by combining machine learning with large-scale corpus analysis.
Keywords: Lexical diversity, entropy, Shannon index, type-token ratio, computational linguistics.
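As a minimal sketch of the measures named above, the following Python snippet computes both the type-token ratio and Shannon entropy for a tokenized sample; the function names and toy sentence are illustrative assumptions, not drawn from the article itself.

import math
from collections import Counter

def type_token_ratio(tokens):
    # Distinct word types divided by total tokens; known to
    # decrease as text length grows, which motivates the
    # entropy-based alternatives discussed in this article.
    return len(set(tokens)) / len(tokens)

def shannon_entropy(tokens):
    # Shannon entropy H = -sum(p_i * log2(p_i)) over the word
    # frequency distribution; higher values indicate a more even
    # spread of probability mass across word types.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical toy sample, for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug".split()
print(f"TTR:     {type_token_ratio(sample):.3f}")
print(f"Entropy: {shannon_entropy(sample):.3f} bits")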