Saturday, October 21, 2023

How many Unique Words are in a Text?

How much text would you have to read to get a good grasp of a language? There is a formula that estimates how many unique words you will meet as you read a given amount of text.

Heaps' law lets you estimate how many unique words you will have seen after reading 10k words, 20k words and so on. As the number of words read goes up, new words turn up less and less often.
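Heaps' law is usually written as V(n) = K * n^beta, where n is the number of words read and V(n) is the number of unique words seen so far. Here is a minimal Python sketch of that formula. The defaults K = 40 and beta = 0.5 are illustrative assumptions only (for English text K is commonly quoted as somewhere between 10 and 100, and beta between 0.4 and 0.6), and the function name heaps_estimate is made up for this example.

```python
def heaps_estimate(n_words: int, k: float = 40.0, beta: float = 0.5) -> float:
    """Estimate unique words seen after reading n_words, using V = k * n**beta.

    k and beta are assumed, illustrative values; real values depend on the
    language and the kind of text, and should be fitted to actual data.
    """
    if n_words < 0:
        raise ValueError("n_words must be non-negative")
    return k * n_words ** beta


if __name__ == "__main__":
    # Each extra block of text adds fewer new words than the block before it.
    for n in (10_000, 20_000, 50_000, 100_000):
        print(f"{n:>7,} words read -> about {heaps_estimate(n):,.0f} unique words")
```

Because beta is less than 1, doubling the amount of text less than doubles the number of unique words, which is the slowdown described above.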



For reference, about 3,000 words is enough to carry out a lot of everyday conversations, and fluent speakers know about 10,000 words. You can check the word counts of famous books here to put this graph into context. Graph code here.

Languages are not just lists of words to memorise; you also have to learn common patterns and grammar. But as new words become rarer the more text you see, you also get more repetitions of the common patterns, which helps you internalise them.

There is a collection of short books by Irish authors, called Open Door, written to help with literacy skills. The books are about 10K words each and, being by famous authors, together cover a lot of the language. If you read Patricia Scanlan's novella you would see about 2,000 unique words. Roddy Doyle's would bring you up to 2,600. Marian Keyes, Maeve Binchy, John Connolly and other great writers are in the series, and each of their books will keep adding new words. This graph is an estimate of this coverage.
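As a rough sketch of how such an estimate could be made (this is not the graph code linked above), the two figures in the paragraph can be treated as two points on a Heaps' law curve V(n) = K * n^beta and solved for K and beta. The assumptions are loud ones: every book is taken to be exactly 10,000 words, the counts of 2,000 and 2,600 are treated as exact, and the function name fit_heaps is hypothetical.

```python
import math


def fit_heaps(n1: float, v1: float, n2: float, v2: float) -> tuple[float, float]:
    """Solve V = k * n**beta for k and beta from two (words read, unique words) points."""
    beta = math.log(v2 / v1) / math.log(n2 / n1)
    k = v1 / n1 ** beta
    return k, beta


# Assumed data points taken from the text: one book (~10K words) gives ~2,000
# unique words, two books (~20K words) give ~2,600.
k, beta = fit_heaps(10_000, 2_000, 20_000, 2_600)

if __name__ == "__main__":
    print(f"k = {k:.1f}, beta = {beta:.3f}")
    # Extrapolate to more books in the series, assuming 10K words per book.
    for books in range(1, 11):
        n = books * 10_000
        print(f"{books:>2} books ({n:>7,} words): about {k * n ** beta:,.0f} unique words")
```

Two points only pin down the curve exactly; they say nothing about how well it fits. Counting the unique words in the actual books would be the better check.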



I bring up the series because the books have Irish language versions, including with audio. Irish does not have much content that is available in English, in Irish and with audio. I will discuss how audio and text might combine to help language learning in my next post.

