Big Data: Analysis of English Word Frequencies

Back in the 1960s, researcher Mark Mayzner wrote a seminal work on the frequency of English words, based on a 20,000-word corpus. The study was groundbreaking at the time. In 2012, Mayzner approached Peter Norvig of Google to see whether Google's massive collection of online data, the Google Corpus Data, might support a broader analysis of English word frequencies. Norvig drew on the Google Books Ngrams data and published his results. He focused on the 97,565 distinct words that each appeared more than 100,000 times in the corpus. Collectively, these words occurred 743 billion times within the Google Books data.

Unsurprisingly, “the” is the most frequent word in English. YouTuber Abacaba then made a visualization of some of Norvig’s more intriguing findings, such as how common longer words are. While “the” is the most frequently found word, it is only three letters long. What is the most common 10-character word? What about the most common 15-character word, or 20-character word? Curious? Watch Abacaba’s animation to learn more!
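
For readers who want to try the "most common word of each length" question on their own text, here is a minimal Python sketch. It is not Norvig's method or Abacaba's code. The function name `most_common_by_length` and the sample sentence are made up for illustration, and the word-splitting rule (runs of the letters a to z, lowercased) is a simplifying assumption.

```python
import re
from collections import Counter


def most_common_by_length(text):
    """Return {word_length: (word, count)} for the most frequent word of each length.

    Hypothetical helper for illustration; words are runs of a-z after lowercasing.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    best = {}
    for word, count in counts.items():
        length = len(word)
        # Keep the first word seen at each length unless a later one is strictly more frequent.
        if length not in best or count > best[length][1]:
            best[length] = (word, count)
    return best


# Tiny made-up sample; a real analysis would need a far larger corpus.
sample = "the cat and the dog saw the bird and the fish"
result = most_common_by_length(sample)

for length in sorted(result):
    word, count = result[length]
    print(f"{length} letters: {word!r} ({count} occurrences)")
```

On a sample this small the answers mean little, but the same loop works on any plain-text file you feed it.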
