A language blogger writing about technology is, of course, very interesting for Elangco. Today I found just such an entry, about tools for analysing n-gram frequencies in a large database of text. (An n-gram, in language, is a sequence of n words or letters.)
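To make that definition a bit more concrete, here is a minimal Python sketch of my own (it is not taken from the post I am discussing). The helper name `ngrams` and the example sentence are made up for illustration only.

```python
def ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word n-grams: split the text into words first.
words = "the quick brown fox".split()
word_bigrams = ngrams(words, 2)

# Letter n-grams: a string is already a sequence of letters.
letter_trigrams = ngrams("fox", 3)

print(word_bigrams)
print(letter_trigrams)
```

So a 2-gram (or bigram) of words is simply every pair of neighbouring words, and the same idea works for letters.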
I have always been a fan of the postings and podcasts on language by @GrammarGirl (Mignon Fogarty). But now she has surprised me with a posting on technology. In this post she explains Google’s Ngram Viewer, a tool for searching word frequencies in the vast database of Google Books.
I didn’t know there were tools like this out there on the Internet, but she even points to several alternatives for the Google tool and explains some (dis)advantages of those. It seems like a great toy to play with, but it can also be very useful.
I used to do language research like this by entering words or short phrases into Google in different spellings and checking the number of hits found. But that will only tell you about the spelling in use TODAY. And of course the fact that a majority of the people on the Internet spell a word in a certain way does not guarantee that the spelling is correct at all!
No, if you are looking for correct spelling, use a dictionary. And if you are looking for correct grammar, listen to GrammarGirl!
Anyway, I visited the Ngram pages for a quick look and I noticed that Google even offers the raw data for download. With that, you could do your own research and, for instance, compare different spellings of a word. That would solve one of the disadvantages Mignon mentions in her post: the tool supports only limited comparison of different n-grams.
But you will have to work really hard to analyse the material yourself: it is a huge amount of data! They also have statistics for different languages. Unfortunately no Dutch or Portuguese yet…
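To give an idea of what such a do-it-yourself spelling comparison could look like, here is a rough Python sketch. It rests on an assumption I have not checked against every release of the data: that each line of a raw file is tab-separated, with the n-gram, the year, the number of matches and the number of volumes, in that order. The function name `yearly_counts` and the sample lines are invented for illustration and are not real figures.

```python
from collections import defaultdict


def yearly_counts(lines, spellings):
    """Sum the match counts per year for each spelling we care about.

    Assumes each line looks like: ngram<TAB>year<TAB>match_count<TAB>volume_count
    """
    wanted = set(spellings)
    counts = {spelling: defaultdict(int) for spelling in wanted}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip malformed lines and n-grams we are not comparing.
        if len(fields) < 3 or fields[0] not in wanted:
            continue
        counts[fields[0]][int(fields[1])] += int(fields[2])
    return counts


# Made-up sample lines in the assumed format (NOT real data).
sample_lines = [
    "analyse\t1900\t120\t30\n",
    "analyze\t1900\t80\t25\n",
    "analyse\t2000\t200\t60\n",
    "analyze\t2000\t450\t90\n",
    "analysis\t2000\t999\t99\n",
]

counts = yearly_counts(sample_lines, ["analyse", "analyze"])
for year in sorted(counts["analyse"]):
    s_form = counts["analyse"][year]
    z_form = counts["analyze"][year]
    print(year, "analyse:", s_form, "analyze:", z_form)
```

Because the function reads one line at a time, you could hand it an open file instead of the small list, so the huge files never have to fit in memory all at once.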