You may have heard about a massive new database that Google has provided to academia. Happily, they've also shared their new toy with us armchair nerds.
Over the past several years, Google and its university partners have been scanning every book they can get their hands on into the searchable Google Books resource. Despite the lawsuits, they've collected over 15 million books. Meanwhile, a team at Harvard led by researchers Jean-Baptiste Michel and Erez Lieberman Aiden has been digging through this immense trove of data and pulling out all kinds of gems.
For their first study, published last week in Science, the authors pared down the data set to only the most reliable books--excluding, for example, those with blurry scans or uncertain dates of publication. The resulting data set contained 5 million books. By searching the database for words and phrases (n-grams), the researchers were able to track patterns and changes in the English language. You can read their whole study, and see all their graphs, at the link above (with a free registration).
Among other findings, they showed how the number of English words has been steadily increasing...