Google's N-gram Corpus LDC2006T13
Google’s LDC2006T13 corpus is organized in an understandable but slightly annoying way: as a tar of split gzipped files. To avoid untarring it repeatedly (or, in fact, at all, since it’s >100GB extracted), I wrote a small Python generator that lets you iterate over the n-grams while the files stay compressed. Usage is something like this:
corpus = LDC2006T13()
for ngram, count in corpus.ngrams(3):
    print(ngram, count)
Code is here: LDC2006T13.py
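For a rough idea of how this kind of iteration can work, here is a minimal sketch. It is not the contents of LDC2006T13.py. It assumes the archive is a plain tar whose members are gzipped text files, that the files for order n sit under a directory named like `3gms/`, and that each line is an n-gram, a tab, and a count. The function name `iter_ngrams` is made up for illustration.

```python
import gzip
import tarfile


def iter_ngrams(tar_path, n):
    """Yield (ngram, count) pairs of order n without extracting the archive to disk."""
    # Assumption: members for order n live under a directory such as "3gms/".
    marker = "%dgms/" % n
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            if not member.isfile():
                continue
            if marker not in member.name or not member.name.endswith(".gz"):
                continue
            # extractfile() returns a file-like object over the member's bytes;
            # nothing is written to disk.
            raw = archive.extractfile(member)
            with gzip.GzipFile(fileobj=raw) as lines:
                for line in lines:
                    # Assumption: each line is "<ngram>\t<count>".
                    text = line.decode("utf-8").rstrip("\n")
                    ngram, _, count = text.rpartition("\t")
                    yield ngram, int(count)
```

Looping over `iter_ngrams(path, 3)` then behaves like the `corpus.ngrams(3)` call above, decompressing one member at a time, so memory and disk use stay small no matter how big the extracted corpus would be.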







