Google's N-gram Corpus LDC2006T13

Google’s LDC2006T13 corpus is organized in an understandable but slightly annoying way: as a tar of split gzipped files. To avoid having to untar it repeatedly (or, in fact, at all, as it’s >100GB extracted), I wrote a small Python generator that lets you iterate over the n-grams while the files stay compressed. Usage is something like this:

corpus = LDC2006T13()
# iterate over every 3-gram and its count
for ngram, count in corpus.ngrams(3):
  print(ngram, count)

Code is here: LDC2006T13.py
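
For anyone curious how reading the files in their compressed state can work, here is a minimal sketch of the general idea. It is not the linked LDC2006T13.py, just an illustration using the standard library. The function name `iter_ngrams`, the member naming pattern (`3gms/3gm-0000.gz` and so on), and the "n-gram, tab, count" line format are all assumptions on my part; adjust them to match your copy of the corpus.

```python
import gzip
import tarfile


def iter_ngrams(tar_path, n, member_pattern="{n}gms/{n}gm-"):
    """Yield (ngram, count) pairs without extracting the archive to disk.

    Assumptions (hypothetical, not taken from the original script):
    - members holding n-grams of order n contain member_pattern in their name
    - each decompressed line looks like "w1 w2 ... wn<TAB>count"
    """
    needle = member_pattern.format(n=n)
    with tarfile.open(tar_path, "r") as archive:
        for member in archive:
            # Skip directories and anything that is not a matching .gz file.
            if not member.isfile() or not member.name.endswith(".gz"):
                continue
            if needle not in member.name:
                continue
            # extractfile() returns a file-like object backed by the tar,
            # so the member is decompressed on the fly, line by line.
            raw = archive.extractfile(member)
            with gzip.open(raw, mode="rt", encoding="utf-8") as lines:
                for line in lines:
                    ngram, _, count = line.rstrip("\n").rpartition("\t")
                    if ngram:
                        yield ngram, int(count)
```

Calling `iter_ngrams("corpus.tar", 3)` (the path is a placeholder) gives the same kind of loop as the usage above. Because it is a generator, only one line is held in memory at a time, which is the whole point with a corpus this size.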

<Kered.org>   © Copyright 2000-2005 by Derek Anderson