ARPA Language Model File Format

Tuesday, August 12th, 2008

The format of an ARPA language model file, as far as I can tell, is not documented anywhere outside of the CMU SLM toolkit source code. I’m regurgitating the relevant source comment here in the hope that future Google searches on this topic are more fruitful than mine were. :)

/* This is the format introduced and first used by Doug Paul.
   Optionally use a given symbol for the UNK word (id==0).
*/
/*
Format of the .arpabo file:
------------------------------
\data\
ngram 1=4989
ngram 2=835668
ngram 3=12345678

\1-grams:
...
-0.9792 ABC -2.2031
...
log10_uniprob(ZWEIG) ZWEIG log10_alpha(ZWEIG)

\2-grams:
...
-0.8328 ABC DEFG -3.1234
...
log10_bo_biprob(WAS | ZWEIG) ZWEIG WAS log10_bialpha(ZWEIG,WAS)

\3-grams:
...
-0.234 ABCD EFGHI JKL
...

\end\
*/
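
To make the layout above concrete, here is a minimal Python sketch of a reader for it. This is not the CMU toolkit's own code. The function name `parse_arpa` and the `SAMPLE` text are made up for illustration, and the sketch assumes what the comment implies: one entry per line, whitespace-separated fields, a log10 probability first, then the n words, then an optional log10 back-off weight (absent for the highest-order n-grams).

```python
from typing import Dict, Optional, Tuple

# (log10 probability, log10 back-off weight, or None when the field is absent)
Entry = Tuple[float, Optional[float]]
NGramTable = Dict[int, Dict[Tuple[str, ...], Entry]]


def parse_arpa(text: str) -> Tuple[Dict[int, int], NGramTable]:
    """Parse ARPA-style text into (declared counts, n-gram tables by order)."""
    counts: Dict[int, int] = {}
    ngrams: NGramTable = {}
    section = None  # None before \data\, 0 inside \data\, n inside \n-grams:
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            section = 0
        elif line == "\\end\\":
            break
        elif line.startswith("\\") and line.endswith("-grams:"):
            # Section header such as "\2-grams:" switches to order 2.
            section = int(line[1:-len("-grams:")])
            ngrams.setdefault(section, {})
        elif section == 0 and line.startswith("ngram "):
            # Count declaration such as "ngram 2=835668".
            order, count = line[len("ngram "):].split("=")
            counts[int(order)] = int(count)
        elif section:
            fields = line.split()
            if len(fields) < section + 1:
                continue  # too short to be an entry (e.g. a "..." placeholder)
            logprob = float(fields[0])
            words = tuple(fields[1:1 + section])
            has_backoff = len(fields) > section + 1
            backoff = float(fields[section + 1]) if has_backoff else None
            ngrams[section][words] = (logprob, backoff)
    return counts, ngrams


# Hypothetical two-word model; only the ABC unigram values come from the comment.
SAMPLE = r"""
\data\
ngram 1=2
ngram 2=1

\1-grams:
-0.9792 ABC -2.2031
-1.5 DEFG -0.3

\2-grams:
-0.8328 ABC DEFG

\end\
"""

counts, ngrams = parse_arpa(SAMPLE)
```

The tables are keyed by word tuples so that looking up a bigram or trigram is a single dictionary access. Anything before `\data\` is ignored, which I'm assuming is safe because files I've seen sometimes carry free-form header text there.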

<Kered.org>   © Copyright 2000-2005 by Derek Anderson