ARPA Language Model File Format
The format of an ARPA language model file is, as far as I can tell, not documented outside of the CMU SLM toolkit source code. I’m regurgitating it here in the hope that future Google searches on this topic are more fruitful than mine were.
/* This is the format introduced and first used by Doug Paul.
   Optionally use a given symbol for the UNK word (id==0).
*/

/* Format of the .arpabo file:
   ------------------------------
   \data\
   ngram 1=4989
   ngram 2=835668
   ngram 3=12345678

   \1-grams:
   …
   -0.9792 ABC -2.2031
   …
   log10_uniprob(ZWEIG) ZWEIG log10_alpha(ZWEIG)

   \2-grams:
   …
   -0.8328 ABC DEFG -3.1234
   …
   log10_bo_biprob(WAS | ZWEIG) ZWEIG WAS log10_bialpha(ZWEIG,WAS)

   \3-grams:
   …
   -0.234 ABCD EFGHI JKL
   …

   \end\
*/
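To make the layout concrete, here is a minimal reader sketch in Python. It is not taken from the CMU toolkit; the function name parse_arpa and the returned data layout are my own choices for illustration. It assumes the section markers are written with backslashes (\data\, \1-grams:, \end\), that fields are separated by whitespace, and that the trailing back-off weight may be absent, as it is for the highest-order n-grams in the comment.

```python
import re
from collections import defaultdict


def parse_arpa(lines):
    """Parse ARPA-format text (hypothetical helper, not a toolkit API).

    Returns (counts, ngrams) where
      counts maps n-gram order -> count declared in the \\data\\ header
      ngrams maps n-gram order -> {word tuple: (log10 prob, log10 back-off or None)}
    """
    counts = {}
    ngrams = defaultdict(dict)
    section = None  # None before \data\, "data" in the header, int inside \N-grams:

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            section = "data"
            continue
        if line == "\\end\\":
            break
        marker = re.fullmatch(r"\\(\d+)-grams:", line)
        if marker:
            section = int(marker.group(1))
            continue

        if section == "data":
            header = re.fullmatch(r"ngram\s+(\d+)\s*=\s*(\d+)", line)
            if header:
                counts[int(header.group(1))] = int(header.group(2))
        elif isinstance(section, int):
            fields = line.split()
            if len(fields) < section + 1:
                raise ValueError(f"malformed {section}-gram line: {line!r}")
            log_prob = float(fields[0])
            words = tuple(fields[1:section + 1])
            # The back-off weight is optional; assume None when it is missing.
            backoff = float(fields[section + 1]) if len(fields) > section + 1 else None
            ngrams[section][words] = (log_prob, backoff)

    return counts, dict(ngrams)


# Tiny made-up model, just to exercise the parser.
SAMPLE = [
    "\\data\\",
    "ngram 1=2",
    "ngram 2=1",
    "",
    "\\1-grams:",
    "-0.9792 ABC -2.2031",
    "-1.5 DEFG",
    "",
    "\\2-grams:",
    "-0.8328 ABC DEFG",
    "",
    "\\end\\",
]

counts, ngrams = parse_arpa(SAMPLE)
```

Keeping the word sequence as a tuple key makes it easy to look up a shorter history when backing off, though a real reader would probably want a more compact structure for models with millions of entries.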







