ARPA Language Model File Format
The format of an ARPA language model file, as far as I can tell, is not documented outside of the CMU SLM toolkit source code. I’m regurgitating it here in the hope that future Google searches on this topic are more fruitful than mine were.
/* This is the format introduced and first used by Doug Paul.
   Optionally use a given symbol for the UNK word (id==0).
*/

/*
Format of the .arpabo file:
------------------------------
\data\
ngram 1=4989
ngram 2=835668
ngram 3=12345678

\1-grams:
...
-0.9792 ABC -2.2031
...
log10_uniprob(ZWEIG) ZWEIG log10_alpha(ZWEIG)

\2-grams:
...
-0.8328 ABC DEFG -3.1234
...
log10_bo_biprob(WAS | ZWEIG) ZWEIG WAS log10_bialpha(ZWEIG,WAS)

\3-grams:
...
-0.234 ABCD EFGHI JKL
...

\end\
*/
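To make the layout concrete, here is a minimal Python sketch of a reader for files shaped like the comment above. It is my own illustration, not code from the CMU toolkit: the names parse_arpa and SAMPLE are made up, and the sample entries are invented apart from the ABC line. It assumes whitespace-separated fields, with a log10 probability first, then the n-gram's words, then an optional log10 back-off weight (the "alpha" in the comment).

```python
def parse_arpa(text):
    """Parse ARPA-style text into (counts, ngrams).

    counts: {order: declared number of n-grams}
    ngrams: {order: {word_tuple: (log10_prob, log10_backoff_or_None)}}
    """
    counts = {}
    ngrams = {}
    order = None  # None: before \data\; 0: inside \data\; n: inside \n-grams:
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "\\data\\":
            order = 0
            continue
        if line == "\\end\\":
            break
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
            continue
        if order is None:
            continue  # ignore any free text before \data\
        if order == 0:
            if line.startswith("ngram "):
                n, count = line[len("ngram "):].split("=")
                counts[int(n)] = int(count)
            continue
        fields = line.split()
        logp = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The back-off weight is optional (absent on the highest order).
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (logp, backoff)
    return counts, ngrams


# Hypothetical two-order model; only the ABC unigram line comes from the comment.
SAMPLE = r"""
\data\
ngram 1=2
ngram 2=1

\1-grams:
-0.9792 ABC -2.2031
-1.5 DEFG

\2-grams:
-0.8328 ABC DEFG

\end\
"""

counts, ngrams = parse_arpa(SAMPLE)
print(counts)
print(ngrams[1][("ABC",)])
```

The "..." lines in the comment are elisions rather than literal file content, so the sketch does not try to handle them.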








December 8th, 2009 at 12:43 pm
Thanks, there is not much concise documentation on this.
I interpreted these scores the following way, but I am not quite sure whether that's correct:
If you want a score for, say, a trigram, find the longest n-gram in the file that matches your trigram (cutting off the oldest words in the history) and add its left and right scores.
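For comparison, the usual back-off reading of these files, as far as I understand it, is slightly different. If the full n-gram is listed, its left-hand score is used on its own. Otherwise, the right-hand (back-off) score of the history is added to the score of the shortened n-gram, with the oldest word dropped, and a history that is not listed contributes a back-off of 0. Below is a small Python sketch of that lookup. The function name score, the flat table layout and every number in it are hypothetical, chosen only for illustration.

```python
def score(words, table):
    """Return log10 P(last word | preceding words) using back-off.

    table maps word tuples to (log10_prob, log10_backoff); a missing
    back-off weight is stored as 0.0.
    """
    words = tuple(words)
    if words in table:
        return table[words][0]
    if len(words) == 1:
        # Unknown word; real models usually map this to an UNK symbol.
        raise KeyError(words[0])
    history = words[:-1]
    # A history that is not listed contributes a back-off weight of 0.
    backoff = table.get(history, (0.0, 0.0))[1]
    return backoff + score(words[1:], table)


# Hypothetical scores, not taken from any real model.
table = {
    ("ZWEIG",): (-2.0, -0.5),
    ("WAS",): (-1.0, -0.3),
    ("HERE",): (-1.5, 0.0),
    ("ZWEIG", "WAS"): (-0.8, -0.2),
}

# The trigram is not listed, so this backs off twice:
# backoff(ZWEIG WAS) + backoff(WAS) + logprob(HERE)
print(score(("ZWEIG", "WAS", "HERE"), table))
```

A listed n-gram such as ("ZWEIG", "WAS") simply returns its own left-hand score, without adding any back-off weight.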