ARPA Language Model File Format

The format of an ARPA language model file, as far as I can tell, is not documented outside of the CMU SLM toolkit source code. I’m regurgitating it here in the hopes that future Google searches on this topic are more fruitful than mine were. :)

/* This is the format introduced and first used by Doug Paul.
   Optionally use a given symbol for the UNK word (id==0).
*/
/*
Format of the .arpabo file:
------------------------------
\data\
ngram 1=4989
ngram 2=835668
ngram 3=12345678

\1-grams:
...
-0.9792 ABC -2.2031
...
log10_uniprob(ZWEIG) ZWEIG log10_alpha(ZWEIG)

\2-grams:
...
-0.8328 ABC DEFG -3.1234
...
log10_bo_biprob(WAS | ZWEIG) ZWEIG WAS log10_bialpha(ZWEIG,WAS)

\3-grams:
...
-0.234 ABCD EFGHI JKL
...

\end\
*/
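
For anyone who wants to poke at one of these files, here is a rough Python sketch of how I read the layout above. It is not taken from the CMU toolkit: the names parse_arpa, log10_prob and SAMPLE are my own, and the numbers in SAMPLE are a made-up toy model. It assumes whitespace-separated fields, with the trailing back-off weight optional (it is absent for the highest order), and it scores with the usual back-off reading: use the n-gram's own log10 probability if it is listed, otherwise add the context's back-off weight to the score of the shortened n-gram.

```python
# Toy model with invented numbers, laid out as in the comment above.
SAMPLE = r"""
\data\
ngram 1=3
ngram 2=2

\1-grams:
-1.0 A -0.5
-0.7 B -0.3
-0.4 C

\2-grams:
-0.2 A B
-0.6 B C

\end\
"""


def parse_arpa(lines):
    """Return (counts, ngrams); ngrams[n][words] = (log10_prob, log10_backoff)."""
    counts, ngrams, order = {}, {}, None
    for raw in lines:
        line = raw.strip()
        if not line or line == "\\data\\":
            continue
        if line == "\\end\\":
            break
        if line.startswith("ngram "):
            # Header entry such as "ngram 2=835668".
            n, count = line[len("ngram "):].split("=")
            counts[int(n)] = int(count)
        elif line.startswith("\\") and line.endswith("-grams:"):
            # Section marker such as "\2-grams:".
            order = int(line[1:-len("-grams:")])
            ngrams[order] = {}
        elif order is not None:
            fields = line.split()
            words = tuple(fields[1:1 + order])
            # A missing back-off weight is treated as log10(1) = 0.0.
            has_backoff = len(fields) > 1 + order
            backoff = float(fields[1 + order]) if has_backoff else 0.0
            ngrams[order][words] = (float(fields[0]), backoff)
    return counts, ngrams


def log10_prob(ngrams, words):
    """Back-off estimate of log10 P(words[-1] | words[:-1])."""
    words = tuple(words)
    n = len(words)
    if words in ngrams.get(n, {}):
        return ngrams[n][words][0]
    if n == 1:
        # A real model would usually map this to its unknown-word symbol.
        raise KeyError("unknown word: " + words[0])
    # Back-off weight of the context; 0.0 if the context itself is unlisted.
    alpha = ngrams.get(n - 1, {}).get(words[:-1], (0.0, 0.0))[1]
    return alpha + log10_prob(ngrams, words[1:])


counts, ngrams = parse_arpa(SAMPLE.splitlines())
print(counts)
print(log10_prob(ngrams, ["A", "B"]))  # listed bigram: read directly
print(log10_prob(ngrams, ["A", "C"]))  # unlisted: backs off through A to C
```

Everything is kept in plain dicts keyed by word tuples, which is fine for a toy file but will eat memory on a model with millions of n-grams.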

2 Responses to “ARPA Language Model File Format”

  1. ben Says:

    Thanks, there is not much concise documentation on this.
    I interpreted these scores the following way, but I am not quite sure whether that’s correct:

    If you want a score for, say, a trigram, add the left and right scores of the longest n-gram match with your trigram (cutting off the oldest words in the history).

  2. Kyle Says:

    The SRI language model toolkit has a man page with some information on this format.

    http://www.speech.sri.com/projects/srilm/manpages/ngram-format.5.html

<Kered.org>   © Copyright 2000-2005 by Derek Anderson