Ngram Statistics Package
                          =========================
                                 Version 0.53
	 		       January 14, 2002				

                           Copyright (C) 2000-2003

                       Ted Pedersen, tpederse@umn.edu  
                    Satanjeev Banerjee, bane0025@d.umn.edu               
                    Amruta Purandare, pura0010@d.umn.edu               

                       University of Minnesota, Duluth 

                   http://www.d.umn.edu/~tpederse/code.html


1. Introduction:
----------------

The Ngram Statistics Package (NSP) is a suite of programs that aids 
in analyzing Ngrams in text files. We define an Ngram as a sequence 
of 'n' tokens that occur within a window of at least 'n' tokens in 
the text; what constitutes a "token" can be defined by the user.

In earlier versions (v0.1, v0.3, v0.4) this package was known as
the Bigram Statistics Package (BSP). The name change reflects the 
widening scope of the package in moving beyond Bigrams to Ngrams. 

NSP version 0.53 consists of four programs:

Program count.pl takes flat text files as input and generates a list of 
all the Ngrams that occur in those files. The Ngrams, along with their 
frequencies, are output in descending order of their frequency. 

Program statistic.pl takes as input a list of Ngrams with their 
frequencies (in the format output by count.pl) and runs a user-selected 
statistical measure of association to compute a "score" for each Ngram. 
The Ngrams, along with their scores, are output in descending order of 
this score. The statistical score computed for each Ngram can be used to 
decide whether or not there is enough evidence to reject the null 
hypothesis (that the Ngram is not a collocation) for that Ngram. 

Program Utils/rank.pl takes as input two files output by statistic.pl and  
computes the Spearman's rank correlation coefficient on the Ngrams that  
are common to both files. Typically the two files should be produced by   
applying statistic.pl on the same Ngram count file but by using two   
different statistical measures. In such a scenario, the value output by   
rank.pl can be used to measure how similar these the two measures are. A   
value close to 1 would indicate that these two measures rank Ngrams in   
the same order, -1 that the two orderings are exactly opposite to each  
other and 0 that they are not related.

Program Utils/kocos.pl takes as input a file output by count.pl or  
statistic.pl and uses that to identify kth order co-occurrences of a   
given word. A kth order co-occurrence of a target WORD is a word that  
co-occurs with a (k-1)th co-occurrence of the given target WORD. So A is a   
2nd order co-occurrence of X if X occurs with B and B occurs with A.  
Put more concretely in "New York", "New" and "York" co-occur (the are 1st 
order co-occurrences). In "New Jack", "New" and "Jack" co-occur. Thus,
"Jack" and "York" are second order co-occurrences because they both 
co-occur with "New".

Note that as  of this version (0.01) kocos.pl is not particularly  
efficient for larger corpora so we have included a program socs.pl that  
can find 2nd order  co-occurrences very quickly. 

This README continues with an introduction to the basic definitions of
tokens, the tokenization process and the Ngram formation process. This
is followed by a description of the two main programs in this suite 
(count.pl and statistic.pl) and brief notes one how one could typically  
use each of them. The programs rank.pl and kocos.pl are described in 
separate READMEs in the /Utils directory.

2. Tokens:
----------

We define a token as a contiguous sequence of characters that match one of a
set of regular expressions. These regular expressions may be user-provided,
or, if not provided, are assumed to be the following two regular expressions: 

\w+        -> this matches a contiguous sequence of alpha-numeric characters
[\.,;:\?!] -> this matches a single punctuation mark

For example, assume the following is a line of text:

"the stock markets fell by 20 points today!"

Then, using the above regular expressions, we get the following tokens:

    the       stock     markets
    fell      by        20
    points    today     !

Now assume that the user provides the following lone regular expression:

[a-zA-Z]+  -> this matches a contiguous sequence of alphabetic characters

Then, we get the following tokens:

    the       stock     markets
    fell      by        points
    today

3. The Tokenization Process:
----------------------------

Given a text file and a set of regular expressions, the text is "tokenized",
that is, broken up into tokens. To do so, the entire input text is considered
as one long "input string" with new-line characters being replaced by space
characters (this is the default behaviour and can be modified; see point 4
below). Then, the following is done:

while the input string is non empty

    foreach regular expression r
        if r is matched by a sequence of characters starting with the first
        character in the input string...
            quit this for loop
        end if
    end foreach

    if we have a matching regular expression r
        the portion of the input string matched by r is our next token. remove
        this token from the input string.
    else
        remove the first character from the input string
    end if

end while

3.1. Notes:
-----------

3.1.1. In looking for a regular expression that yields a successful match (in
the foreach loop above), we want a regular expression that matches the input
string starting with the first character of the input string. Thus, the
regular expression /b/ matches the input string "be good" but not the input
string " be good".

3.1.2. If none of the regular expressions give a successful match, then the
first character in the input string is removed. This character is considered
a "non-token" and is henceforth ignored.

3.1.3. Since the matching process (the foreach loop above) stops at the first
match, the order in which the regular expressions are tested is important.
The order is exactly the order in which they are provided by the user, or if
the default regular expressions are used, the order in which they are listed
above.

3.2. Examples:
--------------

3.2.1. Example 1:
-----------------

3.2.1.1. Input text:

    why's the stock falling?

3.2.1.2. Regular expressions:

    \w+
    [\.,;:\?!]

3.2.1.3. Resulting tokens: 

    why       s         the    
    stock     falling   ?

3.2.1.4. Explanation:

Initially our input string is the entire input text: "why's the stock
falling?". The first token found is "why" which matches the regular
expression /\w+/. This token is removed, and our input string becomes "'s the
stock falling?".

Now neither of the regular expressions can match the ' character. Thus this
character is considered a non-token and is removed, leaving the input string
like so: "s the stock falling?".

"s" is now matched by /\w+/, and this forms our next token. Upon removing
this token, we get the following input string " the stock falling?".

Again, neither of the regular expressions match this input string, and the
leading space character is removed as a non-token. Similarly the rest of the
line is tokenized to yield the tokens "the", "stock", "falling" and "?".

3.2.2. Example 2:
-----------------

3.2.2.1. Input text:

    why's the stock falling?

3.2.2.2. Regular expressions:

    /fall/
    /falling/
    /stock/

3.2.2.3. Resulting tokens: 

    stock     fall

3.2.2.4. Explanation:

Initially our input string is the entire input text: "why's the stock
falling?". None of the regular expressions match, and we remove the first
character to get as input string the following: "hy's the stock falling?".
Similarly, again the regular expressions don't match, and we have to remove
the first character. This goes on until our input string becomes: "stock
falling?".

Now "stock" matches the regular expression /stock/, and this token is removed,
leaving " falling?" as the input string. Since the space character does not
form a token, it is removed. Now we have "falling?" as our input string.

Now observe that we have two regular expressions, /fall/ and /falling/, both
of which can match the input string. However, since /fall/ appears before
/falling/ in the list, the token formed is "fall". This leaves our input
string as: "ing?". None of the regular expressions match this or any of the
subsequent input strings obtained by removing one by one the first characters.
Hence we get as tokens "stock" and "fall".

3.2.3. Example 3:
-----------------

3.2.3.1. Input text:

    why's the stock falling?

3.2.3.2. Regular expressions:

    /falling/
    /fall/
    /stock/

3.2.3.3. Resulting tokens: 

    stock     falling

3.2.3.4. Explanation:

Observe that this example differs from the previous one only in the order
of the regular expressions. The tokenization proceeds exactly as in the
previous example, until we have as our input string "falling?". Here, we
have /falling/ as our first regular expression, and so we get "falling" as our
token.

Examples 3.2.2 and 3.2.3 demonstrate the importance of the order in which the
regular expressions are provided to the tokenization process.

3.2.4. Example 4:
-----------------

3.2.4.1. Input text:

    why's the stock falling?

3.2.4.2. Regular expressions:

    /the stock/
    /\w+/

3.2.4.3. Resulting tokens:

    why       s       the stock
    falling

3.2.4.4. Explanation:

The thing to note here is that one of the regular expressions has an embedded
space character in it. This causes no problems: our definition of a token
allows embedded space characters in them! Once our input string is "the stock
falling?", the regular expression /the stock/ is matched, and the string "the
stock" forms our next token.

4. Ngrams:
----------

An Ngram is a sequence of n tokens. We shall delimit tokens in an Ngram by
the diamond symbol, i.e. "<>". Thus, "big<>boy<>" is a bigram whose tokens
are "big" and "boy". Similarly, "stock<>falling<>?<>" is a trigram whose
tokens are "stock" and "falling" and "?". "the stock<>falling<>" is a
bigram with tokens "the stock" and "falling".

Given a piece of text, Ngrams are usually formed of contiguous tokens. For
instance, lets take example 3.2.1, where our tokens, in the order in which
they appear in the text, are the following:

    why      s      the      stock      falling      ?

Then, the following are all the bigrams:

    why<>s<>            s<>the<>        the<>stock<>
    stock<>falling<>    falling<>?<>

The following are all the trigrams:

    why<>s<>the<>           s<>the<>stock<>
    the<>stock<>falling<>   stock<>falling<>?<>

The following are all the 4-grams:

    why<>s<>the<>stock
    s<>the<>stock<>falling
    s<>the<>stock<>falling<>?<>

Etcetera.

The Ngrams shown above are all formed from contiguous tokens. Although
this is the default, we also allow Ngrams to be formed from non-contiguous
tokens.

To do so, we first define a "window" of size k to be a sequence of k
contiguous tokens, where the value of k is greater than or equal to
the value of n for the Ngrams. An Ngram can be formed from any
n tokens as long as all the tokens belong to a single window of size
k. Further the n tokens must occur in the Ngram in exactly the same
order as they occur in the window. 

Put another way, given a window of k tokens, we drop k-n tokens from
the window, and what remains is an Ngram! 

Thus for instance, taking example 3.2.1 again, recall that our tokens
in the order in which they occur in the text are the following: 

    why      s      the      stock      falling      ?

Then, the following are all the bigrams with a window size of 3:

    why<>s<>               why<>the<>         s<>the<>    
    s<>stock<>             the<>stock<>       the<>falling<>
    stock<>falling<>       stock<>?<>         falling<>?<>

The following are all the bigrams with a window size of 4:

    why<>s<>               why<>the<>         why<>stock<>
    s<>the<>               s<>stock<>         s<>falling<>
    the<>stock<>           the<>falling<>     the<>?<>
    stock<>falling<>       stock<>?<>         falling<>?<>

The following are all the trigrams with a window size of 4:

    why<>s<>the<>          why<>s<>stock<>     why<>the<>stock<>
    s<>the<>stock<>        s<>the<>falling<>   s<>stock<>falling<>
    the<>stock<>falling<>  the<>stock<>?<>     the<>falling<>?<>
    stock<>falling<>?<>


Etc. 

5. Program count.pl:
--------------------

This program takes as input a flat ASCII text file and outputs all
Ngrams, or token sequences of length 'n', where the value of 'n' can
be decided by the user. Non-contiguous Ngrams within a window of size
'k' as described above can also be found and output. For every output
Ngram, its frequency of occurrence as well as the frequencies of all
the combinations of the tokens it is made up of are output. Details
follow.

5.1. Default Way to Run count.pl:
---------------------------------

The most basic way of running this program is the following: 

Example 5.1: count.pl output.txt input.txt 

where input.txt is the input text file in which to find the Ngrams and
output.txt is the output file into which count.pl will put all the
Ngrams with their frequencies. 

5.2. Changing the Length of Ngrams and the Size of the Window:
--------------------------------------------------------------

Several default values are in use when the program is run this
way. For example it is assumed that one is counting bigrams, that is
the value of 'n' is 2. This can be changed by using the option --ngram
N, where 'N' is the number of tokens you want in each Ngram. Thus, to
find all trigrams in input.txt, run count.pl thus:

Example 5.2: count.pl --ngram 3 output.txt input.txt

Another default value in use is the window size. Window size defaults
to the value of 'n' for Ngrams. Thus, in example 5.1 the window size
was 2 while in example 5.1, because of the --ngram 3 option , the
window size was 3. This can be changed using the --window N
option. Thus, for example to find all bigrams within windows of size
3, one would run the program like so:

Example 5.3a: count.pl --window 3 output.txt input.txt

Similarly, to find all trigrams within a window of size 4: 

Example 5.3b: count.pl --ngram 3 --window 4 output.txt input.txt

5.3. Using User-Provided Token Definitions:
-------------------------------------------

In all these examples, the tokenization and Ngram formation proceeds
as described in sections 3 and 4 above. In these examples, the default
token definitions are used: 

\w+        -> this matches a contiguous sequence of alpha-numeric characters
[\.,;:\?!] -> this matches a single punctuation mark

As mentioned previously, these default token definitions can be
over-ridden by using the option --token FILE, where FILE is the name
of the file containing the regular expressions on which the token
definitions will be based. Each regular expression in this FILE should
be on a line of its own, and should be delimited by the forward slash
'/'. Further, these should be valid Perl regular expressions, as
defined in [1], which means for example that any occurrence of the
forward slash '/' within the regular expression must be 'escaped'. 

5.4 Removing character strings via --nontoken option:
-----------------------------------------------------

This option allows a user to define regular expressions that will match 
strings that should not be considered as tokens. These strings will be 
removed from the data and not counted or included in Ngrams. 

The --nontoken option is recommended when there are predictable sequences 
of characters that you know should not be included as tokens for purposes 
of counting Ngrams, finding collocations, etc. 

For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in 
text being processed, you may want to include those in your list of
nontoken items so they are discarded. If not, a simple regex such as
/\w+/ will match with 's', 'p', 'item', 'ptr' from these tags, leading
to confusing results. 

The --nontoken option on the command line should be followed by a file 
name (NON_TOKEN). This file should contain Perl regular expressions  
delimited by forward slashes '/' that define non-tokens. Multiple 
expressions may be placed on separate lines or be separated via the '|'  
(Perl 'or') as in /regex1|regex2|../ 

The following are some of the examples of valid non-token definitions. 

/<\/?s|p>/ : will remove xml tags like <s>, <p>, </s>, </p>. 
/\[\w+\]/  : will remove all words which appear in square brackets like  
	       [p], [item], [123] and so on. 

count.pl will first remove any string from the input data that matches the  
non-token regular expression, and only then will match the remaining data 
against the token definitions. Thus, if by chance a string matches both  
the token and nontoken definitions, it will be removed as  --nontoken has  
a higher priority than --token or the default token definition. 

5.5. The Output Format of count.pl:
-----------------------------------

Assume that the following are the contents of the input text file to
count.pl; let us call the file test.txt: 

first line of text
second line
and a third line of text

Further assume that count.pl is run like so: 

count.pl test.cnt test.txt 

Thus, test.cnt will have all the bigrams found in file test.txt using
a window size of 2 and using the two default tokens as
above. Following then are the contents of file test.cnt:

11
line<>of<>2 3 2 
of<>text<>2 2 2 
second<>line<>1 1 3 
line<>and<>1 3 1 
and<>a<>1 1 1 
a<>third<>1 1 1 
first<>line<>1 1 3 
third<>line<>1 1 3 
text<>second<>1 1 1 

The number on the first line, 11, indicates that there were total 11
bigrams in the input file. 

From the next line onwards, the various bigrams found are listed. Recall  
that the tokens of the Ngrams are delimited by the diamond signs: <>.  
Thus the bigram on the first line is line<>of<>, made up of the tokens  
"line" and "of" in that order; the bigram on the second line is  
of<>text<>, made up of the tokens "of" and "text", etc. 

After the diamond following the last token there are three numbers. The  
first of these numbers denotes the number of times this Ngram occurs in   
the input text file. Thus bigram line<>of<> occurs 2 times in the input  
file, as does bigram of<>text<>. The second number denotes in how many  
bigrams the token "line" occurs as the left-hand-token. In this case,  
"line" occurs on the left of three bigrams, namely two copies of bigram  
"line<>of" and the bigram "line<>and<>". Similarly, the third number  
denotes the number of bigrams in which the word "of" occurs as the  
right-hand-token. In this case, "of" occurs on the right of two bigrams,  
namely the two copies of the bigram "line<>of<>". 

Similar output is obtained for trigrams. Assume again that the input file  
is above, and assume that count.pl is run thusly:

count.pl --ngram 3 test.cnt test.txt 

The output test.cnt file is as follows: 

10
line<>of<>text<>2 3 2 2 2 2 2 
and<>a<>third<>1 1 1 1 1 1 1 
third<>line<>of<>1 1 3 2 1 1 2 
second<>line<>and<>1 1 3 1 1 1 1 
line<>and<>a<>1 3 1 1 1 1 1 
a<>third<>line<>1 1 1 2 1 1 1 
text<>second<>line<>1 1 1 2 1 1 1 
of<>text<>second<>1 1 1 1 1 1 1 
first<>line<>of<>1 1 3 2 1 1 2 

Once again, the number on the first line says that there are 10 trigrams  
in the input text file. The first trigram in the list is 
"line<>of<>text<>" made up of the tokens "line", "of" and "text" in that  
order. Similarly, the next trigram is "and<>a<>third<>" made of the tokens  
"and", "a" and "third". 

Observe that this time there are more numbers after the last token. The  
first number denotes, as before, the number of times this trigram occurs  
in the input text file. Thus, "line<>of<>text" occurs twice in the input  
file while "and<>a<>third" occurs just once. The second, third and fourth  
numbers denote the number of trigrams in which the tokens "line", "of" and  
"text" appear in the first, second and third positions respectively. Thus,  
"line" occurs as the token in the first position in 3 trigrams, namely 2  
copies of "line<>of<>text<>" and one copy of "line<>and<>a<>". Similarly,  
the tokens "of" and "text" appear as the second and third tokens 
respectively of two bigrams, namely the two copies of "line<>of<>text<>". 

The fifth number denotes the number of bigrams in which "line" occurs as  
the first token and "of" occurs as the second token. Once again, there are  
only two trigrams in which this happens: the two copies of 
"line<>of<>text<>". The sixth number denotes the number of bigrams in 
which "line" occurs as the token in the first place and "text" occurs as  
the token in the third place. The seventh number denotes the number of  
bigrams in which "of" occurs as the token in the second place and "text" 
occurs as the token in the third place. 

In general, assume we are dealing with Ngrams of size 'n'. Given an 
Ngram, denote its leftmost token as w[0], the next token as w[1], and
so on until w[n-1]. Further let f(a, b, ..., c) be the number of Ngrams  
that have token w[a] in position a, token w[b] in position b, ... and  
token w[c] in position c, where 0 <= a < b < ... < c < n. 

Then, given an ngram, the first frequency value reported is f(0, 1,
..., n-1). 

This is followed by n frequency values, f(0), f(1), ..., f(n-1). 

This is followed by (n choose 2) values, f(0, 1), f(0, 2), ..., f(0,
n-1), f(1, 2), ..., f(1, n-1), ... f(n-2, n-1). 

This is followed by (n choose 3) values, f(0, 1, 2), f(0, 1, 3), ...,
f(0, 1, n-1), f(0, 2, 3), ..., f(0, 2, n-1), ..., f(0, n-2, n-1), ...,
f(1, 2, 3), ..., f(n-3, n-2, n-1). 

And so on, until (n choose n-1), that is n, frequency values f(0, 1,
..., n-2), f(0, 1, ..., n-3, n-1), f(0, 1, ..., n-4, n-2, n-1), ...,
f(1, 2, ..., n-1).

This gives us a total of 2^n-1 possible frequency values. We call each
such frequency value a "frequency combination", since it expresses the
number of Ngrams that has a given combination of one or more tokens in
one or more fixed positions. By default all such combinations are printed,  
exactly in the order showed above. To see which combinations are being  
printed one could use the option --get_freq_combo FILE. This prints to the  
file the inputs to the imaginary 'f' function defined above exactly in the  
order the frequency values occur in the main output. Thus for instance, 
running the program like so: 

count.pl --get_freq_combo freq_combo.txt test.cnt test.txt 

Assuming that test.txt file is the one shown above, the following
output is created in file freq_combo.txt:

0 1 
0 
1 

and the following output in file test.cnt: 

11
line<>of<>2 3 2 
of<>text<>2 2 2 
second<>line<>1 1 3 
line<>and<>1 3 1 
and<>a<>1 1 1 
a<>third<>1 1 1 
first<>line<>1 1 3 
third<>line<>1 1 3 
text<>second<>1 1 1 

Recall that since the option --ngram is not being used, the default
value of n, 2, is being used here. After each bigram in the test.cnt
file are three numbers; the first number corresponds to f(0, 1), the
second number corresponds to f(0) and the third to f(1). Observe that
line 'i' of the output in file freq_combo.txt file represents the
input to the imaginary 'f' function that creates the 'i_th' frequency
value on each line of the output in file test.cnt.

Similarly, running the program thus: 

count.pl --ngram 3 --get_freq_combo freq_combo.txt test.cnt test.txt 

produces the following output in freq_combo.txt:

0 1 2 
0 
1 
2 
0 1 
0 2 
1 2 

and the following output in file test.cnt

10
line<>of<>text<>2 3 2 2 2 2 2 
and<>a<>third<>1 1 1 1 1 1 1 
third<>line<>of<>1 1 3 2 1 1 2 
second<>line<>and<>1 1 3 1 1 1 1 
line<>and<>a<>1 3 1 1 1 1 1 
a<>third<>line<>1 1 1 2 1 1 1 
text<>second<>line<>1 1 1 2 1 1 1 
of<>text<>second<>1 1 1 1 1 1 1 
first<>line<>of<>1 1 3 2 1 1 2 

The seven numbers after each trigram in file test.cnt correspond
respectively to f(0, 1, 2), f(0), f(1), f(2), f(0, 1), f(0, 2) and
f(1, 2), as shown in the file freq_combo.txt.

It is possible that the user may not require all the frequency values
output by default, or that the user requires the frequency values in a
different order. To change the default frequency values output, one
may provide count.pl with a file containing the inputs to the 'f'
function using the option --set_freq_combo. 

Thus for instance, if the user wants to create trigrams, and only
requires the frequencies of the trigrams and the frequency values of
the three tokens in the trigrams (and not of the pairs of tokens),
then he may create the following file (say, user_freq_combo.txt):

0 1 2
0
1
2

and provide this file to the count.pl program thus: 

count.pl --ngram 3 --set_freq_combo user_freq_combo.txt test.cnt test.txt

this produces the following test.cnt file: 

10
line<>of<>text<>2 3 2 2 
and<>a<>third<>1 1 1 1 
third<>line<>of<>1 1 3 2 
second<>line<>and<>1 1 3 1 
line<>and<>a<>1 3 1 1 
a<>third<>line<>1 1 1 2 
text<>second<>line<>1 1 1 2 
of<>text<>second<>1 1 1 1 
first<>line<>of<>1 1 3 2 

Observe that the only difference between this output and the default
output is that instead of reporting 7 frequency values per ngram, only
the 4 requested are output. 

5.6. "Stopping" the Ngrams: 
---------------------------

The user may "stop" the Ngrams formed by count.pl by providing a list
of stop-tokens through the option --stop FILE. Each stop token in FILE 
should be a Perl regular expression that occurs on a line by itself. 
This expression should be delimited by forward slashes, as in /REGEX/.
All regular expression capabilities in Perl are supported except for
regular expression modifiers (like the "i" /REGEX/i). 

The following are a few examples of valid entries in the stop list.

/^\d+$/
/\bthe\b/
/\b[Tt][Hh][Ee]\b/
/^and$/
/\bor\b/
/^be(ing)?$/

There are two modes in which a stop list can be used, AND and OR. The 
default mode is AND, which means that an Ngram must be made up entirely 
of words from the stoplist before it is eliminated. The OR mode eliminates 
an Ngram if any of the words that make up the Ngram are found in the 
stoplist.

The mode is specified via an extended option that should appear on the
first line of the stop file. For example,

@stop.mode=AND
/^for$/
/^the$/
/^\d+$/

would eliminate bigrams such as 'for the', 'for 10', etc.  (where both
elements of the bigram are from the stop list.) But will not remove bigrams
like '10 dollars' or 'of the'.

@stop.mode=OR
/^for$/
/^the$/
/^\d+$/

would eliminate bigrams such as 'for our', '10 dollars', etc. (where 
at least one element of the bigram is from the stop list).

If the @stop.mode= option is not specified, the default value is AND.

In both modes, Ngrams that are eliminated do not add to the  various
Ngram and individual word frequency counts. Ngrams that are "stoplisted"
are treated as if they never existed and are not counted.

5.6.1 Usage Notes for Regular Expressions in Stop Lists:
--------------------------------------------------------

(1) In Perl regular expressions, '\b' specifies word boundary and '^' and  
'$' specify the start and end of a line. However, these have the same  
effect when used in a stop list for count.pl. As a result, if you want 
to exactly match a particular word, you can use either/\bregex\b/ or   
'/^regex$/'. Despite this equivalence, We recommend the use of \b' since  
in a stop list it makes more sense to think in terms of word boundaries. 

(2) If instead of /^the$/, regex /the/ is used as a stop regex, then every  
token that matches /the/ will be removed. So tokens like 'there', 'their', 
'weather','together' will be excluded with the stop regex /the/. On the  
other hand, with the regex /^the$/, all occurrences of only word 'the' 
will  be removed. 

(3) You can also use a stop regex /^the/ to remove tokens that begin with
'the' like 'their' or 'them' but not 'together'. Similarly, stop regex 
/the$/ will remove all tokens which end in 'the' like 'swathe' or 'tithe' 
but not 'together' or 'their'.

(4) Please note that if you use a stoplist with version 0.53 that you 
used with an earlier version, it will not behave the same way! In earlier 
versions when you specified /regex/ as a stoplist item, we assumed
that you really meant /\bregex\b/ and proceeded according. However, since 
regular expressions are now fully supported we require that you specify 
exactly what you mean. So if you include /is/ as a member of your 
stoplist, we will now assume that you mean any word that contains 'is' 
somewhere within in (like 'this' or 'kiss' or 'isthmus' ...) To preserve
the functionality of your old stoplists, simply convert them from 

/the/
/is/
/of/

to 

/\bthe\b/
/\bis\b/
/\bof\b/

(6) regex modifiers like i or g which come after the end slash like:

/regex/i 
/regex/g

are not supported. See FAQ.txt for an explanation.

This makes it slightly inconvenient to specify that you would like
to stop any form of a given word. For example, if you wanted to 
stop 'THE', 'The', 'THe', etc. you would have to specify a regex
such as

/[Tt][Hh][Ee]/

5.6.2. Differences between --nontoken and --stop 
------------------------------------------------

In theory we can remove "unwanted" words using either the --nontoken  
option or the --stop option. However, these are rather different  
techniques. 

--stop only removes stop words after they are recognized as valid tokens. 
Thus, if you wish to remove some markup tags like [p]  or [item] from the 
data using a stop list, you first need to recognize these as tokens (via  
a --token definition like /\[\w+\]/) and then remove them with a --stop 
list. 

In addition, the --stop option operates on an Ngram and does not 
remove individual words. It removes Ngrams (and reduces the count of the 
number of Ngrams in the sample). In other words, the --stop option only 
comes into effect after the Ngrams have been created. 

On the other hand, the --nontoken option eliminates individual occurrence of 
a non-token sequence before finding Ngrams. 

Some examples to clarify the distinction between --stop and --nontoken 
-----------------------------------------------------------------------

Consider an input file count.input => 

	[ptr] <s> this is a test written for count.pl </s> [/ptr]
	their them together wither tithe

NontokenFile nontoken.regex => 

	/\[\/?\w+\]/
	/<\/?\w+>/

case (a) StopFile stopfile.txt => /the/
----------------------------------------

Running count.pl with the command :

count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

will first remove all nontokens from the input file. Hence the tokenized 
text from which the bigrams will be created will be =>
	
	this is a test written for count.pl 
	their them together wither tithe

Since the StopFile contains /the/ all tokens which include 'the' are 
eliminated. Thus, the bigrams:

their<>them<>
them<>together<>
together<>wither<>
wither<>tithe<> 

will all be removed. This is because each word in each bigram contains 
"the" and the default stop mode is AND. Note that if there was a bigram  
such as "on<>their<>" it would not  be removed since both words to not  
match the stoplist. The output file count.out will contain the following:

count.out=>

9
test<>written<>1 1 1
this<>is<>1 1 1
a<>test<>1 1 1
is<>a<>1 1 1
for<>count<>1 1 1
.<>pl<>1 1 1
count<>.<>1 1 1
written<>for<>1 1 1
pl<>their<>1 1 1
		
case (b) StopFile stopfile.txt => /^the/
----------------------------------------

Running count.pl with the command:

count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

will first remove all nontokens from the input file. The tokenized text 
will be:

        this is a test written for count.pl
        their them together wither tithe

Since the StopFile contains /^the/, all tokens which begin with "the"
are eliminated. Thus, the bigram 

their<>them<>

will be removed since it consists of two words that begin with "the". The  
output file count.out will contain the 12 bigrams as shown below. 

count.out=>

12
test<>written<>1 1 1
this<>is<>1 1 1
a<>test<>1 1 1
is<>a<>1 1 1
for<>count<>1 1 1
them<>together<>1 1 1
.<>pl<>1 1 1
count<>.<>1 1 1
written<>for<>1 1 1
pl<>their<>1 1 1
wither<>tithe<>1 1 1
together<>wither<>1 1 1

case (c) StopFile stopfile.txt => @stop.mode=OR 
				  /the$/
------------------------------------------------

Running count.pl with the command:

count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

will first remove all nontokens from the input file. Hence the tokenized 
text will be:

        this is a test written for count.pl
        their them together wither tithe

As the StopFile contains /the$/ all tokens which end in 'the' are stop 
words. Thus, in the bigram

wither<>tithe<> 

"tithe" will match the stoplist since it ends with "the". However, this 
bigram will be eliminated since the stop mode is OR (meaning that if 
either word is in the stop list then the bigram is eliminated). The output 
file count.out will contain the 12 bigrams as shown below. 

count.out=>

12
test<>written<>1 1 1
this<>is<>1 1 1
a<>test<>1 1 1
is<>a<>1 1 1
for<>count<>1 1 1
them<>together<>1 1 1
.<>pl<>1 1 1
their<>them<>1 1 1
count<>.<>1 1 1
written<>for<>1 1 1
pl<>their<>1 1 1
together<>wither<>1 1 1

5.7. Removing and Not Displaying Low Frequency Ngrams:
------------------------------------------------------

We allow the user to either remove or to not display low frequency
Ngrams. The user can remove low frequency Ngrams by using the option 
--remove N by which all Ngrams that occur less than n times are
removed. The Ngram and the individual frequency counts are adjusted
accordingly upon the removal of these Ngrams. 

The user can choose not to display low frequency Ngrams by using the
option --frequency N, by which Ngrams that occur less than n times
are not displayed in the output. Note that this differs from the
--remove option above in that the various frequency counts are not
changed. Intuitively, we continue to believe that these Ngrams have
occurred in the text - we are simply not interested in looking at
them. By contrast, in the --remove option we want to actually think
that the Ngrams didn't occur in the text in the first place, and so we
want our numbers to agree to that too!

5.8. Extended Output: 
---------------------

Observe that one may modify the actual counting process in various ways
through the various options above. To keep a "record" of which
option were used and with what values, one can turn the "extended"
output on with the switch --extended. The extended output records the
size of the Ngram, the size of the window, the frequency value at
which the Ngrams were removed and a list of all the source files used
to create the count output. If a switch was not used, the default
value is printed.

5.9. Histogram Output: 
----------------------

The user can also generate a "histogram" output by using the
--histogram FILE option. This histogram output shows how many times
Ngrams of a certain frequency has occurred. Following is a typical
line out of a histogram output:

Number of n-grams that occurred   5 time(s) =    14 (40.94 percent)

This says that there were 14 distinct Ngrams that occurred 5 times
each, and between themselves they make up around 41% of the total
number of Ngrams.

5.10. Searching for Source Files in Directories, Recursively if Need Be: 
-----------------------------------------------------------------------

One would usual provide a source file to create Ngrams from. One
could also provide a directory name - all text files from the
directory are used to create Ngrams from. Along with a directory name
if one also uses the switch --recurse, all subdirectories inside the
source directory are searched for text files recursively, and all text
files so found are used to create Ngrams from.

6. Program statistic.pl:
------------------------

Program statistic.pl takes as input a list of Ngrams with their
frequencies in the format output by count.pl and runs a user-selected
statistical measure of association to compute a "score" for each Ngram.  
The Ngrams, along with their scores, are output in descending order of  
this score. 

The statistical measures of association are implemented separately in
separate Perl packages (files ending with .pm extension). When running
statistic.pl, the user needs to provide the name of a statistical
measure (either from among the ones provided as a part of this
distribution or those written by the user). Say the name of the
statistic provided by the user is X. Program statistic.pl will then
look for Perl package X.pm (in the current directory, or, failing
that, the system path). If found, this Perl package file will be
loaded and then used to calculate the statistic on the list of Ngrams
provided. 

Please remember to include the path of Measures Directory (in the main 
NSP Package directory) in your system path. This will enable the statistic.pl 
program to find the modules provided with this package. 

As a part of this distribution, we provide the following statistical
packages: dice, log-likelihood (ll), mutual information (mi), the
chi-squared test (x2), and the left-fisher test of associativity
(leftFisher). All these packages follow a fixed set of rules as
discussed below. It is hoped that these rules are easy to follow and
that new packages may be written quickly and easily.  

In a sense, program statistic.pl is framework. Its job is to take as
input Ngrams with their frequencies, to provide those frequencies to
the statistical library and to format the output from that library. The
heart of the statistical measure - the actual calculation - lies in
the library that can be plugged in. This framework allows for quickly
rigging up new measures; to do so one need worry only about the actual
calculation, and not of the various mundane issues that are taken care
of by statistic.pl.

This section follows with details on how to run statistic.pl, and then
the format of the libraries and tips on how to write them. 

6.1. Default Way to Run statistic.pl:
-------------------------------------

The default way to run statistic.pl is so:

statistic.pl dice test.dice test.cnt

where: dice      is the name of the statistic library to be loaded.
       test.dice is the name of the output file in which the results
		 of applying the dice coefficient will be stored.
       test.cnt  is the name of the input file containing the Ngrams
		 and their various frequency values. 


A Perl package with filename dice.pm is searched for, first in the
current directory and then in all directories pointed to by the system
path (not the Perl @INC path). Instead of writing just "dice" on the
command line, one may also write the file name "dice.pm".

Once such a file is found, it is exported into statistic.pl and two
tests are done to see if this file has the minimum requirements for a
statistical library (more details below). If these tests fail,
statistic.pl stops with an error message. Otherwise the library is
initialized and then for each Ngram in file test.cnt, its frequency
values are passed to it and its calculated value is noted. Finally,
when all values have been calculated, the Ngrams are sorted on their
statistic value and output to file test.dice. 

For example, assume our input test.cnt file is this: 

11
line<>of<>2 3 2 
of<>text<>2 2 2 
second<>line<>1 1 3 
line<>and<>1 3 1 
and<>a<>1 1 1 
a<>third<>1 1 1 
first<>line<>1 1 3 
third<>line<>1 1 3 
text<>second<>1 1 1 

Thus there are 11 bigrams, the first of which is "line<>of<>", the
second "of<>text<>" etc. 

Running statistic.pl thusly: statistic.pl dice test.dice test.cnt 
will produce the following test.dice file:

11
of<>text<>1 1.0000 2 2 2
and<>a<>1 1.0000 1 1 1
a<>third<>1 1.0000 1 1 1
text<>second<>1 1.0000 1 1 1
line<>of<>2 0.8000 2 3 2
third<>line<>3 0.5000 1 1 3
line<>and<>3 0.5000 1 3 1
second<>line<>3 0.5000 1 1 3
first<>line<>3 0.5000 1 1 3

Once again, the first number is the total number of bigrams - 11. On
the next line is the highest ranked bigram "of<>text<>". The first
number following this bigram, 1, is its rank. The next number, 1.0000,
is its value computed using the dice statistic. The final three
numbers are exactly the numbers associated with this Ngram in the
test.cnt file.

Observe that three other bigrams also have the same score of 1.000 and
so the same rank 1. The bigram with the next highest score of 0.8000,
"line<>of<>", is ranked 2nd instead of 5th. This is a feature of our
ranking mechanism; the fact that a bigram has a rank 'r' implies that
there are r-1 distinct scores greater than the score of this Ngram. It
does not imply that there are r-1 bigrams with higher scores. 

6.2. Changing the Default Ngram Size:
-------------------------------------

By default, the Ngrams in the input file are assumed to be
bigrams. This can however be changed by using the option 
--ngram. Given an Ngram size (either by default or by using the
--ngram option), statistic.pl checks if there are exactly the correct
number of tokens in each Ngram. If this is not true, an error is
printed and statistic.pl halts.

6.3. Defining the Meaning of the Frequency Values:
--------------------------------------------------

The "meaning" of the various frequency values after each Ngram in the
input file is important in that the statistic calculated depends on
them. By default, the default meanings as defined by count.pl are
assumed. Thus for example, for bigrams three frequency values are
expected in the order (0 1), (0), (1). Similarly for trigrams seven
frequency values are expected in the order (0 1 2), (0), (1), (2), (0
1), (0 2), (1 2), etc. 

The frequency combinations being used can be output to a file by using
the option get_freq_combo. 

If count.pl was run with a set of user-defined frequency combinations
different from the defaults, then the file containing these frequency
combinations must be provided to statistic.pl using the option 
set_freq_combo. 

If the number of frequency values does not match the number expected
(either through the default frequency combinations or through the user
defined ones provided through the set_freq_combo option) then an error
is reported. Besides checking that the number of frequency values is
correct, nothing else is checked. 

6.4. Modifying the Output of statistic.pl:
------------------------------------------

One may request statistic.pl to ignore all Ngrams which have a
frequency less than a user-defined threshold by using the --frequency
option. To be able to do this however, the Ngram frequency should be
present among the various frequency values in the input Ngram file. It
is possible to set up a frequency combination file that prevents
count.pl from printing the actual frequency of each Ngram; if such a
file is given to statistic.pl, the frequency cut-off requested through
option --frequency will be ignored and a warning issued to that
effect. 

Once the statistical values for the Ngrams are calculated and the
Ngrams have been ranked according to these values, one may request not
to print Ngrams below a certain rank. This can be done using the
option --rank. Unlike the frequency cut-off above, all calculations
are done and then Ngrams that fall below a certain rank are
cut-off. In the frequency cut-off, calculations are not performed on
the Ngrams that are ignored. 

The value returned by the statistic libraries may be floating point
numbers; by default 4 places of decimal are shown. This can be changed
by using the option --precision through which the user can decide how
many places of decimal he wishes to see. Note that the values returned
by the library are rounded to the places of decimal requested by the
user, and THEN the ranking is done. Thus two Ngram that actually have
different scores, but whose scores both round up to the same number
for the given precision will get the same rank!

The user can also use the statistical score to cut off Ngrams. Thus,
using the option --score, one may request statistic.pl to not print
Ngrams that get a score less than the given threshold. 

Similar to count.pl, the user can request statistic.pl to print
extended information by using the --extended switch. Without this
switch, all extended information already in the input file will be
lost; with it, they will all be preserved and new extended data will
be output. 

The output of statistic.pl is not formatted for human eyes - this can
be done using the switch --format. Columns will be aligned as much as
possible and the output is (often) neater than the default output. 

6.5. The Measures of Association Provided in This Distribution:
---------------------------------------------------------------

We provide the 10 measures of association with this distribution. Nine are 
suitable for use with bigrams and one may be used with trigrams. 

The bigram measures are:

1) Dice Coefficient (dice.pm)
2) Fishers exact test - left sided (leftFisher.pm)
3) Fishers exact test - right sided (rightFisher.pm)
4) Log-likelihood ratio (ll.pm)
5) Mutual Information (tmi.pm)
6) Pointwise Mutual Information (pmi.pm)
7) Odds Ratio (odds.pm)
8) Phi Coefficient (phi.pm)
9) T-score (tscore.pm)
10) Pearson's Chi Squared Test (x2.pm)

The trigram measure is:

1) Log-likelihood ratio (ll3.pm)

Any of these measures can be used as follows:

statistic.pl XXXX output.txt input.txt 

where XXXX is the name of the .pm file (as in XXXX.pm)

More information on how to write a new statistic library is provided
in file Docs/NewStats.txt. A few additional details about the
Measures can be found in Measures/README.txt.

7. Referencing:
---------------

If you write a paper that has used NSP in some way, we'd certainly be 
grateful if you sent us a copy and referenced NSP. We have a published
paper about NSP that could provide a suitable reference:

@inproceedings{BanerjeeP03,
        author = {Banerjee, S. and Pedersen, T.},
        title = {The Design, Implementation, and Use of the 
		{N}gram {S}tatistic {P}ackage},     
	booktitle = {Proceedings of the Fourth International 
		Conference on Intelligent Text Processing and
		Computational Linguistics},
	year = {2003},
        month ={February},
        address = {Mexico City}}

This paper is found in Docs/cicling2003.(ps|pdf)

8. Copying:
-----------

This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by the
Free Software Foundation; either version 2 of the License, or (at your option) 
any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with 
this program; if not, write to the Free Software Foundation, Inc., 59 Temple 
Place - Suite 330, Boston, MA  02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file 
GPL.txt that you should have received with this distribution. 

9. Acknowledgments:
-------------------

This work has been partially supported by a National Science Foundation 
Faculty Early CAREER Development award (\#0092784) and by a Grant-in-Aid   
of Research, Artistry and Scholarship from the Office of the Vice   
President for Research and the Dean of the Graduate School of the  
University of Minnesota.

10. References:
--------------

1. L. Wall, T. Christiansen and J. Orwant. 2000. Programming
   Perl. O'Reilly.