README for kocos.pl, version 0.01
			   =================================

			    Copyright (C) 2002-2003

			Amruta Purandare, pura0010@umn.edu
                          Ted Pedersen, tpederse@umn.edu

                         University of Minnesota, Duluth

----------------
1. Introduction
----------------

This program finds the Kth order co-occurrences of a given word. 

------------------------------------------------------------------------
WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING: 

For even modestly sized corpora (a few thousand tokens) kocos.pl
can be *very* slow, especially for values of K greater than 2. In such
cases please reduce the sample size as much as possible via stop lists
(--stop option in count.pl) and via the elimination of low frequency
bigrams (--reduce option in count.pl). The stop list is particularly
important because high frequency words such as "the" or "is" co-occur with
many different words and greatly expand the search needed to find Kth
order co-occurrences.

We have included the program socs.pl as a more efficient alternative
for larger corpora. This will find 2nd order co-occurrences very quickly.
Please consult README.socs.txt for further details.

WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING WARNING: 
------------------------------------------------------------------------

Please note that this program requires the Set::Scalar module from the
CPAN archive (http://search.cpan.org).

1.1 What are Kth order Co-occurrences?
--------------------------------------
Co-occurrences are words that occur together in the same context. All
words that co-occur with a given target word are called its co-occurrences.
The concept of 2nd order co-occurrences is explained in the paper Automatic
Word Sense Discrimination [Schutze98]. According to this paper, the words
that co-occur with the co-occurring words of a target word are called the
2nd order co-occurrences of that word.

So with each increasing order of co-occurrences, we introduce an extra level 
of indirection and find words co-occurring with the previous order 
co-occurrences.  

We generalize the concept of 2nd order co-occurrences from [Schutze98] to find
the Kth order co-occurrences of a word. These are the words that co-occur 
with the (K-1)th order co-occurrences of a given target word.
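
The definition above can be illustrated with a minimal Python sketch. This is
not the code used by kocos.pl. The function name kth_order_cooccurrences is
hypothetical, co-occurrence is treated as symmetric, and the sketch assumes
that a word already reached at a lower order is not reported again at a
higher order. The word pairs are those of the test.input file in Section 8.

```python
from collections import defaultdict


def kth_order_cooccurrences(pairs, word, k):
    """Return the set of Kth order co-occurrences of `word`.

    `pairs` is an iterable of (word1, word2) tuples. Assumption of this
    sketch: words already reached at a lower order are not reported again.
    """
    # Build a symmetric co-occurrence map: word -> set of co-occurring words.
    neighbors = defaultdict(set)
    for w1, w2 in pairs:
        neighbors[w1].add(w2)
        neighbors[w2].add(w1)

    seen = {word}
    current = {word}  # the 0th order co-occurrence is the word itself
    for _ in range(k):
        reached = set()
        for w in current:
            reached |= neighbors.get(w, set()) - seen
        seen |= reached
        current = reached
    return current


pairs = [
    ("print", "in"), ("print", "line"), ("text", "the"), ("text", "line"),
    ("file", "the"), ("file", "in"), ("line", "file"),
]
print(sorted(kth_order_cooccurrences(pairs, "line", 1)))  # ['file', 'print', 'text']
print(sorted(kth_order_cooccurrences(pairs, "line", 2)))  # ['in', 'the']
```

Each pass of the loop adds one level of indirection, which is why the search
grows quickly when high frequency words are left in the input.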

---------
2. Usage
---------

Usage: kocos.pl [OPTIONS] SOURCE WORD

---------
3. Input 
---------

3.1 SOURCE 
-----------
Specify the SOURCE file name on the command line after the program name and 
options (if any) as shown in the usage note. 

SOURCE should be output (normal or extended) created by the count.pl or
statistic.pl programs for bigrams. When count.pl and statistic.pl are run to
create bigrams (--ngram set to 2 or not specified), they list the
bigrams of all words that co-occur. So if a bigram
'word1<>word2<>' is listed in the output of count.pl or statistic.pl,
the words word1 and word2 are co-occurrences of each other.

If you want to run kocos.pl on a SOURCE that was not created by the count.pl
or statistic.pl programs of this package, make sure that each line of SOURCE
lists two words WORD1 and WORD2 as
WORD1<>WORD2<>
The program minimally requires that there are exactly two words, separated by
the delimiter '<>', with an extra delimiter '<>' after the second
word. So you may convert any non-NSP input to this format, where two words
occurring in the same context are '<>' separated.
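
As an illustration of this format, here is a small Python sketch that reads
word pairs from lines shaped like WORD1<>WORD2<>. The function name
read_pairs is hypothetical, and the decision to skip lines that do not match
is an assumption of the sketch, not documented kocos.pl behaviour. Anything
that follows the second '<>' on a line is ignored.

```python
def read_pairs(lines):
    """Yield (word1, word2) from lines shaped like 'WORD1<>WORD2<>...'."""
    for line in lines:
        fields = line.strip().split("<>")
        # 'a<>b<>' splits into ['a', 'b', '']: two words plus whatever
        # follows the trailing delimiter. Lines with fewer fields are skipped.
        if len(fields) >= 3 and fields[0] and fields[1]:
            yield fields[0], fields[1]


sample = ["print<>in<>", "text<>line<>", "not a bigram line"]
print(list(read_pairs(sample)))  # [('print', 'in'), ('text', 'line')]
```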

Controlling scope of the context
--------------------------------
You may want to treat two words as co-occurrences of each other only if they
occur within a specific distance of each other. In this case we encourage you
to use the --window w option of the NSP program count.pl when creating SOURCE.
This will create bigrams of all words that co-occur within a distance w of each
other. Thus --window w sets the maximum distance allowed between two words for
them to be called co-occurrences of each other.

Note that if the --window option is not used while creating SOURCE, only those
words which come immediately next to each other will be considered as
co-occurrences (default window size being 2 for bigrams).
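
The effect of the window size can be sketched in Python as follows. This is
only an illustration and not the count.pl implementation. The function name
window_bigrams is hypothetical, and the sketch assumes that a window of size
w pairs each word with the w-1 words that follow it, so that a window of 2
pairs only adjacent words.

```python
def window_bigrams(tokens, window=2):
    """Return ordered word pairs whose members fall within `window` tokens."""
    pairs = []
    for i, first in enumerate(tokens):
        # The second word may be at most window-1 positions to the right.
        for second in tokens[i + 1 : i + window]:
            pairs.append((first, second))
    return pairs


tokens = ["print", "the", "line"]
print(window_bigrams(tokens))            # [('print', 'the'), ('the', 'line')]
print(window_bigrams(tokens, window=3))  # adds ('print', 'line')
```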
 

3.2 WORD
----------
Specify the target WORD whose co-occurrences are to be found. It goes on the
command line after the program name, options (if any) and SOURCE file.

-----------
4. Options
-----------

4.1 --order K
--------------
If the value of K is specified using the command line option --order K,
kocos.pl will find the Kth order co-occurrences of WORD. K can take any
integer value greater than or equal to 0. If K is not specified,
the program sets K to 1 and simply finds the co-occurrences of
WORD (the term co-occurrence generally means first order co-occurrence).
{The 0th order co-occurrence of a WORD is an interesting notion here: it is
the WORD itself.}

4.2 --trace TRACE_FILE
-----------------------
To see a detailed report of how each Kth order co-occurrence is reached as a 
sequence of K words, specify the name of the TRACE_FILE on the command line 
using --trace TRACE_FILE option. 

TRACE_FILE will show chains of K+1 words where the first word is the target
WORD and every ith word in the chain is an (i-1)th order co-occurrence of WORD
that co-occurs with the (i-1)th word in the chain. So a chain of K+1 words,
WORD->COC1->COC2->COC3->...->COC(K-1)->COC(K)
shows that COC1 is a first order co-occurrence of WORD.
COC2 is a second order co-occurrence of WORD such that COC2 co-occurs with
COC1, which in turn co-occurs with WORD.
COC3 is a third order co-occurrence of WORD such that COC3 co-occurs with
COC2, which in turn co-occurs with COC1, which co-occurs with WORD,

and so on.

{Viewing the co-occurrence structure as a co-occurrence graph or tree is an 
interesting notion and for further details on this please refer to the source 
code comments in the program kocos.pl before the subroutine 'trace'.}
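
The chains described in this section can be sketched in Python as shown
below. This is an illustration only and not the 'trace' subroutine of
kocos.pl. The function name cooccurrence_chains is hypothetical, the sketch
assumes that a word reached at a lower order does not reappear later in a
chain, and the order in which chains are listed is arbitrary. The word pairs
are those of the test.input file in Section 8.

```python
from collections import defaultdict


def cooccurrence_chains(pairs, word, k):
    """Return chains of K+1 words, 'WORD->COC1->...', as strings."""
    # Symmetric co-occurrence map: word -> set of co-occurring words.
    neighbors = defaultdict(set)
    for w1, w2 in pairs:
        neighbors[w1].add(w2)
        neighbors[w2].add(w1)

    seen = {word}
    chains = [[word]]
    for _ in range(k):
        extended = []
        reached = set()
        for chain in chains:
            # Extend each chain by the unseen words that co-occur with its
            # last word; sorting only makes the output order repeatable.
            for nxt in sorted(neighbors.get(chain[-1], set()) - seen):
                extended.append(chain + [nxt])
                reached.add(nxt)
        seen |= reached
        chains = extended
    return ["->".join(chain) for chain in chains]


pairs = [
    ("print", "in"), ("print", "line"), ("text", "the"), ("text", "line"),
    ("file", "the"), ("file", "in"), ("line", "file"),
]
for chain in cooccurrence_chains(pairs, "line", 2):
    print(chain)
```

With these pairs and K set to 2, the sketch produces the same four chains as
the test.trace example in Section 8, though not necessarily in the same order.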

4.3 --help
-----------
This option will display the help message.

4.4 --version
--------------
This option will display version information of the program.

----------
5. Output 
----------

The program will display a list of Kth order co-occurrences on standard
output, with each co-occurrence on a separate line and
followed by '<>' (just to be compatible with other programs in NSP).

Note that the output of kocos.pl can be used directly by the program
bsp2regex of the SenseTools Package (by Satanjeev Banerjee and Ted
Pedersen) to convert Senseval data instances into feature vectors in ARFF
format, where our Kth order co-occurrences are used as features.

For more information on SenseTools you can refer to its README:
http://www.d.umn.edu/~tpederse/Code/Readme.SenseTools-0.1.txt 

------------------
6. Usage examples 
------------------

(a)	Using default value of order 
To find the (1st order) co-occurrences of a word 'line' from the SOURCE file 
test.input run kocos.pl using the following command. 
 	kocos.pl test.input line 

(b)	Using option order 
To find the 2nd order co-occurrences of a word 'line' from the SOURCE file
test.input run kocos.pl using the following command.
	kocos.pl --order 2 test.input line

(c)	Using the trace option
To see how the 4th order co-occurrences of a word 'line' are reached as
sequences of words that form co-occurrence chains, run kocos.pl using the
following command.
	kocos.pl --order 4 --trace test.trace test.input line

--------------------------
7. General Recommendations
--------------------------

(a) Create a SOURCE file using programs count.pl or statistic.pl of the NSP 
    Package. 
(b) Use --window W option of program count.pl to specify the scope of the 
    context. Any word that occurs within a distance W from a target word will be
    treated as its co-occurrence.

---------------------------------------
8. Examples of Kth order co-occurrences
---------------------------------------

In all the following examples, we assume that the input comes from the file 
test.input and word 'line' is a target word. 

test.input => 			
----------------
print<>in<>	|
print<>line<>	|
text<>the<>	|
text<>line<>	|
file<>the<>	|
file<>in<>	|
line<>file<>	|
----------------
(Note that test.input does not look like typical count.pl/statistic.pl output.
This is fine because kocos.pl minimally requires only two words WORD1 and
WORD2 separated by '<>' with an extra '<>' after WORD2, as described in
Section 3.1 of this README.)

(a)	The 1st order co-occurrences of word 'line' can be found by 	
running kocos.pl with either of the following commands -
	kocos.pl test.input line	
		OR
	kocos.pl --order 1 test.input line

This will display the co-occurrences of 'line' to standard output as shown
below in the box. 
--------	
text<>	|
file<>	|
print<>	|
--------
This is because the program finds the bigrams 
print<>line<>
text<>line<>
line<>file<> 
where word 'line' co-occurs with the words print, text and file which become 
the 1st order co-occurrences. 

(b)     The 2nd order co-occurrences of word 'line' can be found by 
running kocos.pl with the following command -
        kocos.pl --order 2 test.input line

This will display the 2nd order co-occurrences of 'line' to standard output 
as shown below in the box.
--------
the<>  	|
in<> 	|
--------
This is because the program finds the words print, text and file as the 
first order co-occurrences (as explained in case a) and finds bigrams 
print<>in<>
text<>the<>
file<>the<>
file<>in<>
where 'the' and 'in' co-occur with the words print, text, file.  

(c)     To see how the 2nd order co-occurrences of word 'line' are reached 
run the program using the following command -
        kocos.pl --order 2 --trace test.trace test.input line

This will display the 2nd order co-occurrences of 'line' to standard output
as shown below in the box.
--------
the<>   |
in<>    |
--------
and a detailed report of co-occurrence chains in test.trace file as shown 
in the box below. 
test.trace =>
----------------
line->text->the	|
line->file->the	|
line->file->in	|
line->print->in	|
----------------
where  
the first line shows that the word 'line' co-occurred with 'text' which
co-occurred with 'the'. Hence 'the' became a 2nd order co-occurrence. 
Similarly, 'line' co-occurred with 'file' which in turn co-occurred with 
'the' and 'in' which are therefore the 2nd order co-occurrences of 'line'.

----------
9. Copying
----------
This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place - Suite 330, Boston, MA  02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

------------------
10. Acknowledgment
------------------
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).

--------------
11. References
--------------

[Schutze98] H. Schutze. Automatic word sense discrimination. Computational
Linguistics, 24(1):97-123, 1998.

[last updated on 01/08/2003 by Amruta]