README for socs.pl program 
  			  for NSP Package Version 0.53
			 ===============================
			
 				     socs.pl 	
				   Version 0.01
			    Copyright (C) 2002-2003
			Amruta Purandare, pura0010@umn.edu
                          Ted Pedersen, tpederse@umn.edu
                         University of Minnesota, Duluth

----------------
1. Introduction
----------------
This program finds the 2nd order co-occurrences of a given target word. 

---------------------------------------
1.1 What are 2nd order Co-occurrences?
---------------------------------------
Co-occurrences are words that occur together in the same context. All words 
that co-occur with a given target word are called its co-occurrences. 
The concept of 2nd order co-occurrences is explained in the paper Automatic 
Word Sense Discrimination [Schutze98]. According to this paper, the words 
that co-occur with the co-occurrences of a target word are called its 
2nd order co-occurrences. 

This program implements that concept: it finds the 2nd order co-occurrences
of a given word from the bigram output created by the NSP programs count.pl
or statistic.pl. 
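
The definition above can be sketched in a few lines of Python. This is only an
illustration of the concept, not the code of socs.pl. It assumes that
co-occurrence is symmetric and that only the target word itself is removed from
the result; this README does not say how socs.pl treats a word that is both a
co-occurrence and a 2nd order co-occurrence. The function name second_order and
the sample pairs are hypothetical.

```python
from collections import defaultdict

def second_order(pairs, target):
    """Return the set of 2nd order co-occurrences of target.

    pairs is an iterable of (word1, word2) tuples. Co-occurrence is treated
    as symmetric, which is an assumption of this sketch.
    """
    neighbours = defaultdict(set)
    for w1, w2 in pairs:
        neighbours[w1].add(w2)
        neighbours[w2].add(w1)

    # Step 1: the words that co-occur with the target.
    first = neighbours.get(target, set())

    # Step 2: the words that co-occur with any of those words.
    second = set()
    for word in first:
        second |= neighbours[word]

    # The target trivially co-occurs with its own co-occurrences; drop it.
    second.discard(target)
    return second

# Hypothetical pairs, invented for illustration only.
pairs = [("bank", "river"), ("river", "water"),
         ("bank", "money"), ("money", "cash")]
print(sorted(second_order(pairs, "bank")))  # ['cash', 'water']
```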

---------
2. Usage
---------

Usage: socs.pl [OPTIONS] SOURCE WORD

---------
3. Input 
---------
-----------
3.1 SOURCE 
-----------
Specify the SOURCE file name on the command line after the program name as 
shown in the usage note. 

SOURCE should be the output (normal or extended) created by count.pl or 
statistic.pl for bigrams. When count.pl and statistic.pl are run to create 
bigrams (--ngram set to 2 or not specified), they list the bigrams of all 
words that co-occur. So if a bigram 'word1<>word2<>' is listed in the output 
of count.pl or statistic.pl, the words word1 and word2 are co-occurrences of 
each other.

If you want to run socs.pl on a SOURCE that was not created by the count.pl 
or statistic.pl program of this package, just make sure that each line of 
SOURCE lists two words WORD1 and WORD2 as 
WORD1<>WORD2<> 

The program minimally requires that there are exactly two words, separated by 
the delimiter '<>', with an extra delimiter '<>' after the second word. So you 
may convert any non-NSP input to this format, where two words occurring in the 
same context are '<>' separated.  
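
As an illustration of the line format described above, the following Python
sketch reads and writes 'WORD1<>WORD2<>' lines. The helper names parse_pair and
format_pair are hypothetical and are not part of NSP. The sketch assumes that
whatever follows the second '<>' on a line can be ignored, and it rejects any
line that does not contain exactly two '<>' delimiters.

```python
def parse_pair(line):
    """Return (word1, word2) for a 'WORD1<>WORD2<>' line, else None."""
    fields = line.rstrip("\n").split("<>")
    # Exactly two '<>' delimiters give exactly three fields: the two words
    # and whatever follows the final delimiter (possibly nothing).
    if len(fields) != 3 or not fields[0] or not fields[1]:
        return None
    return fields[0], fields[1]

def format_pair(word1, word2):
    """Write two co-occurring words in the format that SOURCE requires."""
    return f"{word1}<>{word2}<>"

print(parse_pair("print<>line<>"))   # ('print', 'line')
print(parse_pair("print line"))      # None
print(format_pair("text", "the"))    # text<>the<>
```

A converter for non-NSP data would call format_pair once for every pair of
words that you consider to be in the same context.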

--------------------------------
Controlling scope of the context
--------------------------------
You may want to treat two words as co-occurrences of each other only if they 
occur within a specific distance of each other. In that case we encourage you 
to use the --window w option of the NSP program count.pl when creating SOURCE. 
This creates bigrams of all words that co-occur within a distance w of each 
other. Thus --window w sets the maximum distance allowed between two words for 
them to be called co-occurrences of each other. 

Note that if the --window option is not used while creating SOURCE, only those
words which come immediately next to each other will be considered as
co-occurrences (default window size being 2 for bigrams).
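
The effect of the window size on which word pairs are formed can be sketched as
follows. This is not count.pl itself and it does not reproduce count.pl's
tokenization or counts. It assumes that the window size includes both words of
the pair, which is consistent with a window of 2 meaning adjacent words. The
function name window_pairs and the token list are hypothetical.

```python
def window_pairs(tokens, window=2):
    """Yield (earlier, later) word pairs that fit inside one window.

    Assumption of this sketch: the window size counts both words of the
    pair, so window=2 pairs each word only with the word right after it.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    for i, left in enumerate(tokens):
        # The slice holds the (window - 1) words that follow position i.
        for right in tokens[i + 1 : i + window]:
            yield left, right

tokens = ["print", "the", "text", "line"]   # hypothetical token sequence
print(list(window_pairs(tokens)))            # adjacent words only
print(list(window_pairs(tokens, window=3)))  # also words one position apart
```

With window=3 the pair ('print', 'text') is formed in addition to the adjacent
pairs, so a larger window gives the target word more co-occurrences.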
 
----------
3.2 WORD
----------
Specify the target WORD on the command line after the program name and the 
SOURCE file. socs.pl finds the 2nd order co-occurrences of this WORD. 

-----------
4. Options
-----------
-----------
4.1 --help
-----------
This option will display the help message.

--------------
4.2 --version
--------------
This option will display version information of the program.

----------
5. Output 
----------

The program writes the list of 2nd order co-occurrences to standard output, 
one per line. Each co-occurrence is followed by '<>' (for compatibility with 
other programs in NSP). 

Note: the output of socs.pl can be used directly by the program bsp2regex of
the SenseTools Package (by Satanjeev Banerjee and Ted Pedersen) to convert
Senseval data instances into feature vectors in ARFF format, with the 2nd order
co-occurrences as features. For more information on this package, see the 
README of the SenseTools Package at 
http://www.d.umn.edu/~tpederse/Code/Readme.SenseTools-0.1.txt

------------------
6. Usage examples 
------------------

To find the 2nd order co-occurrences of the word 'line' from the SOURCE file 
test.input, run socs.pl using the following command:

 	socs.pl test.input line 

--------------------------
7. General Recommendations
--------------------------
(a) Create a SOURCE file using programs count.pl or statistic.pl of the NSP 
    Package.
(b) Use --window W option of program count.pl to specify the scope of the 
    context. Any word that occurs within a distance W from a target word will be
    treated as its co-occurrence.

---------------------------------------
8. Examples of 2nd order co-occurrences
---------------------------------------

test.input => 			
----------------
print<>in<>	|
print<>line<>	|
text<>the<>	|
text<>line<>	|
file<>the<>	|
file<>in<>	|
line<>file<>	|
----------------
(Note that test.input does not look like actual count.pl or statistic.pl 
output. That is fine, because socs.pl only requires two words WORD1 and WORD2 
separated by '<>', with an extra '<>' after WORD2, as described in Section 3.1 
of this README.) 

The 2nd order co-occurrences of the word 'line' can be found by running 
socs.pl with the following command:
        socs.pl test.input line

This displays the 2nd order co-occurrences of 'line' on standard output, 
as shown in the box below.
--------
the<>  	|
in<> 	|
--------
This is because the program finds the bigrams 
print<>line<>
text<>line<>
line<>file<>
where 'line' co-occurs with the words print, text and file. 

The program also finds the bigrams
print<>in<>
text<>the<>
file<>the<>
file<>in<>
where 'the' and 'in' co-occur with the words print, text and file.  
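
The reasoning in this example can be traced step by step with the Python sketch
below. It is an independent illustration and not the code of socs.pl. The name
trace is hypothetical, co-occurrence is assumed to be symmetric, and the words
are sorted alphabetically so that the result is reproducible, so the order of
the printed lines may differ from the order shown in the box above.

```python
from collections import defaultdict

# The seven word pairs of test.input from this section.
TEST_INPUT = """\
print<>in<>
print<>line<>
text<>the<>
text<>line<>
file<>the<>
file<>in<>
line<>file<>
"""

def trace(source_text, target):
    """Return (co-occurrences, words contributed by each, 2nd order words)."""
    neighbours = defaultdict(set)
    for line in source_text.splitlines():
        fields = line.split("<>")
        if len(fields) < 2 or not fields[0] or not fields[1]:
            continue  # not a word pair, skip the line
        w1, w2 = fields[0], fields[1]
        neighbours[w1].add(w2)   # assumption: co-occurrence is symmetric
        neighbours[w2].add(w1)

    first = sorted(neighbours[target])
    # For each co-occurrence, the words it co-occurs with (minus the target).
    contributed = {w: sorted(neighbours[w] - {target}) for w in first}
    second = sorted(set().union(*contributed.values()))
    return first, contributed, second

first, contributed, second = trace(TEST_INPUT, "line")
print("co-occurrences of 'line':", first)
for word, others in contributed.items():
    print(f"  {word} co-occurs with {others}")
for word in second:
    print(f"{word}<>")
```

The last loop prints in<> and the<>, the same two words as in the box above.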

-------------
9. Copyright
-------------
This suite of programs is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at your
option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 59 Temple
Place - Suite 330, Boston, MA  02111-1307, USA.

Note: The text of the GNU General Public License is provided in the file
GPL.txt that you should have received with this distribution.

------------------
10. Acknowledgment
------------------
This work has been partially supported by a National Science Foundation
Faculty Early CAREER Development award (#0092784).

--------------
11. References
--------------

[Schutze98] H. Schutze. Automatic word sense discrimination. Computational
Linguistics, 24(1):97-123, 1998.

(README.socs.txt last updated on 01/13/2003 by Amruta)