NSP To Do List	

The following list describes some of the features that we'd like to 
include in NSP in future. No particular priority is assigned to these 
items - they are all things we've discussed amongst ourselves or with 
users and agree would be good to add. 

If you have additional ideas, or would like to comment on something on the 
current list, please let me know at tpederse@umn.edu. 

============================================================================

MORE EFFICIENT COUNTING

Right now all the ngrams being counted are stored in memory, with each 
ngram an element in a hash. This is fine for corpora of up to a few 
million words, but beyond that things slow down considerably. We would 
like to pursue the idea of using suffix arrays, which would greatly 
improve space utilization. 

The use of suffix arrays for counting term frequencies is described in:
	
Yamamoto, M. and Church, K. (2001) “Using Suffix Arrays to Compute Term 
Frequency and Document Frequency for All Substrings in a Corpus,” 
Computational Linguistics, 27(1), pp. 1-30, MIT Press.

Find the article at:

http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf
http://www.research.att.com/~kwc/CL_suffix_array.pdf

In fact, they even provide a C implementation:

http://www.milab.is.tsukuba.ac.jp/~myama/tfdf/index.html

However, we would need to convert this into Perl, and might need to 
modify it somewhat to fit into NSP. 
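
To make the idea concrete, here is a minimal sketch of counting ngram 
frequencies with a suffix array over tokens. It is written in Python purely 
for illustration (NSP itself is Perl), it is not the Yamamoto and Church 
implementation, and the function names are our own. The construction is 
deliberately naive, so it shows the lookup technique rather than the space 
savings a real implementation would deliver.

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    """Return the start positions of all suffixes, sorted lexicographically."""
    # Naive construction that compares whole suffixes; illustration only.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def ngram_frequency(tokens, suffix_array, ngram):
    """Count occurrences of `ngram` by binary search over the sorted suffixes."""
    ngram = list(ngram)
    n = len(ngram)
    # Truncating sorted suffixes to n tokens keeps them sorted, so all
    # suffixes that start with `ngram` form one contiguous block.
    # (A real implementation would compare in place instead of copying.)
    prefixes = [tokens[i:i + n] for i in suffix_array]
    return bisect_right(prefixes, ngram) - bisect_left(prefixes, ngram)
```

For the tokens of "to be or not to be", looking up the bigram "to be" 
finds a block of two suffixes, so its frequency is 2.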

Another alternative would be to simply modify the count.pl program such 
that rather than using memory it used disk space to accumulate counts. 
This would be very slow but might suffice for certain situations.
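
A rough sketch of the disk-based alternative follows, again in Python for 
illustration only. It assumes SQLite as the on-disk store, which is our 
choice for the example and not something count.pl does; the table name and 
function name are hypothetical. Ngrams are joined with "<>" as in count.pl 
output.

```python
import sqlite3

def count_ngrams_on_disk(tokens, n, db_path):
    """Accumulate ngram counts in an SQLite table instead of an in-memory hash."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(ngram TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    window = []  # only the last n tokens are ever held in memory
    for token in tokens:
        window.append(token)
        if len(window) > n:
            window.pop(0)
        if len(window) == n:
            key = "<>".join(window)
            # Bump an existing row; insert a new one if the ngram is unseen.
            cur = conn.execute(
                "UPDATE counts SET freq = freq + 1 WHERE ngram = ?", (key,)
            )
            if cur.rowcount == 0:
                conn.execute(
                    "INSERT INTO counts (ngram, freq) VALUES (?, 1)", (key,)
                )
    conn.commit()
    return conn
```

Every ngram costs at least one database statement, which is where the 
expected slowdown comes from.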

Regardless of the changes we make to counting, we would continue to 
support counting in memory, which is perfectly adequate for smaller 
corpora. 

============================================================================

GET COUNTS FROM WEB

The web is a huge source of text, and we could get counts for words 
or ngrams from the web (probably using something like the Perl LWP 
module). 

Rather than running count.pl on a particular body of text (as is the case 
now), we would probably have to run it so that it looked up counts for a 
specific set of words as found on the web. Simply running count.pl on the 
entire web wouldn't really make sense. So perhaps we would run count.pl on 
one sample to get a list of the word types/ngrams that we are interested 
in, and then query the web to find their respective counts.
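
The two-pass idea might look roughly like the sketch below (Python, for 
illustration only). The `lookup` argument is a hypothetical callable that 
returns a web count for a query string; how it would actually obtain that 
count (which search service, which protocol) is left open here, just as it 
is above.

```python
def ngram_types(tokens, n):
    """First pass: the distinct ngrams observed in a local sample."""
    return sorted({tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)})

def web_counts(types, lookup):
    """Second pass: ask `lookup` (hypothetical) for the count of each type."""
    return {ngram: lookup(" ".join(ngram)) for ngram in types}
```

Keeping the lookup behind a single callable means the counting logic does 
not depend on any particular web service.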

[Our interest in this has been inspired by both Peter Turney (ACL-02 
paper) and Frank Keller (EMNLP-02 paper).] 

============================================================================

UNICODE SUPPORT

NSP is geared for the Roman alphabet. Perl has increasingly better Unicode 
support with each passing release, and we will incorporate Unicode support 
in future. We attempted to use the Unicode features in Perl 5.6, but found
them to be incomplete. We have not yet attempted this with Perl 5.8 (the
now current version) but it is said to be considerably better. 

Perl support for Unicode will include language / alphabet specific 
definitions of regular expression character classes like \d+ or \w+ 
(digits and word characters). So you should be able to use (in theory) 
the same regular expression definitions with any alphabet and have them 
match in a way that makes sense for that language.
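
As an illustration of what Unicode-aware character classes buy you, here 
is a small Python sketch (Python rather than Perl, so it says nothing 
about how any particular Perl release behaves). In Python 3, \w and \d in 
a string pattern match word characters and decimal digits from any script. 
The `tokenize` helper is our own name.

```python
import re

WORD = re.compile(r"\w+")    # word characters in any script
DIGITS = re.compile(r"\d+")  # decimal digits in any script

def tokenize(text):
    """Split text into word tokens using the Unicode-aware \\w class."""
    return WORD.findall(text)
```

The same two patterns then handle accented Latin text, Cyrillic text and 
non-ASCII digits without any alphabet-specific definitions.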

Our expertise in this area is fairly limited, so please let us know if
we are missing something obvious or misunderstanding what Perl is
attempting to do. 

============================================================================

PROGRESS METER for count.pl

When processing large files, count.pl gives no indication of how much of 
the file has been processed, or even whether it is still making progress. 
A "progress meter" could show how much of the file has been processed, or 
how many ngrams have been counted, or something else to indicate that 
progress is being made. 
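
One possible shape for such a meter is sketched below in Python (for 
illustration only; the function name and the `every` parameter are 
hypothetical). It reports the percentage of input bytes consumed every few 
lines, and writes to standard error so that normal output is unaffected.

```python
import sys

def lines_with_progress(stream, total_bytes, every=100_000, out=sys.stderr):
    """Yield lines from a binary stream, reporting percent of bytes consumed."""
    seen = 0
    for lineno, line in enumerate(stream, start=1):
        seen += len(line)
        if lineno % every == 0:
            pct = 100.0 * seen / total_bytes if total_bytes else 100.0
            print(f"{lineno} lines, {pct:.1f}% of input", file=out)
        yield line
```

The total size would come from the file system (for example 
os.path.getsize) before counting starts.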

============================================================================

OVERLY LONG LINE DETECTOR for count.pl

If count.pl encounters a very long line of text (with literally thousands 
of words on a single line) it may operate very slowly. It would be good 
to let a user know that an overly long line is being processed, so that 
they can decide whether to continue or to terminate processing and 
reformat the input file. We would need to define more precisely what 
"overly long" means. This warning fits into the progress meter mentioned 
above.
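
A minimal detector could look like the following Python sketch 
(illustration only). The threshold of 1000 tokens is an arbitrary 
placeholder, since "overly long" has not been defined yet, and the 
function name is hypothetical.

```python
def find_long_lines(lines, max_tokens=1000):
    """Yield (line_number, token_count) for lines with more than max_tokens."""
    for lineno, line in enumerate(lines, start=1):
        count = len(line.split())
        if count > max_tokens:
            yield lineno, count
```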

============================================================================

GENERALIZE --newLine in count.pl

The --newLine switch tells count.pl that ngrams may not cross end of 
line markers. Presumably this would be used when each line of text 
consists of a sentence (thus the end of a line also marks the end of
a sentence). However, if the text is not formatted this way, and there 
may be multiple sentences per line or sentences that extend across 
several lines, we may want to generalize --newLine to include other 
characters that ngrams would not be allowed to cross. 

For example we could have the switch --dontCross "\n\.,;\?" which would  
prevent ngrams from crossing the newline, the fullstop, the comma, the 
semicolon and the question mark.
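
The behavior we have in mind can be sketched as follows, in Python for 
illustration only. The function name and the `dont_cross` parameter are 
hypothetical, and the sketch simply discards the boundary characters, 
which may not match how count.pl would tokenize them.

```python
import re

def ngrams_not_crossing(text, n, dont_cross="\n.,;?"):
    """Return ngrams that never span any character listed in dont_cross."""
    # Build a character class from the boundary characters, escaped safely.
    boundary = re.compile("[" + re.escape(dont_cross) + "]")
    result = []
    for segment in boundary.split(text):
        tokens = segment.split()
        # Ngrams are formed only from tokens within the same segment.
        result.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return result
```

With the default boundary set, "the cat sat. the dog" yields the bigrams 
"the cat", "cat sat" and "the dog", but not "sat the".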

============================================================================

ERROR CHECKS in statistic.pl

statistic.pl does not check that sample size equals the sum of values 
in the first column (the ngram counts). 

For example, the file below shows that there are 100 bigrams in the
sample. But if you sum the values of the number of times "in the" and
"on my" occur, you get 110! 

100
in<>the<>90 100 200
on<>my<>20 30 40

We could also check that the count of the Ngram does not exceed the 
counts of the individual words in it. For example,

100
in<>the<>90 10 20
on<>my<>20 30 40

This says that the bigram "in the" occurs 90 times, but that "in" 
is the first word in a bigram only 10 times, and "the" is the second 
word in a bigram only 20 times. This can't be.

Both of these would be nice to check, and would act as a safety net for  
count.pl.
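
Both checks are sketched below in Python (for illustration only; 
statistic.pl is Perl and the function name is hypothetical). The sketch 
handles bigram files in the layout shown above, and it assumes that every 
counted bigram appears in the file, so that the counts should sum to the 
sample size.

```python
def check_bigram_counts(lines):
    """Return a list of problems found in a bigram count file."""
    problems = []
    total = int(lines[0].strip())  # first line is the sample size
    running = 0
    for line in lines[1:]:
        line = line.strip()
        if not line:
            continue
        # Layout: word1<>word2<>joint first second
        *words, freqs = line.split("<>")
        joint, first, second = (int(x) for x in freqs.split())
        running += joint
        if joint > first or joint > second:
            problems.append(
                f"{'<>'.join(words)}: bigram count {joint} exceeds a word count"
            )
    if running != total:
        problems.append(f"sample size {total} != sum of bigram counts {running}")
    return problems
```

Run on the first example above it reports the sample size mismatch (100 
versus 110); run on the second it also flags "in<>the".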

============================================================================

RECURSE LIKE OPTION THAT CREATES MULTIPLE COUNT FILES

Our current --recurse option creates a single count output file for
all the words in all the texts found in a directory structure. We might
want to be able to process all the files in a directory structure such
that each file is treated separately and a separate count file is
created for it. 

For example, suppose we have the directory txts that contains the
files text1 and text2. Running

count.pl --recurse output txts

produces a single file, output, that contains the combined counts from 
txts/text1 and txts/text2.

This new option would count these files separately and produce separate 
count output files. 
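
In outline, the per-file behavior might look like this Python sketch 
(illustration only; the function name is hypothetical, and the simple 
whitespace tokenization and UTF-8 encoding are assumptions, not what 
count.pl does).

```python
import os
from collections import Counter

def count_each_file(root, n=2):
    """Map each file path under root to its own Counter of ngrams."""
    per_file = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as handle:
                tokens = handle.read().split()
            # Counts are kept per file instead of being merged.
            per_file[path] = Counter(
                "<>".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
            )
    return per_file
```

Each entry could then be written to its own count file, for example one 
named after the input file.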

============================================================================

OTHER CUTOFFS FOR count.pl

What about having a frequency cutoff for count.pl that removed any ngrams 
that occur more than some number of times? The idea here would be to 
eliminate high frequency ngrams not through the use of a stoplist but 
rather through a frequency cutoff, based on the presumption that most very 
high frequency ngrams will be made up of stop words. 

What about a percentage cutoff? In other words, eliminate some percentage 
of the least (or most) frequent ngrams? 
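
Both kinds of cutoff are easy to state in code. The Python sketch below is 
for illustration only; the function names are hypothetical, and the 
percentage version is just one reading of the idea (it drops the top N 
percent of ngram types by frequency, with ties broken arbitrarily).

```python
def apply_max_frequency(counts, max_freq):
    """Keep only ngrams that occur at most max_freq times."""
    return {ng: c for ng, c in counts.items() if c <= max_freq}

def drop_top_percent(counts, percent):
    """Drop the most frequent `percent` of ngram types."""
    ranked = sorted(counts, key=counts.get, reverse=True)
    drop = set(ranked[: int(len(ranked) * percent / 100)])
    return {ng: c for ng, c in counts.items() if ng not in drop}
```

Dropping the least frequent ngrams instead only requires sorting in the 
other direction.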

============================================================================

Last updated 1/14/03 by TDP