-
We proposed a sequential greedy clustering algorithm to efficiently cluster a large number of
lexical patterns in our WWW2009 paper.
Download the archive and decompress it.
Inside you will find a python script (seqclust.py) and a sample data file.
To run the clustering algorithm use the following command.
$ python seqclust.py -i input_file -c output_file
input_file must be in a sparse matrix format, where each row starts with the row_id and
the rest of the elements in a row are values of each element in that row.
A colon is used to separate the colum_id from the value. This is the format used by
most machine learning classification programs to specify features for training instances
(row_id is replaced with the class label). Neither row ids nor column ids are required
to be sorted in any particular order. For example, the following line represents the first
row of the matrix, the elements (1,2) being 5.
1 1:10 2:5 6:10 10:12
The columns will be sequentially clustered by the proposed clustering algorithm and the column ids
in each cluster will be written to the output file in CSV format.
A sample data matrix is provided in the above download. To cluster the sample data matrix
execute the following command.
$ python seqclust.py -i wpair.matrix -c clusters.results
|