-----------------------------------------------------------------------------------------
File: Documentation
Author: Valentina Greco
Date last modified: December 22, 2006
-----------------------------------------------------------------------------------------

This file contains instructions for getting a programmer or a user started with the
software.

PREREQUISITES:
--------------
To use this package, first install the following (if they are not already installed):

 - Perl
 - Bioperl (Bioperl v. 1.4 is provided in the sources folder)
 - The necessary Perl modules (also provided in the sources folder)

NOTE: To use the software under Windows, first download and install Cygwin, which is
a Linux-like environment for Windows. It is available for free at www.cygwin.com.
To get Bioperl running under Cygwin, first install the basic Cygwin perl, make and
gcc packages, then follow the Bioperl installation instructions for Unix in Bioperl's
INSTALL file.

------------------------------------------------------------------------------------

Compiling the source code of the boosting library:

Open a terminal window, move to the directory ../sources/booster and compile the
library by typing the command:

 - make

The executable file, named bst, will be created in the same directory as the sources.

NOTE: Compiling the source code is not strictly necessary, since executables are
provided for different platforms.

HOW TO USE THE SOFTWARE
-----------------------
First, place the gold standard and the dataset in the working folder
(binaries/Unix, binaries/MacOs or binaries/Cygwin, depending on the platform).

1. Preprocessing

To start the preprocessing step, move to the directory '../binaries/Unix',
'../binaries/MacOs' or '../binaries/Cygwin' and type:

 - perl preproc.pl <"compressor parameters">

-----------------------------------------------------------------------------------
A list of compressors follows.
NOTE: A detailed description of all the compressors' parameters can be obtained by
typing <./bst> without arguments.

Gzip                 <"gzip -c">
Bzip2                <"bzip2 -c">
MtfRleMth            <"./bst -a6 -fm -O">
RleRc fast           <"./bst -a7 -f -y65536 -z256 -O">
MtfRleRc fast        <"./bst -a8 -fm -y65536 -z256 -O">
BoostRleRc fast      <"./bst -a7 -c1 -y65536 -z256 -O">
RleRc med            <"./bst -a7 -f -y65536 -z32 -O">
MtfRleRc med         <"./bst -a8 -fm -y65536 -z32 -O">
BoostRleRc med       <"./bst -a7 -c1 -y65536 -z32 -O">
RleRc slow           <"./bst -a7 -f -y65536 -z4 -O">
MtfRleRc slow        <"./bst -a8 -fm -y65536 -z4 -O">
BoostRleRc slow      <"./bst -a7 -c1 -y65536 -z4 -O">
RleAc fast           <"./bst -a1 -f -y16384 -z64 -O">
MtfRleAc fast        <"./bst -a2 -fm -y16384 -z64 -O">
BoostRleAc fast      <"./bst -a1 -c1 -y16384 -z64 -O">
RleHuff              <"./bst -a10 -f -O">
MtfRleHuff           <"./bst -a11 -fm -O">
BoostRleHuff         <"./bst -a10 -c1 -O">
Wavelet              <"./bst -a4 -f -O">
BoostWav             <"./bst -a4 -c1 -O">
Rc fast              <"./bst -a9 -y65536 -z256 -n -O">
Rc med               <"./bst -a9 -y65536 -z32 -n -O">
Rc slow              <"./bst -a9 -y65536 -z8 -n -O">
Ac fast              <"./bst -a3 -y16384 -z64 -n -O">
Ac med               <"./bst -a3 -y16384 -z8 -n -O">
Ac slow              <"./bst -a3 -y16384 -z1 -n -O">
Huffman              <"./bst -a12 -n -O">
-----------------------------------------------------------------------------------

This first step produces three different text files as output:

 - file_names
 - compression_values
 - cat_compression_values

NOTE: The software can also be tested with two more compression algorithms, with
some restrictions: both of them can be run only under Unix. They are Gencompress,
which compresses DNA sequences, and PPMd.

The executable file of gencompress is available at
www.cs.cityu.edu.hk/~cssamk/gencomp/GenCompress1.htm
To use it, type:

 - perl preproc.pl <"./gencompress"> > out

To use PPMd, an extra script named preprocPPMd.pl is provided. To preprocess the
dataset with PPMd, first move the dataset, the asterisk file and the PPMd executable
file into the ../binaries/Unix/dataset folder, then open the preprocPPMd.pl file and
change the path. Finally, move to the directory ../binaries/Unix/dataset and type:

 - perl preprocPPMd.pl <"./PPMd e -o16 -m256 -r1"> >

2. Similarity matrix construction

To obtain the similarity matrix, first choose one of the three available similarity
functions and then type:

 - perl ucd.pl >
or
 - perl ncd.pl >
or
 - perl cd.pl >

where data.mat is the similarity matrix obtained.

3. Classification

Two classification algorithms for tree construction are available: UPGMA and
Neighbor Joining. To use them, type:

 - perl mat-to-tree.pl >
and
 - perl mat-to-tree.pl >

The output is a tree in Newick format.

4. Evaluation

4.1 F-measure

To evaluate the performance of the compression-based classification with the
F-measure, type:

 - perl f-measure.pl

The output is a value ranging from zero (highest dissimilarity) to one (identical
classification).

4.2 Partition distance

To compute the topological distance between two rooted trees with the
tree-dist-sym-dif.pl script, first create a text file with two lines, where the
first line is the classification output tree and the second line is the gold tree.
If intree is the file created, the partition distance is obtained by typing:

 - perl tree-dist-sym-dif.pl
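
To illustrate what the partition distance of section 4.2 measures, here is a
minimal Python sketch. It is not the code of tree-dist-sym-dif.pl. It assumes the
symmetric-difference definition over clades (the set of leaves below each internal
node of a rooted tree), counted without any normalization, and it assumes simple
Newick strings with unquoted labels. The function names parse_newick, clades,
partition_distance and distance_from_lines are hypothetical, chosen only for this
example. The input mirrors the two-line layout of the intree file: first the
classification output tree, then the gold tree.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested tuples; leaves are strings.

    Assumption: labels are unquoted. Branch lengths (":0.1") and internal
    node labels are skipped, and all whitespace is dropped.
    """
    text = "".join(text.split()).rstrip(";")
    pos = 0

    def skip_length():
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            while pos < len(text) and text[pos] not in ",()":
                pos += 1

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        return text[start:pos]

    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # consume "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
            read_label()                  # internal node label, ignored
            skip_length()
            return tuple(children)
        label = read_label()
        skip_length()
        return label

    return node()


def clades(tree):
    """Return the set of clades (frozensets of leaf names) of a rooted tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


def partition_distance(tree_a, tree_b):
    """Count the clades that appear in exactly one of the two trees."""
    return len(clades(tree_a) ^ clades(tree_b))


def distance_from_lines(lines):
    """Take the first two non-empty lines: output tree, then gold tree."""
    first, second = [line for line in lines if line.strip()][:2]
    return partition_distance(parse_newick(first), parse_newick(second))


# Hypothetical two-line input in the layout described for the intree file.
example = ["((A,B),(C,D));", "((A,C),(B,D));"]
print(distance_from_lines(example))
```

For the example above the sketch prints 4: the clades {A,B} and {C,D} occur only in
the first tree, {A,C} and {B,D} only in the second, and the root clade is shared.
Two identical trees give 0. Other definitions halve or normalize this count, so
values may differ from those reported by the script.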