This research is funded by    
  Supplementary material for the paper:
"Compression-Based Classification of Biological Sequences and Structures via the Universal Similarity Metric: Experimental Assessment"

by Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini and Gabriel Valiente,
BMC Bioinformatics 2007, 8:252, Research highlight


 


Software:

The software implements a multistep approach for classifying and clustering biological sequences and structures via compression. The links below provide the source code, the executables, and the datasets used for the experiments described in the manuscript. The software is free software, released under the GNU General Public License as published by the Free Software Foundation.
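As a rough illustration of what compression-based classification means, the sketch below computes the normalized compression distance (NCD), a widely used compression-based approximation of the Universal Similarity Metric. It is not the code distributed here: the function names are hypothetical, and Python's zlib stands in for the compressors shipped with the package, so the numbers it produces are only indicative.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after compression (zlib is an assumed stand-in)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def distance_matrix(seqs: dict) -> dict:
    """Pairwise NCD for a {name: sequence} mapping, keyed by (name, name)."""
    names = list(seqs)
    return {(a, b): ncd(seqs[a], seqs[b]) for a in names for b in names}
```

A matrix of this kind is the usual input to a clustering or tree-building step; related sequences are expected to compress well together and therefore to lie closer than unrelated ones.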

 

Downloads of the software:

  • Entire package for different platforms (zip format or tar.gz format).

  • Source code for the gcc compiler (Linux/Unix), including the Compression Boosting Library (P. Ferragina, R. Giancarlo, G. Manzini, 2005) (zip format or tar.gz format) (compilation tested on: FreeBSD 6.1-RELEASE i386, Linux Ubuntu 5.10 kernel 2.6.15.4 i686, Linux Slackware 10.2 kernel 2.6.15.4, Mac OS X 10.4.8 kernel Darwin 8.8 x86).

  • Binary executable files for Cygwin/Windows (zip format or tar.gz format) (compiled and tested under Cygwin, version 2.05).

  • Binary executable files for Linux/Unix i386 architecture (zip format or tar.gz format) (tested on: FreeBSD 6.1-RELEASE i386, Linux Ubuntu 5.10 kernel 2.6.15.4 i686, Linux Slackware 10.2 kernel 2.6.15.4).

  • Binary executable files for Mac OS X (zip format or tar.gz format) (tested on Mac OS X 10.4.8 kernel Darwin 8.8 x86).

  • Documentation file.

     

Links to datasets:

The datasets used in the experiments are available at the links below. More details about the datasets are provided in the manuscript.

  • CK-36-PDB dataset consisting of 36 amino acid sequences (zip or tar.gz format)

  • CK-36-REL dataset consisting of 36 complete TOPS strings with contact map (zip or tar.gz format)

  • CK-36-SEQ dataset consisting of 36 TOPS strings of secondary structure elements (zip or tar.gz format)

  • SP-86-PDB dataset consisting of 86 amino acid sequences (zip or tar.gz format)

  • SP-86-ATOM dataset consisting of 86 ATOM lines from PDB entries (zip or tar.gz format)

  • AA-15-DNA dataset consisting of 15 mitochondrial DNA sequences (zip or tar.gz format)