Roberto Tagliaferri

 

Interactive Machine Learning tools for the analysis of genomic data

In recent years, Knowledge Discovery in Databases (KDD) has become increasingly important in several fields of research. The explosive growth in the quantity, quality and accessibility of data currently experienced in all fields of science and human endeavour has triggered the search for a new generation of computational theories and tools capable of assisting humans in extracting useful information (knowledge) from huge amounts of distributed and heterogeneous data.

At the core of the KDD process is the application of specific data mining methods for pattern discovery and extraction. In genetics, for example, several data mining approaches have been proposed to analyse catalogues obtained from genome sequencing projects; in astronomy, such methods are applied to many classification and regression tasks.

In Genetics and Bioinformatics, scientists have been successful in cataloguing genes through genome sequencing projects, and they can now generate vast quantities of gene expression data using microarrays.

Among the various applications, we recall:

Diagnostics: finding gene expression patterns specific to given classes (mainly addressed with supervised methods).

Clustering: grouping genes that are functionally related without attempting to model the underlying biology.

Model-based approaches: generating a model that justifies the grouping of specific genes and training the parameters of the model on the data set.

Projection methods: decomposing the data set into components that have the desired properties. This class includes methods such as Principal Component Analysis (PCA), Independent Component Analysis (ICA) and Probabilistic Principal Surfaces (PPS).
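As an illustration of the projection methods listed above, the following is a minimal sketch of PCA computed through a singular value decomposition with NumPy. It is not taken from the tools presented here: the function name `pca_project`, the synthetic 60 x 50 "expression matrix" and the two sample groups are assumptions introduced only for the example.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components.

    Hypothetical helper written for this illustration only.
    Returns (scores, components, explained_variance_ratio).
    """
    X = np.asarray(X, dtype=float)
    # Centre each column (gene) so the components describe variance.
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    # Coordinates of each sample in the low-dimensional space.
    scores = Xc @ components.T
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, components, explained[:n_components]


# Synthetic stand-in for expression data: 60 samples x 50 genes,
# with 5 genes shifted in the second group of 30 samples (assumed data).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 50))
X[labels == 1, :5] += 3.0

scores, components, explained = pca_project(X, n_components=2)
```

The two columns of `scores` can be passed directly to a scatter plot, coloured by `labels`, to inspect the samples in two dimensions. PCA is used here only because it is the simplest of the methods named; ICA and PPS would require different algorithms.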

Moreover, many of these applications can suffer from poor data visualization techniques.

Yet data visualization is an important means of extracting useful information from large quantities of raw data. The human eye and brain together make a formidable pattern detection tool, but for them to work the data must be represented in a low-dimensional space, usually of two or three dimensions. Even quite simple relationships can seem very obscure when the data are presented in tabular form, yet they are often very easy to see by visual inspection. Many results in experimental biology first appear in image form: a photo of an organism, cells, gels, or microarray scans. As the quantity of these results grows, automatic extraction of features and meaning from experimental images becomes crucial.

At the other end of the data pipeline, naive 2D or 3D visualizations alone are inadequate for exploring bioinformatics data. Biologists need a visual environment that facilitates the exploration of high-dimensional data that depend on many parameters. In this context, further research into bioinformatics visualization is needed to develop tools that will meet the upcoming genomic and proteomic challenges. Many algorithms for data visualization have been proposed by both the neural computing and statistics communities, most of which are based on a projection of the data onto a two- or three-dimensional visualization space. We briefly review some of these advanced visualization techniques, and propose an integrated environment for clustering and 2D or 3D visualization of high-dimensional biomedical data.