Fionn Murtagh

Topological Approaches to Search and Matching in Massive Data Sets
 

"Approximation has two subtly different aspects: metric and topology. Metric tells how close our ideal point is to a specific wrong one. Topology tells how close it is to the combination of all unacceptable (nonneighboring) points." (L.A. Levin, 2003).

Ultrametricity has been known and used since the 1960s in the context  of hierarchical agglomerative cluster analysis. Clustering is important in the processing of large data stores in order to support retrieval, matching and compression. We study local ultrametricity properties in the data. In so doing, we point to how we are giving priority to topological over metric data analysis.

Our new perspective on data analysis is that p-adic algebraic or ultrametric topological representation (both are equivalent) is important for observational data analysis because of the insights and benefits that such an approach provides.

Firstly, ultrametricity is a pervasive property of observational data. It arises as a limit case when data dimensionality or sparsity grows.

Secondly, practical data sets (derived from, or observed in, databases and data spaces) present some but not exclusively ultrametric characteristics. This can be used for forensic data exploration (i.e., fingerprinting data sets).

Thirdly, of particular interest in this work, it can be used to expedite search and discovery in information spaces.


References

F. Murtagh, On ultrametricity, data coding, and computation, Journal of Classification, 21, 167-184, 2004.

F. Murtagh, Identifying the ultrametricity of time series, European Physical Journal B, 43, 573-579, 2005.

F. Murtagh, Correspondence Analysis and Data Coding with R and Java, Chapman & Hall/CRC, 2005.

F. Murtagh and G. Downs, A topology-based approach to the best match problem: high dimensional clustering through ultrametrization, ms.in preparation, 2006.