Topological Approaches to
Search and Matching in Massive Data Sets
"Approximation has two subtly
different aspects: metric and topology. Metric tells how close our
ideal point is to a specific wrong one. Topology tells how close it
is to the combination of all unacceptable (nonneighboring) points."
(L.A. Levin, 2003).
Ultrametricity has been known and used since the 1960s in the
context of hierarchical agglomerative cluster analysis.
Clustering is important in the processing of large data stores in
order to support retrieval, matching and compression. We study local
ultrametricity properties in the data. In so doing, we point to how
we are giving priority to topological over metric data analysis.
Our new perspective on data analysis is that p-adic algebraic or
ultrametric topological representation (both are equivalent) is
important for observational data analysis because of the insights
and benefits that such an approach provides.
Firstly, ultrametricity is a pervasive property of observational
data. It arises as a limit case when data dimensionality or sparsity
Secondly, practical data sets
(derived from, or observed in, databases
and data spaces) present some but not exclusively ultrametric
characteristics. This can be used for forensic data exploration (i.e.,
fingerprinting data sets).
Thirdly, of particular interest in this work, it can be used to
expedite search and discovery in information spaces.
F. Murtagh, On ultrametricity, data coding, and computation, Journal
of Classification, 21, 167-184, 2004.
F. Murtagh, Identifying the ultrametricity of time series, European
Physical Journal B, 43, 573-579, 2005.
F. Murtagh, Correspondence Analysis and Data Coding with R and Java,
Chapman & Hall/CRC, 2005.
F. Murtagh and G. Downs, A topology-based approach to the best match
problem: high dimensional clustering through ultrametrization, ms.in