Mündliche Doktorprüfung von Herrn Dipl.-Inf. Alexander Hasenfuß am Dienstag, 22.09.2009, 10 Uhr s.t. im Seminarraum 106
Topographic Mapping of Dissimilarty Datasets
Abstract
A great challenge today, arising in many elds of science, is the proper mapping of datasets to explore their structure and gain information that otherwise would remain concealed due to the high-dimensionality. This task is impossible without appropriate tools helping the experts to understand the data. A promising way to support the experts in their work is the topographic mapping of the datasets to a low-dimensional space where the structure of the data can be visualized and understood.
This thesis focuses on Neural Gas and Self-Organizing Maps as particularly successful methods for prototype-based topographic maps. The aim of the thesis is to extend these methods such that they can deal with real life datasets which are possibly very huge and complex, thus probably not treatable in main memory, nor embeddable in Euclidean space. As a foundation, we propose and investigate a fast batch scheme for topographic mapping which features quadratic convergence. This formulation allows to extend the methods to general non-Euclidean settings in two ways, on the one hand by restricting prototype locations to data points, leading to so-called median variants. On the other hand, continuous prototype updates become possible by means of an equivalent formulation of the methods in terms of pairwise dissimilarities only and the notation of generalized relational prototypes, leading to so-called relational variants. Since the methods rely on the standard cost functions of Neural Gas and Self-Organizing Maps (in the version of Heskes), further extensions to incorporate auxiliary information in terms of labels, and to control the magnication exponent of the resulting prototype distribution become possible. The dependency of the models on prototypes allows to include an intuitive patch processing scheme which turns the basic algorithms into a framework which requires only constant memory and linear time complexity also for the case of general (quadratic) dissimilarity matrices. The suitability of these methods is demonstrated in several experiments including an application to text data for which the dissimilarity matrix requires almost 250 GB storage capacity.