dcterms.abstract | Today's modeling and analysis of high-dimensional data either relies on human expertise to hand-craft task-specific data representations, which suffers significantly from the ever-increasing complexity and the unknown patterns of new data, or relies on simple data-driven approaches that tend to lose the fundamental physical insights of real-world datasets. It is therefore very difficult, with today's modeling practice, to detect reliable patterns and information in high-dimensional data efficiently, effectively, and in an unsupervised manner. In this dissertation, we developed a scalable data modeling framework that utilizes modern theoretical physics for unsupervised high-dimensional data analysis and mining. Not only does it have a solid theoretical foundation, but it is also capable of handling diverse tasks such as clustering, anomaly detection, and feature selection. The framework also has a probabilistic interpretation that avoids sensitivity to scaling-parameter tuning and to noise in real-world applications. Furthermore, we presented a fast approximation approach that makes the framework applicable to large-scale datasets with high efficiency and effectiveness. In this dissertation research, we made the following salient contributions. We proposed a diffusion-based Aggregated Heat Kernel (AHK) to improve clustering stability, and a Local Density Affinity Transformation (LDAT) to correct the bias originating from different cluster densities. Our proposed framework integrates these two techniques systematically; as a result, it not only provides an advanced noise-resistant and density-aware spectral mapping of the original datasets, but also maintains clustering stability while the scaling parameters are tuned. We devised a Local Anomaly Descriptor (LAD) that faithfully reveals the intrinsic neighborhood density to detect anomalies. LAD bridges global and local properties, which makes it self-adaptive to each sample's neighborhood. To offer better stability of local density measurement under scaling-parameter tuning, we formulated a Fermi Density Descriptor (FDD). FDD steadily distinguishes anomalies from normal instances under most scaling-parameter settings. We also quantified and examined the effect of different Laplacian normalizations on anomaly detection. We developed a robust feature selection algorithm, called Noise-Resistant Unsupervised Feature Selection (NRFS). It measures multi-perspective correlations that reflect the importance of features with respect to noise-resistant instance representatives and the different global trends revealed by spectral decomposition. In this way, the model concisely captures a wide variety of local patterns and selects high-quality representative features. To apply the above techniques to large-scale real-world data mining, we mitigated the space and time complexity of spectral embedding by proposing a Diverse Power Iteration Embedding (DPIE). We tested DPIE on various applications (e.g., clustering, anomaly detection, and feature selection). The experimental results showed that our proposed DPIE is more effective than popular spectral approximation methods, and even achieves quality similar to that of classic spectral embedding derived from eigen-decomposition, while being extremely fast on big-data applications. Finally, we provided a brief introduction to our ongoing work and future research directions.
By elaborating on the works developed within the proposed framework, we showed that our scalable physics-based unsupervised data modeling is potent and promising for large-scale, high-dimensional data analysis, data mining, and knowledge discovery. It is a rich and fruitful area for research in terms of both theory and applications. | |