dc.identifier.uri	http://hdl.handle.net/11401/77285
dc.description.sponsorship	This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.	en_US
dc.format	Monograph
dc.format.medium	Electronic Resource	en_US
dc.language.iso	en_US
dc.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type	Dissertation
dcterms.abstract	Today's modeling and analysis of high-dimensional data either relies on human expertise to hand-craft a set of task-specific data, which suffers significantly from the ever-increasing complexity and unknown patterns of new data, or relies on simple data-driven approaches, which tend to lose the fundamental physical insights of real-world datasets. It is therefore very difficult with today's modeling practice to detect reliable patterns and information in high-dimensional data efficiently, effectively, and without supervision. In this dissertation, we developed a scalable data modeling framework that utilizes modern theoretical physics for unsupervised high-dimensional data analysis and mining. Not only does it have a solid theoretical background, but it is also capable of handling different tasks (clustering, anomaly detection, feature selection, etc.). The framework also has a probabilistic interpretation that avoids sensitivity to scaling parameter tuning and to the presence of noise in real-world applications. Furthermore, we presented a fast approximation approach that makes the framework applicable to large-scale datasets with high efficiency and effectiveness. In this dissertation research, we made the following salient contributions: We proposed a diffusion-based Aggregated Heat Kernel (AHK) to improve clustering stability, and a Local Density Affinity Transformation (LDAT) to correct the bias originating from different cluster densities. Our proposed framework integrates these two techniques systematically. As a result, it not only provides an advanced noise-resistant and density-aware spectral mapping of the original datasets, but also demonstrates clustering stability while the scaling parameters are tuned. We devised a Local Anomaly Descriptor (LAD) that faithfully reveals the intrinsic neighborhood density to detect anomalies. 
LAD bridges global and local properties, which makes it self-adaptive to the neighborhoods of different samples. To make the local density measurement more stable under scaling parameter tuning, we formulated a Fermi Density Descriptor (FDD). FDD steadily distinguishes anomalies from normal instances under most scaling parameter settings. We also quantified and examined the effect of different Laplacian normalizations on anomaly detection. We developed a robust feature selection algorithm called Noise-Resistant Unsupervised Feature Selection (NRFS). It measures multi-perspective correlation that reflects the importance of features with respect to noise-resistant instance representatives and different global trends from spectral decomposition. In this way, the model concisely captures a wide variety of local patterns and selects high-quality representative features. We mitigated the space and time complexity of spectral embedding, in order to apply the above techniques to real-world large-scale data mining, by proposing a Diverse Power Iteration Embedding (DPIE). We tested DPIE on various applications (e.g., clustering, anomaly detection, and feature selection). The experimental results showed that our proposed DPIE is more effective than popular spectral approximation methods, and even attains quality similar to that of the classic spectral embedding derived from eigen-decomposition. Moreover, DPIE is extremely fast on big data applications. Finally, we provided a brief introduction to our ongoing work and future research directions. By elaborating on the work developed within the proposed framework, we showed that our scalable physics-based unsupervised data modeling is potent and promising for large-scale and high-dimensional data analysis, data mining, and knowledge discovery. It is a rich and fruitful area for research in terms of both theory and applications.
dcterms.available	2017-09-20T16:52:21Z
dcterms.contributor	Yu, Dantong	en_US
dcterms.contributor	Qin, Hong	en_US
dcterms.contributor	Zhu, Wei	en_US
dcterms.contributor	Ortiz, Luis E	en_US
dcterms.contributor	Hua, Jing	en_US
dcterms.creator	Huang, Hao
dcterms.dateAccepted	2017-09-20T16:52:21Z
dcterms.dateSubmitted	2017-09-20T16:52:21Z
dcterms.description	Department of Computer Science.	en_US
dcterms.extent	221 pg.	en_US
dcterms.format	Monograph
dcterms.format	Application/PDF	en_US
dcterms.identifier	http://hdl.handle.net/11401/77285
dcterms.issued	2014-12-01
dcterms.language	en_US
dcterms.provenance	Made available in DSpace on 2017-09-20T16:52:21Z (GMT). No. of bitstreams: 1 Huang_grad.sunysb_0771E_12103.pdf: 9212514 bytes, checksum: d3b81f7f9fbd8941461a43e03a47defa (MD5) Previous issue date: 1	en
dcterms.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject	Computer science
dcterms.title	A Scalable Physics-based Data Modeling Framework to Unsupervised High-Dimensional Data Mining
dcterms.type	Dissertation

Files in this item

Name: Huang_grad.sunysb_0771E_12103.pdf
Size: 8.785Mb
Format: application/pdf

This item appears in the following Collection(s)

Stony Brook Theses and Dissertations Collection [4009]