Development and Application of an Integrated Parallel Platform on Short–read Sequences Assembly

dc.identifier.uri	http://hdl.handle.net/11401/77144
dc.description.sponsorship	This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree.	en_US
dc.format	Monograph
dc.format.medium	Electronic Resource	en_US
dc.language.iso	en_US
dc.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type	Dissertation
dcterms.abstract	Rapid and automated next-generation sequencing (NGS) methods have emerged recently and significantly accelerated research in the biological and medical fields. High-throughput NGS usually generates billions of short reads, which poses great bioinformatics challenges in extracting meaningful information from these massive data; one such challenge is de novo assembly. At the same time, the fast development of massively parallel processing (MPP) systems presents a substantial opportunity for processing larger datasets. Applying supercomputing innovations to NGS research is therefore a promising strategy; however, this application is not straightforward and requires new algorithms and parallel designs for efficient implementation. In this thesis, we develop and present PPLAT, an integrated hierarchical multitasking parallel platform framework, and PPASSEM, a novel genome assembler built on PPLAT. PPLAT is designed for distributed storage and distributed processing of big data by enabling asynchronous computing and message passing, and provides a hybrid multithreading- and MPI-based solution for MPP systems with simple APIs and great flexibility. We demonstrate the power of PPLAT to significantly reduce coding and debugging complexity and to facilitate high performance in derived parallel programs. PPASSEM is a novel application built on PPLAT that combines small-scale shared-memory multithreading with large-scale distributed-memory parallelism, using a de Bruijn graph data structure for short-read sequence data. Our parallel platform has been tested on commodity computer clusters with both simulated and real data. Our results show that PPLAT can effectively handle billions of short reads (~500GB), and that PPASSEM can generate accurate assembly constructs in much less time than other well-known benchmark assemblers such as ABySS and PASHA.
As new additions to the existing NGS toolbox, we expect that PPLAT and PPASSEM will greatly facilitate future NGS-based research.
dcterms.available	2017-09-20T16:52:04Z
dcterms.contributor	Zhu, Wei	en_US
dcterms.contributor	Wu, Song	en_US
dcterms.contributor	Wang, Xuefeng	en_US
dcterms.contributor	Jia, Jiangyong	en_US
dcterms.creator	He, Fei
dcterms.dateAccepted	2017-09-20T16:52:04Z
dcterms.dateSubmitted	2017-09-20T16:52:04Z
dcterms.description	Department of Applied Mathematics and Statistics	en_US
dcterms.extent	93 pg.	en_US
dcterms.format	Monograph
dcterms.format	Application/PDF	en_US
dcterms.identifier	http://hdl.handle.net/11401/77144
dcterms.issued	2016-12-01
dcterms.language	en_US
dcterms.provenance	Made available in DSpace on 2017-09-20T16:52:04Z (GMT). No. of bitstreams: 1 He_grad.sunysb_0771E_12795.pdf: 1620315 bytes, checksum: 18047fb060ee6ae705f47cf6478a3597 (MD5) Previous issue date: 1	en
dcterms.publisher	The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject	Applied mathematics -- Bioinformatics
dcterms.title	Development and Application of an Integrated Parallel Platform on Short–read Sequences Assembly
dcterms.type	Dissertation

Files in this item

Name: He_grad.sunysb_0771E_12795.pdf
Size: 1.545Mb
Format: application/pdf

This item appears in the following Collection(s)

Stony Brook Theses and Dissertations Collection [4009]
