dc.identifier.uri	http://hdl.handle.net/11401/78335
dc.description.sponsorship	This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree	en_US
dc.format	Monograph
dc.format.medium	Electronic Resource	en_US
dc.format.mimetype	Application/PDF	en_US
dc.language.iso	en_US
dc.type	Dissertation
dcterms.abstract	One fundamental question in computational genomics is to understand the relationship between genotype and phenotype. In this dissertation, I developed graphical and machine learning algorithms for large-scale genomics data, allowing accurate genotyping and molecular phenotype quantification. This work has helped to shed new light on the genetic contributions to autism spectrum disorders, intellectual disability, and other psychiatric disorders, and has enabled detailed analysis of the molecular biology of several model organisms. The first major theme of my research has been the study of genomic variation, in particular insertion and deletion (indel) mutations. As the second most common type of variation in the human genome, indels have been linked to many diseases, but indels of more than a few bases are still challenging to discover from short-read sequencing data. We present an open-source algorithm, Scalpel, which combines mapping and assembly for sensitive and specific discovery of indels. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. We characterized various types of sequencing data to investigate the sources of indel errors. We also developed a classification scheme to rank high- and low-quality calls. In a second major theme of research, I present new methods for analyzing ribosome profiling (Riboseq) data, a powerful technique for monitoring protein translation in vivo. Combined with detailed genomic variation data, this allows researchers to study how the genome influences transcription, translation, and ultimately the overall phenotype of an organism. However, there are prevalent sampling and biological biases in Riboseq data, limiting our ability to understand translation control.
To tackle these issues, I developed Scikit-ribo, the first open-source software for accurate genome-wide inference of translation efficiency (TE) and A-site prediction. Scikit-ribo accurately identifies ribosome A-site locations even with different mRNA digestion protocols and nearly perfectly reproduces the codon elongation rates in several datasets (r=0.99). Next, we show that the commonly used RPKM-derived TE is very sensitive to sampling errors and biological biases, skewing the TE estimates in all previous studies. To address this, I developed a codon-level generalized linear model with a ridge penalty to correctly estimate TE while inferring codon elongation rates and mRNA secondary structure. We performed a large-scale validation using mass spectrometry data of 1200 genes and showed very high correlation. Scikit-ribo is particularly robust for low-abundance genes, which are most commonly distorted by other approaches, and successfully corrected the TE biases for more than 2000 genes in S. cerevisiae. These improvements allow us to discover the Kozak-like consensus sequence in S. cerevisiae and a previously undiscovered biological significance in the Dhh1p study. Together, these results show that Scikit-ribo substantially improves Riboseq analysis and deepens the understanding of translation control.
dcterms.available	2018-07-09T13:34:18Z
dcterms.contributor	Schatz, Michael C.	en_US
dcterms.contributor	Lyon, Gholson J.	en_US
dcterms.contributor	Wu, Song	en_US
dcterms.contributor	MacCarthy, Tom	en_US
dcterms.contributor	Patro, Rob	en_US
dcterms.creator	Fang, Han
dcterms.dateAccepted	2018-07-09T13:34:18Z
dcterms.dateSubmitted	2018-07-09T13:34:18Z
dcterms.description	Department of Applied Mathematics and Statistics.	en_US
dcterms.extent	130 pg.	en_US
dcterms.format	Monograph
dcterms.identifier	http://hdl.handle.net/11401/78335
dcterms.identifier	Fang_grad.sunysb_0771E_13370.pdf	en_US
dcterms.issued	2017-08-01
dcterms.language	en_US
dcterms.provenance	Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2018-07-09T13:34:18Z No. of bitstreams: 1 Fang_grad.sunysb_0771E_13370.pdf: 9403941 bytes, checksum: bf64b3fc8ccc2520115bf919dfaef501 (MD5)	en
dcterms.provenance	Made available in DSpace on 2018-07-09T13:34:18Z (GMT). No. of bitstreams: 1 Fang_grad.sunysb_0771E_13370.pdf: 9403941 bytes, checksum: bf64b3fc8ccc2520115bf919dfaef501 (MD5) Previous issue date: 2017-08-01	en
dcterms.subject	Statistics
dcterms.subject	Genetics
dcterms.title	Graphical and machine learning algorithms for large-scale genomics data
dcterms.type	Dissertation

Files in this item

Name: Fang_grad.sunysb_0771E_13370.pdf
Size: 8.968Mb
Format: application/pdf

This item appears in the following Collection(s)

Stony Brook Theses and Dissertations Collection [4009]
