Show simple item record

dc.identifier.uri: http://hdl.handle.net/11401/78335
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree [en_US]
dc.format: Monograph
dc.format.medium: Electronic Resource [en_US]
dc.format.mimetype: Application/PDF [en_US]
dc.language.iso: en_US
dc.type: Dissertation
dcterms.abstract: One fundamental question in computational genomics is how genotype relates to phenotype. In this dissertation, I developed graphical and machine learning algorithms for large-scale genomics data that allow accurate genotyping and molecular phenotype quantification. This work has helped to shed new light on the genetic contributions to autism spectrum disorders, intellectual disability, and other psychiatric disorders, and has enabled detailed analysis of the molecular biology of several model organisms. The first major theme of my research is the study of genomic variation, in particular insertion and deletion (indel) mutations. As the second most common type of variation in the human genome, indels have been linked to many diseases, but indels of more than a few bases are still challenging to discover from short-read sequencing data. We present an open-source algorithm, Scalpel, which combines mapping and assembly for sensitive and specific discovery of indels. A detailed repeat analysis coupled with a self-tuning k-mer strategy allows Scalpel to outperform other state-of-the-art approaches for indel discovery, particularly in regions containing near-perfect repeats. We characterized various types of sequencing data to investigate the sources of indel errors. We also developed a classification scheme to rank high- and low-quality calls. In a second major theme of research, I present new methods for analyzing ribosome profiling (Riboseq) data, a powerful technique for monitoring protein translation in vivo. Combined with detailed genomic variation data, Riboseq allows researchers to study how the genome influences transcription, translation, and ultimately the overall phenotype of an organism. However, sampling and biological biases are prevalent in Riboseq data, limiting our ability to understand translation control.
To tackle these issues, I developed Scikit-ribo, the first open-source software for accurate genome-wide inference of translation efficiency (TE) and A-site prediction. Scikit-ribo accurately identifies ribosome A-site locations even with different mRNA digestion protocols and nearly perfectly reproduces the codon elongation rates in several datasets (r=0.99). Next, we show that the commonly used RPKM-derived TE is very sensitive to sampling errors and biological biases, skewing the TE estimates in all previous studies. To address this, I developed a codon-level generalized linear model with a ridge penalty to correctly estimate TE while inferring codon elongation rates and mRNA secondary structure. We performed a large-scale validation using mass spectrometry data of 1200 genes and showed very high correlation. Scikit-ribo is particularly robust for low-abundance genes, whose estimates are most commonly distorted by other approaches, and it corrected the TE biases for more than 2000 genes in S. cerevisiae. These improvements allowed us to discover the Kozak-like consensus sequence in S. cerevisiae and to uncover previously unrecognized biological significance in the Dhh1p study. Together, these results show that Scikit-ribo substantially improves Riboseq analysis and deepens the understanding of translation control.
dcterms.available: 2018-07-09T13:34:18Z
dcterms.contributor: Schatz, Michael C. [en_US]
dcterms.contributor: Lyon, Gholson J. [en_US]
dcterms.contributor: Wu, Song [en_US]
dcterms.contributor: MacCarthy, Tom [en_US]
dcterms.contributor: Patro, Rob. [en_US]
dcterms.creator: Fang, Han
dcterms.dateAccepted: 2018-07-09T13:34:18Z
dcterms.dateSubmitted: 2018-07-09T13:34:18Z
dcterms.description: Department of Applied Mathematics and Statistics. [en_US]
dcterms.extent: 130 pg. [en_US]
dcterms.format: Monograph
dcterms.identifier: http://hdl.handle.net/11401/78335
dcterms.identifier: Fang_grad.sunysb_0771E_13370.pdf [en_US]
dcterms.issued: 2017-08-01
dcterms.language: en_US
dcterms.provenance: Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2018-07-09T13:34:18Z No. of bitstreams: 1 Fang_grad.sunysb_0771E_13370.pdf: 9403941 bytes, checksum: bf64b3fc8ccc2520115bf919dfaef501 (MD5) [en]
dcterms.provenance: Made available in DSpace on 2018-07-09T13:34:18Z (GMT). No. of bitstreams: 1 Fang_grad.sunysb_0771E_13370.pdf: 9403941 bytes, checksum: bf64b3fc8ccc2520115bf919dfaef501 (MD5) Previous issue date: 2017-08-01 [en]
dcterms.subject: Statistics
dcterms.subject: Genetics
dcterms.title: Graphical and machine learning algorithms for large-scale genomics data
dcterms.type: Dissertation

