dc.identifier.uri: http://hdl.handle.net/1951/55377
dc.identifier.uri: http://hdl.handle.net/11401/70951
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree. [en_US]
dc.format: Monograph
dc.format.medium: Electronic Resource [en_US]
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Dissertation
dcterms.abstract: Variations in human DNA sequences are strongly associated with many diseases. Single nucleotide polymorphisms (SNPs) are the most common type of DNA variation. Our research aims to detect SNPs in data generated by polymerase chain reaction (PCR) and next-generation sequencing methods. In the first part of the study, we had a relatively small data set with few known SNPs as the training data, and we developed a classification model based on cross validation. This first part gave us knowledge of the properties of the data. In the next phase, we obtained a much larger data set with a much larger group of known SNPs. From these data we developed eight measures for every genetic position. Using these eight measures as predictor variables, we applied several classification methods, namely Random Forest (RF), Support Vector Machines (SVM), Single Decision Tree (ST) and Logistic Regression (LR), and then used cross validation to evaluate them. By comparing predictive accuracy, sensitivity and specificity, we identified the best-performing model for the data. To compare the performance of these models when the number of observations for each genetic position (cover depth) is small, we randomly drew subsets from the whole data set and applied the classification models to them. Variable selection is also applied in our study. The results show that SVM using the selected variables has a significantly higher average accuracy than the other methods in general, but RF using the selected variables performs best when the cover depth is as small as 20.
dcterms.available: 2012-05-15T18:02:22Z
dcterms.available: 2015-04-24T14:45:13Z
dcterms.contributor: Ahn, Hongshik [en_US]
dcterms.contributor: Hongshik Ahn [en_US]
dcterms.contributor: Nancy Mendell [en_US]
dcterms.contributor: Stephen Finch [en_US]
dcterms.contributor: Sangjin Hong. [en_US]
dcterms.creator: Cai, Shengnan
dcterms.dateAccepted: 2012-05-15T18:02:22Z
dcterms.dateAccepted: 2015-04-24T14:45:13Z
dcterms.dateSubmitted: 2012-05-15T18:02:22Z
dcterms.dateSubmitted: 2015-04-24T14:45:13Z
dcterms.description: Department of Applied Mathematics and Statistics [en_US]
dcterms.format: Monograph
dcterms.format: Application/PDF [en_US]
dcterms.identifier: Cai_grad.sunysb_0771E_10371.pdf [en_US]
dcterms.identifier: http://hdl.handle.net/1951/55377
dcterms.identifier: http://hdl.handle.net/11401/70951
dcterms.issued: 2010-12-01
dcterms.language: en_US
dcterms.provenance: Made available in DSpace on 2012-05-15T18:02:22Z (GMT). No. of bitstreams: 1 Cai_grad.sunysb_0771E_10371.pdf: 1863725 bytes, checksum: 15e657c9ad160b2db620914e810b33ba (MD5) Previous issue date: 1 [en]
dcterms.provenance: Made available in DSpace on 2015-04-24T14:45:13Z (GMT). No. of bitstreams: 3 Cai_grad.sunysb_0771E_10371.pdf.jpg: 1894 bytes, checksum: a6009c46e6ec8251b348085684cba80d (MD5) Cai_grad.sunysb_0771E_10371.pdf.txt: 117336 bytes, checksum: 90b705b80b14e7299413d9542746063f (MD5) Cai_grad.sunysb_0771E_10371.pdf: 1863725 bytes, checksum: 15e657c9ad160b2db620914e810b33ba (MD5) Previous issue date: 1 [en]
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Classification models, Cross validation, Next generation sequencing, SNP detection, Variable selection
dcterms.subject: Statistics
dcterms.title: Statistical Models for SNP Detection
dcterms.type: Dissertation
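
The abstract describes evaluating classifiers by cross validation and comparing accuracy, sensitivity and specificity. The sketch below illustrates only that evaluation loop, in plain Python. It is not the dissertation's code: the function names, the one-feature midpoint-threshold classifier and the toy data are all hypothetical stand-ins, since this record does not give the eight measures or the model settings.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def confusion_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, sensitivity and specificity for binary labels (1 = SNP, 0 = non-SNP)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }


def k_fold_splits(n: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train, test) index pairs; fold i tests indices i, i + k, i + 2k, ..."""
    return [
        ([j for j in range(n) if j % k != i], list(range(i, n, k)))
        for i in range(k)
    ]


def cross_validate(
    X: Sequence[float],
    y: Sequence[int],
    fit: Callable[[List[float], List[int]], Callable[[float], int]],
    k: int = 5,
) -> Dict[str, float]:
    """Collect out-of-fold predictions over k folds, then score them once."""
    y_pred = [0] * len(y)
    for train, test in k_fold_splits(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        for i in test:
            y_pred[i] = model(X[i])
    return confusion_metrics(y, y_pred)


def fit_midpoint_threshold(X_train: List[float], y_train: List[int]) -> Callable[[float], int]:
    """Hypothetical stand-in classifier: cut halfway between the two class means.

    Assumes the positive class has the larger mean and both classes are present.
    """
    pos = [x for x, t in zip(X_train, y_train) if t == 1]
    neg = [x for x, t in zip(X_train, y_train) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


# Toy, made-up data: one score per genetic position, label 1 for a known SNP.
toy_X = [0.05, 0.10, 0.15, 0.20, 0.25, 0.75, 0.80, 0.85, 0.90, 0.95]
toy_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
scores = cross_validate(toy_X, toy_y, fit_midpoint_threshold, k=5)
```

Any model that can be wrapped as a `fit` function returning a predictor could be passed to `cross_validate` in place of the threshold rule, which is how several methods could be compared on the same folds.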