Show simple item record

dc.identifier.uri: http://hdl.handle.net/11401/77810
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree. (en_US)
dc.format: Monograph
dc.format.medium: Electronic Resource (en_US)
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Dissertation
dcterms.abstract: We built a Natural Language Processing (NLP) pipeline for each Wikipedia language through semi-supervised learning. Each pipeline consists of a language-specific tokenizer, sentence segmenter, morphological analyzer, part-of-speech (POS) tagger, sentiment analyzer, and Named Entity Recognition (NER) annotator. We automatically learn features (embeddings) for each word in each language using continuous-space language models, which capture syntactic and semantic characteristics of the language. We use these embeddings as features to train part-of-speech taggers with the help of human-annotated datasets. To cover more languages, we combine these features with annotations automatically extracted from Wikipedia to build a semi-supervised NER system. With a strong prior (word embeddings) and simple statistical methods, we overcome the noise and bias introduced by the Wikipedia style guidelines. To demonstrate the quality of our work, we propose new evaluation metrics that accommodate the large number of languages we target. Furthermore, all the pipelines are available to the community to use and study through the software package polyglot (available at http://polyglot-nlp.com).
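The embeddings-as-features idea described in the abstract can be sketched in a few lines. This is an illustrative toy, not the polyglot implementation: the two-dimensional embeddings and the word/tag data below are made up, and a nearest-centroid classifier stands in for the trained tagger.

```python
# Toy sketch of using pre-trained word embeddings as features for a
# POS tagger. The embeddings and labels are fabricated for illustration.

EMBEDDINGS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2),      # noun-like region
    "runs": (0.1, 0.9), "sleeps": (0.2, 0.8),  # verb-like region
}
TRAIN = [("cat", "NOUN"), ("dog", "NOUN"), ("runs", "VERB")]

def centroids(train):
    """Average the embedding of every training word per tag."""
    sums, counts = {}, {}
    for word, pos in train:
        vec = EMBEDDINGS[word]
        acc = sums.setdefault(pos, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[pos] = counts.get(pos, 0) + 1
    return {t: tuple(v / counts[t] for v in acc) for t, acc in sums.items()}

def tag(word, cents):
    """Assign the tag whose centroid is closest to the word's embedding."""
    vec = EMBEDDINGS[word]
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(cents, key=lambda t: dist(cents[t]))

cents = centroids(TRAIN)
print(tag("sleeps", cents))  # -> VERB: an unseen word tagged via its embedding
```

Because the embedding space places syntactically similar words near each other, even a word absent from the annotated data ("sleeps" above) lands near the correct tag's centroid; the dissertation's actual taggers replace this toy classifier with models trained on the human-annotated datasets.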
dcterms.available: 2017-09-26T16:36:27Z
dcterms.contributor: Akoglu, Leman (en_US)
dcterms.contributor: Skiena, Steven (en_US)
dcterms.contributor: Choi, Yejin (en_US)
dcterms.contributor: Bottou, Leon (en_US)
dcterms.creator: Al-Rfou, Rami
dcterms.dateAccepted: 2017-09-26T16:36:27Z
dcterms.dateSubmitted: 2017-09-26T16:36:27Z
dcterms.description: Department of Computer Science. (en_US)
dcterms.extent: 108 pg. (en_US)
dcterms.format: Monograph
dcterms.format: Application/PDF (en_US)
dcterms.identifier: http://hdl.handle.net/11401/77810
dcterms.identifier: AlRfou_grad.sunysb_0771E_12300.pdf (en_US)
dcterms.issued: 2015-05-01
dcterms.language: en_US
dcterms.provenance: Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2017-09-26T16:36:27Z. No. of bitstreams: 1. AlRfou_grad.sunysb_0771E_12300.pdf: 5379214 bytes, checksum: bb35b4646b68ffe959fc99c757145213 (MD5) (en)
dcterms.provenance: Made available in DSpace on 2017-09-26T16:36:27Z (GMT). No. of bitstreams: 1. AlRfou_grad.sunysb_0771E_12300.pdf: 5379214 bytes, checksum: bb35b4646b68ffe959fc99c757145213 (MD5). Previous issue date: 2015-05-01 (en)
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Computer science
dcterms.subject: Machine Learning, Multilingual, Natural Language Processing
dcterms.title: Polyglot: A Massive Multilingual Natural Language Processing Pipeline
dcterms.type: Dissertation

