Show simple item record

dc.identifier.uri: http://hdl.handle.net/11401/77810
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree. (en_US)
dc.format: Monograph
dc.format.medium: Electronic Resource (en_US)
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Dissertation
dcterms.abstract: We built a Natural Language Processing (NLP) pipeline for each Wikipedia language through semi-supervised learning. Each pipeline consists of a language-specific tokenizer, sentence segmenter, morphological analyzer, part-of-speech (POS) tagger, sentiment analyzer, and Named Entity Recognition (NER) annotator. We automatically learn features (embeddings) for each word in each language using continuous-space language models, which capture syntactic and semantic characteristics of the language. We use these embeddings as features to train part-of-speech taggers with the help of human-annotated datasets. To cover more languages, we combine these features with annotations automatically extracted from Wikipedia to build a semi-supervised NER system. With a strong prior (word embeddings) and simple statistical methods, we overcome the noise and bias introduced by the Wikipedia style guidelines. To demonstrate the quality of our work, we propose new evaluation metrics that accommodate the large number of languages we target. Furthermore, all the pipelines are available to the community to use and study through the software package polyglot (available at http://polyglot-nlp.com).
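The embeddings-as-features idea described in the abstract can be sketched in a few lines. This is an illustrative toy, not the polyglot implementation: the two-dimensional embeddings and the word/tag data below are made up, and a nearest-centroid classifier stands in for the trained tagger.

```python
# Toy sketch of using pre-trained word embeddings as features for a
# POS tagger. The embeddings and labels are fabricated for illustration.

EMBEDDINGS = {
    "cat": (0.9, 0.1), "dog": (0.8, 0.2),      # noun-like region
    "runs": (0.1, 0.9), "sleeps": (0.2, 0.8),  # verb-like region
}
TRAIN = [("cat", "NOUN"), ("dog", "NOUN"), ("runs", "VERB")]

def centroids(train):
    """Average the embedding of every training word per tag."""
    sums, counts = {}, {}
    for word, pos in train:
        vec = EMBEDDINGS[word]
        acc = sums.setdefault(pos, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[pos] = counts.get(pos, 0) + 1
    return {t: tuple(v / counts[t] for v in acc) for t, acc in sums.items()}

def tag(word, cents):
    """Assign the tag whose centroid is closest to the word's embedding."""
    vec = EMBEDDINGS[word]
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(cents, key=lambda t: dist(cents[t]))

cents = centroids(TRAIN)
print(tag("sleeps", cents))  # -> VERB: an unseen word tagged via its embedding
```

Because the embedding space places syntactically similar words near each other, even a word absent from the annotated data ("sleeps" above) lands near the correct tag's centroid; the dissertation's actual taggers replace this toy classifier with models trained on the human-annotated datasets.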
dcterms.available: 2017-09-26T16:36:27Z
dcterms.contributor: Akoglu, Leman (en_US)
dcterms.contributor: Skiena, Steven (en_US)
dcterms.contributor: Choi, Yejin (en_US)
dcterms.contributor: Bottou, Leon (en_US)
dcterms.creator: Al-Rfou, Rami
dcterms.dateAccepted: 2017-09-26T16:36:27Z
dcterms.dateSubmitted: 2017-09-26T16:36:27Z
dcterms.description: Department of Computer Science. (en_US)
dcterms.extent: 108 pg. (en_US)
dcterms.format: Monograph
dcterms.format: Application/PDF (en_US)
dcterms.identifier: http://hdl.handle.net/11401/77810
dcterms.identifier: AlRfou_grad.sunysb_0771E_12300.pdf (en_US)
dcterms.issued: 2015-05-01
dcterms.language: en_US
dcterms.provenance: Submitted by Jason Torre (fjason.torre@stonybrook.edu) on 2017-09-26T16:36:27Z. No. of bitstreams: 1. AlRfou_grad.sunysb_0771E_12300.pdf: 5379214 bytes, checksum: bb35b4646b68ffe959fc99c757145213 (MD5) (en)
dcterms.provenance: Made available in DSpace on 2017-09-26T16:36:27Z (GMT). No. of bitstreams: 1. AlRfou_grad.sunysb_0771E_12300.pdf: 5379214 bytes, checksum: bb35b4646b68ffe959fc99c757145213 (MD5). Previous issue date: 2015-05-01 (en)
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Computer science
dcterms.subject: Machine Learning, Multilingual, Natural Language Processing
dcterms.title: Polyglot: A Massive Multilingual Natural Language Processing Pipeline
dcterms.type: Dissertation

