dc.identifier.uri: http://hdl.handle.net/11401/77292
dc.description.sponsorship: This work is sponsored by the Stony Brook University Graduate School in compliance with the requirements for completion of degree. [en_US]
dc.format: Monograph
dc.format.medium: Electronic Resource [en_US]
dc.language.iso: en_US
dc.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dc.type: Thesis
dcterms.abstract: With massive amounts of unannotated text available from myriad sources, learning representations useful for natural language processing (NLP) tasks is an increasingly popular research area. Using deep learning techniques, we learn distributed representations of words (word embeddings) for 40 languages, using Wikipedia as the source of text. These representations map each word to a point in feature space, capture useful semantic and syntactic properties of words, and have been shown to be useful in NLP tasks such as part-of-speech (POS) tagging. We have built two classes of word embeddings, Polyglot and Skipgram, for these languages. We build a named entity recognition (NER) system that supports 40 languages, using the word embeddings we have generated as features and freely available Wikipedia text as training data. This involves training language models to obtain the word embeddings, understanding the properties of the learned word embeddings, and finally learning models for named entity classification. We also present a novel technique for evaluating performance on the many languages for which no gold-standard test set exists. Our results demonstrate that word embeddings exhibit clear community structure, can be used effectively for NER with no explicit hand-crafted feature engineering, and perform competitively with existing baselines when coupled with simple language-agnostic techniques.
dcterms.available: 2017-09-20T16:52:21Z
dcterms.contributor: Akoglu, Leman [en_US]
dcterms.contributor: Skiena, Steven [en_US]
dcterms.contributor: Choi, Yejin [en_US]
dcterms.contributor: Ramakrishnan, I.V. [en_US]
dcterms.creator: Kulkarni, Vivek V.
dcterms.dateAccepted: 2017-09-20T16:52:21Z
dcterms.dateSubmitted: 2017-09-20T16:52:21Z
dcterms.description: Department of Computer Science. [en_US]
dcterms.extent: 54 pg. [en_US]
dcterms.format: Application/PDF [en_US]
dcterms.format: Monograph
dcterms.identifier: http://hdl.handle.net/11401/77292
dcterms.issued: 2014-12-01
dcterms.language: en_US
dcterms.provenance: Made available in DSpace on 2017-09-20T16:52:21Z (GMT). No. of bitstreams: 1 Kulkarni_grad.sunysb_0771M_11763.pdf: 3131809 bytes, checksum: 5416190cdda412864b5ebb804ccc6964 (MD5) Previous issue date: 1 [en]
dcterms.publisher: The Graduate School, Stony Brook University: Stony Brook, NY.
dcterms.subject: Computer science
dcterms.subject: big data, complex networks, machine learning, named entity recognition, natural language processing
dcterms.title: Multilingual Named Entity Recognition
dcterms.type: Thesis

