The True Identity of a Computational Linguist.pptx

  • Number of slides: 56

The True Identity of a Computational Linguist
Mariana Romanyshyn, Computational Linguist, Tech Lead at Grammarly Inc.

What do comp. linguists do?
- Machine Translation
- Question-Answering Systems
- Sentiment Analysis
- Error Correction
- Text Summarization
- Search Engines
- News Ranking
- ...

Competencies
1. Basic tech skills
2. Linguistics
3. Programming
4. NLP technologies

1. Basic tech skills
- Regular expressions
- Shell commands
- Text editors
- Logic
- Version control systems

Regular expressions
- great for matching strings
- good for parsing structured data
- really-really cool ;)
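
A minimal sketch of "matching strings" with Python's standard `re` module. The sample text and the `find_dates` helper are invented for illustration; the pattern only recognizes ISO-style dates and does not validate month or day ranges.

```python
import re

# Hypothetical sample text, invented for illustration.
TEXT = "2014-05-16 filed suit; 2014-06-01 hearing; not-a-date 20140601"

# Four digits, two digits, two digits, separated by hyphens.
DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def find_dates(text):
    """Return (year, month, day) integer triples for every ISO-style date."""
    return [tuple(int(g) for g in m.groups()) for m in DATE.finditer(text)]


print(find_dates(TEXT))
```

The unhyphenated `20140601` is skipped because the pattern requires the separators.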

Shell
- forget about Windows; meet Linux and OS X
- creating/moving/removing files/directories
- grep
- cat/sort/wc/uniq
- vim/nano
- ssh

Text Editors
Forget about MS Office; use Sublime Text 2, Emacs or at least Notepad++.
Profit:
- speed
- macros
- regular expressions
- case conversions
- line permutations
- multiple cursors
- marks
- tabs
- syntax highlighting
- loads of cool modes'n'plugins
+ programming

Logic
- truth tables
- Euler diagrams
- syllogism
- laws of thought
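
As a small illustration of truth tables, the sketch below enumerates every assignment of two propositions and checks that an implication agrees with its contrapositive. The helper names `truth_table` and `implies` are assumptions made for this example.

```python
from itertools import product


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def truth_table(expr, names):
    """Evaluate expr for every True/False assignment to the given names."""
    rows = []
    for values in product([False, True], repeat=len(names)):
        rows.append((values, expr(**dict(zip(names, values)))))
    return rows


# (p -> q) should have the same value as (not q -> not p) in every row.
table = truth_table(
    lambda p, q: implies(p, q) == implies(not q, not p), ["p", "q"]
)
for values, result in table:
    print(values, result)
```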

Version control systems

2. Linguistics
- Structural Linguistics
- Pattern Recognition
- Linguistic ambiguities
- Research skills

Constituency Trees

Dependency Trees

Pattern Recognition: Error Correction (EC)

Pattern Recognition: Error Correction (EC)

Pattern Recognition: Sentiment Analysis (SA)
Apple dealt a blow to Samsung.

Pattern Recognition: Sentiment Analysis (SA)
Apple dealt a blow to Samsung.
Apple dealt Samsung a blow.
Samsung was dealt a blow by Apple.
A blow was dealt to Samsung by Apple.

Phonetic ambiguities

Morphological ambiguities

Syntactic ambiguities
I'm glad I'm a man, and so is Lola.

Semantic ambiguities

3. Programming
- Scripting
- OOP
- Statistics
- Scraping a web page
- Algorithms
- Data structures

Scripting
Language: Perl, Python, Ruby, etc.
Simple text processing tasks:
- collect test cases from corpora
- compile dictionaries
- parse corpora or dictionaries
- count any kind of statistics
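
A short sketch of the "count any kind of statistics" task: word frequencies over a tiny corpus. The corpus string reuses two sentences from the Sentiment Analysis slide; the lowercase-letters-only tokenization is a simplifying assumption, not a recommended tokenizer.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Lowercase the text and count runs of letters and apostrophes."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


corpus = "Apple dealt a blow to Samsung. Apple dealt Samsung a blow."
freq = word_frequencies(corpus)
print(freq.most_common(3))
```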

Object-oriented programming
Language: Java, C++, Python, etc.
Tasks that require creating/using non-primitive classes:
- patterns and all kinds of rules
- working with parse trees
- working with NLP libraries
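
To make "working with parse trees" concrete, here is a minimal sketch of a constituency-tree class. The `Node` class, its method names, and the hand-built tree with Penn-style labels are assumptions for illustration, not the API of any NLP library.

```python
class Node:
    """A constituency-tree node: a label plus either children or a word."""

    def __init__(self, label, children=None, word=None):
        self.label = label
        self.children = children or []
        self.word = word

    def is_leaf(self):
        return self.word is not None

    def leaves(self):
        """Return the words of the sentence in left-to-right order."""
        if self.is_leaf():
            return [self.word]
        return [w for child in self.children for w in child.leaves()]

    def bracketed(self):
        """Render the tree in bracketed notation."""
        if self.is_leaf():
            return f"({self.label} {self.word})"
        inner = " ".join(child.bracketed() for child in self.children)
        return f"({self.label} {inner})"


# Hand-built tree for "Apple dealt a blow to Samsung" (labels are illustrative).
tree = Node("S", [
    Node("NP", [Node("NNP", word="Apple")]),
    Node("VP", [
        Node("VBD", word="dealt"),
        Node("NP", [Node("DT", word="a"), Node("NN", word="blow")]),
        Node("PP", [
            Node("TO", word="to"),
            Node("NP", [Node("NNP", word="Samsung")]),
        ]),
    ]),
])

print(" ".join(tree.leaves()))
print(tree.bracketed())
```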

Statistics
Language: R, Python, Octave/MATLAB, etc.
Statistical libraries:
- matrices and vectors
- Machine Learning
- Neural Networks
- Big Data

Scraping
Techniques: XPath, Python (lxml, BeautifulSoup, requests), Chrome Scraper, Kimono Labs, etc.
Tasks:
- get dictionaries/corpora/test cases/any kind of information from the web :)
- send requests to APIs
P.S. Don't parse web pages with regular expressions!
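
A sketch of parsing HTML with a real parser instead of regular expressions. The slide names lxml and BeautifulSoup; this example swaps in the standard-library `html.parser` so that it runs without extra installs, and it parses an inline, made-up page rather than fetching anything from the web. `LinkCollector` and `extract_links` are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


# Made-up page for illustration; the URLs are the ones mentioned in the deck.
PAGE = """
<html><body>
  <a href="http://linguistlist.org">Jobs</a>
  <a name="anchor-only">no link</a>
  <p>See <a href='https://aclweb.org/'>ACL</a>.</p>
</body></html>
"""


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_links(PAGE))
```

The parser handles both quoting styles and skips the anchor without an `href`, cases that a hand-written regular expression tends to get wrong.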

Algorithms
- asymptotic notation (big O)
- recursion
- search
- sort
- shortest path
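
Several of these topics meet in binary search: it is recursive, it requires sorted input, and it runs in O(log n) comparisons. A minimal sketch follows, with a made-up word list as the lexicon.

```python
def binary_search(items, target, lo=0, hi=None):
    """Return the index of target in the sorted list items, or -1 if absent."""
    if hi is None:
        hi = len(items)
    if lo >= hi:  # empty search range: target is not present
        return -1
    mid = (lo + hi) // 2
    if items[mid] == target:
        return mid
    if items[mid] < target:
        return binary_search(items, target, mid + 1, hi)  # search right half
    return binary_search(items, target, lo, mid)  # search left half


# Made-up lexicon; sorting first is what makes binary search valid.
lexicon = sorted(["blow", "apple", "samsung", "dealt", "court"])
print(lexicon)
print(binary_search(lexicon, "dealt"))
```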

Data Structures
- arrays
- linked lists
- stacks
- queues
- heaps
- hash-tables
- trees
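
One structure that combines trees and hash tables and is handy for dictionaries is a trie (prefix tree). The sketch below is an assumption about what might be useful here, not something the slide prescribes; each node stores its children in a Python dict, and the prefix search uses an explicit stack.

```python
class Trie:
    """Prefix tree: each node maps a character to a child node."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def starts_with(self, prefix):
        """Return all stored words that begin with prefix, sorted."""
        node = self
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]  # depth-first walk below the prefix node
        while stack:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            for ch, child in current.children.items():
                stack.append((child, text + ch))
        return sorted(results)


trie = Trie()
for entry in ["tree", "treebank", "trie", "tag"]:
    trie.insert(entry)
print(trie.starts_with("tre"))
```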

4. NLP Technologies
- NLP resources
- NLP libraries
- NLP algorithms:
  - Rule-based methods
  - Statistical methods

NLP Resources
- dictionaries
- thesauri
- ontologies
- word embeddings
- corpora

Ontologies
- WordNet
- FrameNet
- VerbNet
- ConceptNet
- ImageNet
- BabelNet
- Serelex

Word Embeddings
Word2Vec (CBOW, Skip-Gram) / GloVe
W("man") = (0.2, -0.4, 0.7, ...)
W("woman") = (0.0, 0.6, -0.1, ...)
W("woman") − W("man") ≃ W("aunt") − W("uncle")
W("woman") − W("man") ≃ W("queen") − W("king")
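
The analogy relation above can be sketched with plain vector arithmetic and cosine similarity. The vectors below are toy 3-dimensional values: "man" and "woman" reuse the three components shown on the slide, while "king", "queen" and "apple" are invented so that the example works. Real embeddings have hundreds of dimensions and the relation holds only approximately.

```python
import math

# Toy vectors. "king", "queen" and "apple" are invented for illustration.
W = {
    "man": (0.2, -0.4, 0.7),
    "woman": (0.0, 0.6, -0.1),
    "king": (0.5, -0.3, 0.9),
    "queen": (0.3, 0.7, 0.1),
    "apple": (0.9, 0.1, 0.0),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def analogy(a, b, c):
    """Return the word whose vector is closest to W[b] - W[a] + W[c]."""
    target = tuple(y - x + z for x, y, z in zip(W[a], W[b], W[c]))
    candidates = [w for w in W if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(W[w], target))


# "man" is to "woman" as "king" is to ...?
print(analogy("man", "woman", "king"))
```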

Word Embeddings

Corpora
By annotation:
- NER-annotated
- POS-tagged
- Treebanks
- PropBanks
- Coreference
- Semantics
- etc.
Examples:
- Brown
- Gutenberg
- OntoNotes 5
- English Web Treebank
- Penn Treebank
- QuestionBank
- Open American National Corpus

NLP Libraries: OpenNLP
- Sentence Detector
- Tokenizer
- Name Finder
- POS Tagger
- Parser
- Document Categorizer

NLP Libraries: Stanford CoreNLP
- Tokenization
- Sentence Splitting
- POS Tagging
- Lemmatization
- Syntactic Parsing
- Coreference Resolution

NLP Libraries: NLTK
- Classification
- Tokenization
- Stemming
- POS tagging
- Parsing
- Semantic Reasoning
- Corpora

NLP Algorithms
1. Tokenization
- regular expressions/rules
- Machine Learning
2. Sentence Segmentation
- regular expressions/rules
- Decision Trees/Neural Networks/MaxEnt
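
A sketch of the rule-based option for tokenization: one regular expression with three alternatives, tried in order. The pattern is a deliberately small assumption; a production tokenizer needs many more rules (URLs, hyphenation, abbreviations and so on).

```python
import re

TOKEN = re.compile(
    r"""
    \$?\d+(?:\.\d+)?     # numbers, with optional dollar sign and decimals
  | \w+(?:'\w+)?         # words, optionally with an internal apostrophe
  | [^\w\s]              # any other single non-space character
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into number, word, and punctuation tokens."""
    return TOKEN.findall(text)


print(tokenize("I'm glad I'm a man, and so is Lola."))
print(tokenize("a suit for $1.5 B"))
```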

NLP Algorithms
3. POS Tagging
- rules
- Hidden Markov Models/MaxEnt
4. Named-Entity Recognition
- patterns on regexps and POS tags
- MaxEnt
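
For the Hidden Markov Model option, tagging is usually decoded with the Viterbi algorithm. The sketch below assumes a first-order HMM; the tag set and all probabilities are invented toy numbers, and unseen words receive a small floor probability instead of proper smoothing.

```python
import math

TAGS = ["DET", "NOUN", "VERB"]

# Toy model parameters, invented for illustration only.
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT_P = {
    "DET": {"a": 0.5, "the": 0.5},
    "NOUN": {"apple": 0.4, "blow": 0.4, "deal": 0.2},
    "VERB": {"dealt": 0.5, "blow": 0.3, "deal": 0.2},
}


def viterbi(words, floor=1e-6):
    """Return the most probable tag sequence for words under the toy HMM."""
    if not words:
        return []
    # best[i][t] = (log-prob of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in TAGS:
        emit = EMIT_P[t].get(words[0], floor)
        best[0][t] = (math.log(START_P[t]) + math.log(emit), None)
    for i in range(1, len(words)):
        best.append({})
        for t in TAGS:
            emit = math.log(EMIT_P[t].get(words[i], floor))
            prev, score = max(
                ((p, best[i - 1][p][0] + math.log(TRANS_P[p][t])) for p in TAGS),
                key=lambda pair: pair[1],
            )
            best[i][t] = (score + emit, prev)
    # Follow the back-pointers from the best final tag.
    path = [max(TAGS, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))


print(viterbi(["apple", "dealt", "a", "blow"]))
```

In this toy model "blow" can be emitted by both NOUN and VERB; the transition from DET is what tips the decision toward NOUN.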

NLP Algorithms
5. Syntactic parsing (constituency)
- rule-based
- top-down/bottom-up/chart parsers
6. Syntactic parsing (dependency)
- rule-based
- transition-based/graph-based
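
As one example of a bottom-up chart parser, here is a sketch of a CKY recognizer. It only answers whether a sentence is derivable and does not build the tree. The grammar (in Chomsky normal form) and the lexicon are tiny invented fragments, and treating "apple" directly as an NP is a shortcut.

```python
from itertools import product

# Toy grammar: (left child, right child) -> possible parent labels.
BINARY = {
    ("NP", "VP"): {"S"},
    ("DT", "NN"): {"NP"},
    ("VBD", "NP"): {"VP"},
}
# Toy lexicon: word -> possible labels.
LEXICON = {
    "apple": {"NP"},
    "dealt": {"VBD"},
    "a": {"DT"},
    "blow": {"NN"},
}


def cky_recognize(words, start="S"):
    """Return True if the toy grammar derives words from the start symbol."""
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds every label that can span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        chart[i][i + 1] = set(LEXICON.get(word.lower(), ()))
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # every split point inside the span
                for left, right in product(chart[i][k], chart[k][j]):
                    chart[i][j] |= BINARY.get((left, right), set())
    return start in chart[0][n]


print(cky_recognize("Apple dealt a blow".split()))
print(cky_recognize("dealt Apple a blow".split()))
```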

NLP Algorithms
7. Coreference Resolution
- rule-based
- Machine Learning
8. Word Sense Disambiguation
- frames
- ontologies/word embeddings

NLP Algorithms
Do you spot any potential problems?
Mr. Jack, a high-flying company, filed a suit for $1.5 B against Wilson and Sons and their investor, SecondCompany & Co. The company will take them to court on May 16, 2014.
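
One problem this text likely illustrates is sentence segmentation: periods in "Mr." and "Co." are not always sentence ends. The sketch below contrasts a naive splitter with a rule that consults an abbreviation list. The list is a tiny hypothetical one, and the other traps in the text (a company named "Mr. Jack", "Wilson and Sons" as a single entity, the reference of "them") are not addressed.

```python
import re

TEXT = (
    "Mr. Jack, a high-flying company, filed a suit for $1.5 B against "
    "Wilson and Sons and their investor, SecondCompany & Co. "
    "The company will take them to court on May 16, 2014."
)


def naive_split(text):
    """Split after every period that is followed by whitespace."""
    return re.split(r"(?<=\.)\s+", text)


# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}


def rule_based_split(text):
    """End a sentence at a period-final token unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token.endswith(".") and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(naive_split(TEXT))
print(rule_based_split(TEXT))
```

The rule-based version gets this text right only because "Co." happens to end the first sentence; adding "Co." to the abbreviation list would break it, which is one reason statistical segmenters exist.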

NLP Algorithms
Your main task - R&D:
- research the task
- develop an algorithm
- choose a stack of technologies
- implement it

Resources
Courses on Coursera:
- NLP by D. Jurafsky and C. Manning (Stanford)
- NLP by Michael Collins (Columbia)
- NLP by Dragomir Radev (Michigan)

Resources
Books:
- Speech and Language Processing (D. Jurafsky and J. Martin)
- NLP with Python (S. Bird)
- Foundations of Statistical NLP (C. Manning and H. Schütze)
- Handbook of Natural Language Processing

Resources
Groups on LinkedIn:
- Computational Linguistics
- Natural Language Processing People
- Natural Language Careers
- Grammar Geeks and Word Nerds =)

Resources
Conferences in Ukraine:
- AI Ukraine (Kharkiv)
- AI&BigData Lab (Odesa)
- IT Trends (Kremenchuk)

Resources
Conferences abroad:
- ACL (https://aclweb.org/) and NAACL
- IWPT
- TLT
- DepLing
- CoNLL

Companies in Ukraine

Companies outside Ukraine
A good place to look for jobs: http://linguistlist.org

Thank you and see you at the interview ;)
Questions?