
The True Identity of a Computational Linguist.pptx
- Number of slides: 56
The True Identity of a Computational Linguist. Mariana Romanyshyn, Computational Linguist, Tech Lead at Grammarly Inc.
What do comp. linguists do? - Machine Translation - Question-Answering Systems - Sentiment Analysis - Error Correction - Text Summarization - Search Engines - News Ranking - ...
Competencies 1. Basic tech skills 2. Linguistics 3. Programming 4. NLP technologies
1. Basic tech skills - Regular expressions - Shell commands - Text editors - Logic - Version control systems
Regular expressions - great for matching strings - good for parsing structured data - really, really cool ;)
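Both uses named on this slide can be sketched with Python's built-in `re` module. The sample sentence and the `DATE` pattern are illustrative assumptions, not a general-purpose date parser.

```python
import re

# Illustrative input text.
text = "The company will take them to court on May 16, 2014."

# Matching strings: find every capitalized word.
capitalized = re.findall(r"\b[A-Z][a-z]+\b", text)

# Parsing structured data: pull month, day and year out of a date.
DATE = re.compile(r"(?P<month>[A-Z][a-z]+) (?P<day>\d{1,2}), (?P<year>\d{4})")
match = DATE.search(text)
date = match.groupdict() if match else {}

print(capitalized)
print(date)
```

Named groups keep the parsed fields readable when a pattern grows beyond two or three parts.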
Shell - forget about Windows; meet Linux and OS X - creating/moving/removing files/directories - grep - cat/sort/wc/uniq - vim/nano - ssh
Text Editors Forget about MS Office; use Sublime Text 2, Emacs or at least Notepad++. Profit: - speed - macros - regular expressions - case conversions - line permutations - multiple cursors - marks - tabs - syntax highlighting - loads of cool modes’n’plugins + programming
Logic - truth tables - Euler diagrams - syllogism - laws of thought
Version control systems
2. Linguistics - Structural Linguistics - Pattern Recognition - Linguistic ambiguities - Research skills
Constituency Trees
Dependency Trees
Pattern Recognition: EC (Error Correction)
Pattern Recognition: SA (Sentiment Analysis) - Apple dealt a blow to Samsung. - Apple dealt Samsung a blow. - Samsung was dealt a blow by Apple. - A blow was dealt to Samsung by Apple.
Phonetic ambiguities
Morphological ambiguities
Syntactic ambiguities - I’m glad I’m a man, and so is Lola.
Semantic ambiguities
3. Programming - Scripting - OOP - Statistics - Scraping a web page - Algorithms - Data structures
Scripting Language: Perl, Python, Ruby, etc. Simple text processing tasks: - collect test cases from corpora - compile dictionaries - parse corpora or dictionaries - count any kind of statistics
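A typical "count any kind of statistics" script fits in a dozen lines. The sketch below counts word frequencies; the helper name `word_frequencies` and the three-sentence toy corpus are assumptions made for illustration.

```python
import re
from collections import Counter

def word_frequencies(lines):
    """Count lowercase word tokens across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Toy corpus; a real script would iterate over the lines of a corpus file.
corpus = [
    "Apple dealt a blow to Samsung.",
    "Apple dealt Samsung a blow.",
    "Samsung was dealt a blow by Apple.",
]
freq = word_frequencies(corpus)
print(freq.most_common(3))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly without loading the whole corpus into memory.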
Object-oriented programming Language: Java, C++, Python, etc. Tasks that require creating/using non-primitive classes: - patterns and all kinds of rules - working with parse trees - working with NLP libraries
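Working with parse trees is a natural fit for a small class. The `Tree` class below is a hypothetical minimal sketch, not the API of any NLP library, and the bracketing of the example sentence is one plausible hand-written analysis.

```python
class Tree:
    """A constituency-tree node: a label plus child nodes or word strings."""

    def __init__(self, label, children):
        self.label = label
        self.children = children

    def leaves(self):
        """Return the words of the tree, left to right."""
        words = []
        for child in self.children:
            if isinstance(child, Tree):
                words.extend(child.leaves())
            else:
                words.append(child)
        return words

    def height(self):
        """Number of node levels from this node down to the deepest word."""
        heights = [c.height() if isinstance(c, Tree) else 0 for c in self.children]
        return 1 + max(heights, default=0)

    def __str__(self):
        parts = [str(c) for c in self.children]
        return "(" + " ".join([self.label] + parts) + ")"

# Hand-written bracketing, for illustration only.
tree = Tree("S", [
    Tree("NP", [Tree("NNP", ["Apple"])]),
    Tree("VP", [
        Tree("VBD", ["dealt"]),
        Tree("NP", [Tree("DT", ["a"]), Tree("NN", ["blow"])]),
        Tree("PP", [Tree("TO", ["to"]), Tree("NP", [Tree("NNP", ["Samsung"])])]),
    ]),
])
print(tree)
print(tree.leaves())
```

Both `leaves` and `height` are recursive, which mirrors the recursive shape of the tree itself.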
Statistics Language: R, Python, Octave/MATLAB, etc. Statistical libraries: - matrices and vectors - Machine Learning - Neural Networks - Big Data
Scraping Techniques: XPath, Python (lxml, BeautifulSoup, requests), Chrome Scraper, Kimono Labs, etc. Tasks: - get dictionaries/corpora/test cases/any kind of information from the web :) - send requests to APIs P.S. Don’t parse web pages with regular expressions!
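To show why a real parser beats regular expressions here, the sketch below extracts links with the standard library's `html.parser`. It is swapped in for lxml or BeautifulSoup only to keep the example dependency-free; the `LinkExtractor` class and the HTML fragment are hypothetical.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside an anchor that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Hypothetical page fragment; a real script would first download the page.
html = """
<ul>
  <li><a href="/dict/a">Words in <b>A</b></a></li>
  <li><a class="x" href="/dict/b">Words in B</a></li>
</ul>
"""
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```

Nested tags and reordered attributes, as in the two list items above, are exactly the cases where a hand-written regular expression tends to break.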
Algorithms - asymptotic notation (big O) - recursion - search - sort - shortest path
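As one example from this list, a shortest path in an unweighted graph can be found with breadth-first search in O(V + E) time. The graph below is an invented toy hierarchy, and `shortest_path` is an illustrative helper name.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search over an unweighted graph given as an adjacency dict.

    Returns the nodes on one shortest path, or None if the goal is unreachable.
    """
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

# Toy hypernym-style graph; the links are invented for illustration.
graph = {
    "poodle": ["dog"],
    "dog": ["canine", "pet"],
    "canine": ["mammal"],
    "pet": ["animal"],
    "mammal": ["animal"],
    "animal": [],
}
path = shortest_path(graph, "poodle", "animal")
print(path)
```

The `parents` dictionary doubles as the visited set, so each node enters the queue at most once.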
Data Structures - arrays - linked lists - stacks - queues - heaps - hash tables - trees
4. NLP Technologies - NLP resources - NLP libraries - NLP algorithms: - Rule-based methods - Statistical methods
NLP Resources - dictionaries - thesauri - ontologies - word embeddings - corpora
Ontologies - WordNet - FrameNet - VerbNet - ConceptNet - ImageNet - BabelNet - Serelex
Word Embeddings - Word2Vec (CBOW, Skip-Gram) / GloVe. W("man") = (0.2, -0.4, 0.7, ...); W("woman") = (0.0, 0.6, -0.1, ...); W("woman") − W("man") ≃ W("aunt") − W("uncle"); W("woman") − W("man") ≃ W("queen") − W("king")
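The analogy property can be checked by comparing difference vectors with cosine similarity. In the sketch below the "man" and "woman" vectors reuse the first three components shown on the slide, while "king" and "queen" are made-up values chosen for illustration; trained embeddings have hundreds of dimensions.

```python
import math

def subtract(u, v):
    """Component-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors; "king" and "queen" are invented for this example.
W = {
    "man":   [0.2, -0.4, 0.7],
    "woman": [0.0, 0.6, -0.1],
    "king":  [0.5, -0.3, 0.9],
    "queen": [0.35, 0.65, 0.05],
}

gender_1 = subtract(W["woman"], W["man"])
gender_2 = subtract(W["queen"], W["king"])
similarity = cosine(gender_1, gender_2)
print(round(similarity, 3))
```

A similarity close to 1.0 means the two offsets point in nearly the same direction, which is what the ≃ on the slide expresses.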
Word Embeddings
Corpora - NER-annotated, POS-tagged, Treebanks, Propbanks, Coreference, Semantics, etc. - Brown, Gutenberg, OntoNotes 5, English Web Treebank, Penn Treebank, QuestionBank, Open American National Corpus
NLP Libraries: OpenNLP - Sentence Detector - Tokenizer - Name Finder - POS Tagger - Parser - Document Categorizer
NLP Libraries: Stanford CoreNLP - Tokenization - Sentence Splitting - POS Tagging - Lemmatization - Syntactic Parsing - Coreference Resolution
NLP Libraries: NLTK - Classification - Tokenization - Stemming - POS tagging - Parsing - Semantic Reasoning - Corpora
NLP Algorithms 1. Tokenization - regular expressions/rules - Machine Learning 2. Sentence Segmentation - regular expressions/rules - Decision Trees/Neural Networks/MaxEnt
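The rule-based variants of both tasks can be sketched with a regular expression and a small abbreviation list. Everything here is an assumption made for illustration: the `ABBREVIATIONS` set, the `TOKEN` pattern and the sample text, which is adapted from the example sentence used later in the deck.

```python
import re

# Tiny illustrative abbreviation list; real systems need far more entries.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Co.", "Inc."}

# Alternatives are tried in order: money, word plus period, word, any other symbol.
TOKEN = re.compile(r"\$\d+(?:\.\d+)?|[A-Za-z]+\.|\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into tokens, detaching a final period unless it ends a known abbreviation."""
    tokens = []
    for tok in TOKEN.findall(text):
        if tok.endswith(".") and len(tok) > 1 and tok not in ABBREVIATIONS:
            tokens.extend([tok[:-1], "."])
        else:
            tokens.append(tok)
    return tokens

def split_sentences(text):
    """Break after . ! or ? when whitespace and a capital letter follow, skipping abbreviations."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s+[A-Z])", text):
        words = text[start:m.end()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

text = ("Mr. Jack filed a suit against Wilson and Sons and their investor, "
        "SecondCompany & Co. The company will take them to court.")
print(tokenize("Mr. Jack filed a suit for $1.5 B."))
print(split_sentences(text))
```

On this text the abbreviation rule keeps "Mr." from ending a sentence, but it also hides the real boundary after "Co.", so the splitter returns a single sentence. Cases like this are where the statistical methods listed on the slide come in.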
NLP Algorithms 3. POS Tagging - rules - Hidden Markov Models/MaxEnt 4. Named-Entity Recognition - patterns on regexps and POS tags - MaxEnt
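For the Hidden Markov Model approach to POS tagging, decoding is usually done with the Viterbi algorithm. The sketch below is a minimal version under stated assumptions: a three-tag tagset and hand-picked probabilities that were invented for this example rather than estimated from a corpus.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for words under a first-order HMM."""
    # Each column maps a tag to (probability of the best path ending there, previous tag).
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            column[t] = max(
                (best[-1][p][0] * trans_p[p][t] * emit_p[t].get(word, 0.0), p)
                for p in tags
            )
        best.append(column)
    # Backtrack from the most probable final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Invented toy model: "blow" is ambiguous between NOUN and VERB.
tags = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.3, "NOUN": 0.6, "VERB": 0.1}
trans_p = {
    "DET":  {"DET": 0.0, "NOUN": 0.9, "VERB": 0.1},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET":  {"a": 1.0},
    "NOUN": {"apple": 0.5, "blow": 0.5},
    "VERB": {"dealt": 0.6, "blow": 0.4},
}
tagged = viterbi(["apple", "dealt", "a", "blow"], tags, start_p, trans_p, emit_p)
print(tagged)
```

In this toy model the determiner before "blow" makes the NOUN reading far more probable than the VERB reading, which is how the tagger resolves the ambiguity. Production taggers work with log probabilities to avoid underflow on long sentences.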
NLP Algorithms 5. Syntactic parsing (constituency) - rule-based - top-down/bottom-up/chart parsers 6. Syntactic parsing (dependency) - rule-based - transition-based/graph-based
NLP Algorithms 7. Coreference Resolution - rule-based - Machine Learning 8. Word Sense Disambiguation - frames - ontologies/word embeddings
NLP Algorithms Do you spot any potential problems? Mr. Jack, a high-flying company, filed a suit for $1.5 B against Wilson and Sons and their investor, SecondCompany & Co. The company will take them to court on May 16, 2014.
NLP Algorithms Your main task - R&D: - research the task - develop an algorithm - choose a stack of technologies - implement it
Resources Courses on Coursera: - NLP by D. Jurafsky and C. Manning (Stanford) - NLP by Michael Collins (Columbia) - NLP by Dragomir Radev (Michigan)
Resources Books: - Speech and Language Processing (D. Jurafsky and J. Martin) - NLP with Python (S. Bird) - Foundations of Statistical NLP (C. Manning and H. Schütze) - Handbook of Natural Language Processing
Resources Groups on LinkedIn: - Computational Linguistics - Natural Language Processing People - Natural Language Careers - Grammar Geeks and Word Nerds =)
Resources Conferences in Ukraine: - AI Ukraine (Kharkiv) - AI&BigData Lab (Odesa) - IT Trends (Kremenchuk)
Resources Conferences abroad: - ACL https://aclweb.org/ and NAACL - IWPT - TLT - DepLing - CoNLL
Companies in Ukraine
Companies outside Ukraine A good place to look for jobs: http://linguistlist.org
Thank you and see you at the interview ;) Questions?