- Number of slides: 50
Vietnamese Language Processing: Issues and Challenges
Ho Tu Bao
Vietnamese Academy of Science and Technology
Japan Advanced Institute of Science and Technology
(Keynote talk at the international conference IEEE RIVF 2009)
IEEE RIVF'09, 16 July 2009
Institute of Information Technology, Vietnamese Academy of Science & Technology; Japan Advanced Institute of Science and Technology
Outline
- Problems and progress in natural language processing
- Issues and challenges in Vietnamese language processing
- Our VLSP project (Vietnamese Language and Speech Processing)
Natural language processing?
- Alan Turing: proposed to consider the question "Can machines think?"
- Psychological view: understand human language processing
- Engineering view: build systems to process language
More languages than you might have thought
6912 distinct languages (230 spoken in Europe, 2197 in Asia)
- We meet here today to talk about processing of Vietnamese language and speech.
- Aujourd'hui nous nous réunissons ici pour discuter du traitement de la langue et de la parole vietnamiennes.
- Сегодня мы встречаемся здесь, чтобы говорить об обработке вьетнамского языка и речи.
- 今日我々はここに集まりベトナム語と発言処理について議論します。
- 오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여 의론하겠습니다.
- ﺃﻨﻨﺎ ﻧﺠﺘﻤﻊ ﻫﻨﺎ ﺍﻟﻴﻮﻡ ﻟﻨﺘﺤﺪﺙ ﻋﻦ ﺍﻟﻠﻐﺔ ﺍﻟﻔﻴﺘﻨﺎﻣﻴﺔ ﻭ ﻟﻐﺔ ﺍﻟﺨﻄﺎﺏ
- Hôm nay chúng ta gặp nhau ở đây để nói về xử lý ngôn ngữ và tiếng nói tiếng Việt.
54 ethnic groups in Vietnam
Language groups:
- Mon-Khmer
- Tay-Thai
- Tibeto-Burman
- Malayo-Polynesian
- Kadai
- Mong-Dao
- Han
English websites and Vietnamese?
Translation and machine translation
- Translate the following sentence into English: "Ông già đi nhanh quá"
- Many possible translations (ambiguity of language):
  1. [Ông già] [đi] [nhanh quá]: "The old man walks too fast" / "My father walks too fast"
  2. [Ông già] [đi] [nhanh quá]: "The old man died too fast" / "My father died too fast"
  3. [Ông] [già đi] [nhanh quá]: "You get old too fast" / "Grandfather gets old too fast"
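The two bracketings of "Ông già đi nhanh quá" can be reproduced mechanically. The sketch below is a minimal illustration, not the project's segmenter: it assumes a tiny hand-made lexicon (`LEXICON` is hypothetical) and enumerates every way of splitting the syllable sequence into lexicon entries.

```python
import unicodedata


def nfc(text):
    # Normalize so that visually identical Vietnamese strings compare equal.
    return unicodedata.normalize("NFC", text)


# Hypothetical toy lexicon; a real system would load a full dictionary.
LEXICON = {nfc(w) for w in ["ông", "ông già", "già đi", "đi", "nhanh quá"]}


def segmentations(syllables, lexicon, max_len=3):
    """Return every split of `syllables` into words found in `lexicon`."""
    if not syllables:
        return [[]]
    results = []
    for n in range(1, min(max_len, len(syllables)) + 1):
        word = " ".join(syllables[:n])
        if word in lexicon:
            for rest in segmentations(syllables[n:], lexicon, max_len):
                results.append([word] + rest)
    return results


sentence = nfc("ông già đi nhanh quá")
for seg in segmentations(sentence.split(), LEXICON):
    print(" | ".join(seg))
```

With this lexicon the function finds two segmentations, one starting with "ông" and one with "ông già". Each segmentation still admits several readings (for example "walks" versus "died" for "đi"), so segmentation alone does not resolve the ambiguity.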
Two approaches to machine translation
- Linguistic rule-based machine translation
  - Words are translated using linguistic rules about the two languages and the correspondence (transfer) between them (morphology, syntax, etc.)
  - Requires understanding natural language
- Statistical machine translation (dominating)
  - Generates translations using statistical learning methods based on bilingual text corpora (statistically similar)
  - Requires large, high-quality bilingual text corpora
From text to the meaning
- Natural language processing (NLP) layers: lexical/morphological analysis, syntactic analysis, semantic analysis, discourse analysis
- Tasks: tagging text (POS tagging), shallow parsing (chunking), grammatical relation finding, named entity recognition, word sense disambiguation, reference resolution
- Example: "The woman will give Mary a book"
  - POS tagging: The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN
  - Chunking: [The/Det woman/NN]NP [will/MD give/VB]VP [Mary/NNP]NP [a/Det book/NN]NP
  - Relation finding: [The woman] [will give] [Mary] [a book], with labels such as subject and i-object
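To make the step from POS tags to chunks concrete, here is a small rule-based sketch over the tagged example sentence. It is only an illustration under simple assumptions: the tag-to-phrase table and the set of chunk-starting tags are hypothetical hand-written rules that cover this one sentence, whereas real chunkers are usually learned from annotated data.

```python
# Hypothetical rules: which phrase type each tag belongs to,
# and which tags always open a new chunk.
PHRASE_OF_TAG = {"Det": "NP", "NN": "NP", "NNP": "NP", "MD": "VP", "VB": "VP"}
CHUNK_STARTERS = {"Det", "NNP", "MD"}

TAGGED = "The/Det woman/NN will/MD give/VB Mary/NNP a/Det book/NN"


def parse_tagged(text):
    """Turn 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in text.split()]


def chunk(tagged):
    """Group consecutive tokens into (phrase_type, [words]) chunks."""
    chunks = []
    for word, tag in tagged:
        phrase = PHRASE_OF_TAG[tag]
        same_phrase = bool(chunks) and chunks[-1][0] == phrase
        if same_phrase and tag not in CHUNK_STARTERS:
            chunks[-1][1].append(word)
        else:
            chunks.append((phrase, [word]))
    return chunks


for phrase, words in chunk(parse_tagged(TAGGED)):
    print(f"[{' '.join(words)}]{phrase}")
```

On the example this yields the same four chunks as the slide: two-word NP, two-word VP, "Mary" as an NP, and "a book" as an NP.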
Archeology of natural language processing (Hovy, COLING 2004)
- Decades covered: 1960s, 1970s, 1980s, 1990s–2000s
- Areas: natural language processing; information retrieval and information extraction
- Representation transformation: finite state machines (FSMs) and augmented transition networks (ATNs)
- Representation beyond the word level: lexical features, tree structures, networks
- Standard resources and tasks: Penn Treebank, WordNet, MUC
- Kernel (vector) spaces: clustering, information retrieval (IR)
- Statistical learning: algorithms, evaluation, corpora
- Trainable FSMs, trainable parsers
ML and statistical methods in NLP
[Figure: work using some ML/Stat vs. no ML/Stat] (pages 11-12 from Marie Claire, ECML/PKDD 2005)
Recent learning methods in NLP
NLP R&D in other countries
- Large investment from government and industry
  - National Institute of Standards and Technology (NIST), ATR, NICT
  - USA, China, Singapore, etc.
- NLP & CL organizations
  - ACL (Association for Computational Linguistics)
  - NAACL (North American Chapter of the ACL)
  - EACL (European Chapter of the ACL)
  - PACLIC (Pacific Asia Conference on Language, Information and Computation)
  - ICCL (International Committee on Computational Linguistics)
- Many NLP people
- Rich resources and tools (e.g., Linguistic Data Consortium)
Vietnamese language
- The Vietnamese language was established a long time ago
- Chinese characters were used for a long time
- Chu Nom (字喃), a writing system unique to Vietnam, dates from the 10th century
- A Romanized script, Quốc Ngữ, has been used since the beginning of the 20th century
- Example: "Nam quốc sơn hà Nam đế cư" (南国山河南帝居): "Over Mountains and Rivers of the South, Reigns the Emperor of the South"
Vietnamese language
- Vietnamese is an analytic language (words are composed of a single morpheme)
  - ngôn ngữ (analytic), lang-gua-ge (synthetic), 言語 (synthetic)
- Vietnamese does not use morphological marking of case, gender, number, and tense
  - Trưa nay tôi ăn ba thằng tôm
- Syntax conforms to Subject-Verb-Object word order
  - Cái thằng chồng em nó chẳng ra gì.
  - Gloss: FOCUS CLASSIFIER husband I he not turn.out what
  - "That husband of mine, he is good for nothing."
Vietnamese Language and Speech Processing
- Most work aims at machine translation or other tasks at the top layers, with very little basic work at the lower layers
- Work is done in isolation, with no inheritance: people have to start from scratch, without sharing or collaboration, and there are no standards
- Almost no resources and tools for VLSP
  - Japanese example, このひとことで元気になった: many tools such as ChaSen, YamCha, ...
  - Vietnamese: no tool to do such a simple task
VLSP national project (KC01.05/06-10), 5.2007-8.2009
- National project with eleven active VLSP research groups from Ho Chi Minh City to Hanoi, with two objectives:
  - Building VLSP infrastructure, especially indispensable resources and tools for VLSP development
  - Building and developing several typical VLSP products for public end users
- Layers shown: pragmatics (speech, text and Web data mining); natural language processing methods; tools, corpora, resources
Project target products
- SP 1: Application-oriented systems based on Vietnamese speech recognition & synthesis
- SP 2: Speech recognition system with large vocabulary
- SP 3: English-Vietnamese translation system
- SP 4: IREST, Internet use support system
- SP 5: Vietnamese spelling checker
- SP 6.1: Corpora for speech recognition
- SP 6.2: Corpora for speech synthesis
- SP 6.3: Corpora for specific words
- SP 7.1: English-Vietnamese dictionary
- SP 7.2: Viet dictionary
- SP 7.3: Vietnamese treebank
- SP 7.4: E-V corpora of aligned sentences
- SP 8.1: Speech analysis tools
- SP 8.2: Vietnamese word segmentation
- SP 8.3: Vietnamese POS tagger
- SP 8.4: Vietnamese chunker
- SP 8.5: Vietnamese syntax analyser
To be standard for long-term development
VLSP website: open soon to the public
SP 7.2: Viet Machine Readable Dictionary
- Study other MRDs
  - FrameNet (UC Berkeley)
  - TCL's Computational Lexicon
  - EDR (Japanese Electronic Dictionary; Institute of Electronic Dictionary, 1980s-1990s)
- Build a model of VCL (Vietnamese Computational Lexicon)
  - The macroscopic structure
  - The microscopic structure
  - The content and VCL structure
  - Tool and VCL construction
SP 7.2: Viet Machine Readable Dictionary
- Microscopic structure
  - Morphological information
  - Syntactic information, e.g., two kinds of verb:
    - Sub-V-Obj: "Lợn ăn rau" (pigs eat vegetables), "Xe ăn xăng" (the vehicle consumes petrol)
    - Sub-V: "Chim bay" (birds fly), "Chó chạy" (dogs run), "bé ngủ" (the baby sleeps), "bé đang ngủ" (the baby is sleeping)
  - Semantic information: logic and semantic constraints, definition, context
- VCL content and structure: 35,000 commonly used words in modern Vietnamese
- Tool for the construction: develop a tool for building VCL with XML representation
SP 7.3: Viet Treebank
- A treebank, or parsed corpus, is a text corpus in which each sentence has been parsed, i.e., annotated with syntactic structure
  - English: Penn Treebank (4.5M words) and many others
  - Chinese: Penn Chinese Treebank (507K words), Sinica Treebank (61,087 trees, 361K words)
  - Japanese: ATR Dependency corpus, Kyoto Text Corpus, Verbmobil treebanks
  - Korean: Korean Treebank (5,078 trees, 54K words)
- Viet Treebank (7.2007-5.2009): 10,000 trees, 1,000 morphemes
- [Figure: parse tree for "Ông già đi nhanh quá" with node labels S, NP, VP, V, P, T]
- Viet Treebank supports: Viet word segmenter, Viet POS tagger, Viet chunker, Viet syntactic parser, leading to Viet machine translation, info extraction, etc.
SP 7.3: Viet Treebank
- Study various existing treebanks, modern theories of syntax, and the Vietnamese language
- Build guidelines for word segmentation, POS, and syntax
  - "Nhà cửa bề bộn quá" ("the house is in a jumble") vs. "Ở nhà cửa ngõ chẳng đóng gì cả" ("at home the door is not closed")
  - "Cô ấy giữ gìn sắc đẹp" ("she keeps her beauty") vs. "Bức này màu sắc đẹp hơn" ("this painting has better color")
- Build the tools
- Labeling: agreement between labelers (95%)
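The slide does not say how agreement between labelers was measured. One simple possibility, shown here purely as an assumption, is token-level percentage agreement: the fraction of positions where two annotators assign the same label. The label sequences below are invented for illustration, using B (begins a word) and I (inside a word) on the eight syllables of "Ở nhà cửa ngõ chẳng đóng gì cả".

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of positions on which two annotators chose the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same tokens")
    if not labels_a:
        raise ValueError("cannot compute agreement on an empty sequence")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical annotations of the same 8 syllables.
# Annotator A segments "nhà | cửa ngõ"; annotator B segments "nhà cửa | ngõ".
annotator_a = ["B", "B", "B", "I", "B", "B", "B", "B"]
annotator_b = ["B", "B", "I", "B", "B", "B", "B", "B"]

print(percent_agreement(annotator_a, annotator_b))
```

Raw percentage agreement does not correct for agreement expected by chance; chance-corrected measures such as Cohen's kappa are a common alternative.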
SP 7.4: English-Vietnamese parallel corpus
- Pairs of corresponding sentences in English and Vietnamese (size & quality)
- Easy for many languages (LDC: English-French corpus of 2.8M sentences, sourced from the Canadian Parliament)
- No publicly available parallel corpus for Vietnamese
- Building corpora needs time, money and human resources (boring job)

Parallel corpus (L1-L2) | Sentences | L1 words | L2 words
German-English | 1,313,096 | 34,700,362 | 36,663,083
Greek-English | 662,090 | 18,834,758 | 18,827,241
Spanish-English | 1,304,116 | 37,870,751 | 36,429,274
Finnish-English | 1,257,720 | 24,895,790 | 34,802,617
French-English | 1,334,080 | 41,573,117 | 37,436,222
Italian-English | 1,251,315 | 36,411,166 | 36,510,033
Dutch-English | 1,326,412 | 36,784,168 | 36,690,392
Portuguese-English | 1,287,757 | 37,342,426 | 36,355,907
Swedish-English | 1,164,536 | 28,882,142 | 32,053,628
(http://www.euromatrix.net)
SP 7.4: English-Vietnamese parallel corpus
- Our corpus in the first phase: 100,000 sentence pairs
  - Manual and semiautomatic collection of parallel text
  - Automatic alignment
- Automatic cross-site parallel text discovery
  - Many Vietnamese news articles are translated from English sources on other web sites
Setting up the "standards" for VLSP
- Importance of "standards" in VLSP: choose an appropriate view from the different schools on the Vietnamese language
- Guide for word recognition and description: morphological, syntactic, semantic criteria
- Guide for constituent labeling: noun phrase, verb phrase, clause, etc.
- Guide for sentence splitting
- Others
- Challenge: standards for sustainable development
Example: Guideline for POS tagging
- 36 word labels in English, from Penn Treebank (1989)
- 30 word labels in Chinese, from Chinese Treebank (1998)
- 47 word labels in Thai, from Orchid corpus (1997)
- How many for Vietnamese?
VLSP tools for the public
- All the tools are built on the same view of words, label assignment, sentences, and resources
- Statistical and machine learning methods are used to build the tools from the corpora
- Tools and resources are to be released to the public
- Resources: SP 7.1 English-Vietnamese dictionary; SP 7.2 Viet dictionary; SP 7.3 Vietnamese treebank; SP 7.4 E-V corpora of aligned sentences
- Tools: SP 8.2 Vietnamese word segmentation; SP 8.3 Vietnamese POS tagger; SP 8.4 Vietnamese chunker; SP 8.5 Vietnamese syntax analyser
Using machine learning in creating tools
- Finite state machines: MEMMs and CRFs are emerging techniques in NLP & machine learning
- CRF dynamic programming (L is the set of labels, T is the sequence length)
  - O(|L|^2 T): first-order
  - O(|L|^3 T): second-order
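The first-order cost O(|L|^2 T) can be read directly off a Viterbi decoder: for each of the T positions, every label is compared against every possible previous label. The sketch below is a generic illustration in plain Python, not the project's CRF code; the score tables are hypothetical log-domain numbers rather than learned weights.

```python
def viterbi(emission, transition):
    """Best label path for a first-order linear-chain model.

    emission[t][y]      : score of label y at position t
    transition[p][y]    : score of moving from label p to label y
    Runs in O(|L|^2 * T): T positions x |L| labels x |L| previous labels.
    """
    num_labels = len(transition)
    score = list(emission[0])
    backpointers = []
    for t in range(1, len(emission)):
        new_score, pointers = [], []
        for y in range(num_labels):
            best_prev = max(
                range(num_labels), key=lambda p: score[p] + transition[p][y]
            )
            pointers.append(best_prev)
            new_score.append(
                score[best_prev] + transition[best_prev][y] + emission[t][y]
            )
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=lambda y: score[y])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical scores for 3 positions and 2 labels (0 and 1).
emission = [[2, 0], [0, 1], [0, 2]]
transition = [[0, 1], [1, 0]]
print(viterbi(emission, transition))
```

A second-order model conditions on the two previous labels, which adds one more factor of |L| to the inner loop and gives the O(|L|^3 T) figure.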
SP 8.4: Statistical learning in chunking
- Training: data → CRFs / online learning → chunking models
- Decoding: Vietnamese sentence "Anh ấy đang ăn cơm" → output NP [anh ấy] VP [đang ăn cơm]
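Sequence models such as CRFs usually see chunking as per-token labeling. A common encoding, assumed here because the slide does not name one, is BIO: B-X opens a chunk of type X and I-X continues it. The sketch converts the slide's output for "Anh ấy đang ăn cơm" into BIO tags and back; treating each syllable as one token is also an assumption made for illustration.

```python
def chunks_to_bio(chunks):
    """Convert [(label, [tokens])] into [(token, 'B-label' or 'I-label')]."""
    tagged = []
    for label, tokens in chunks:
        for i, token in enumerate(tokens):
            prefix = "B" if i == 0 else "I"
            tagged.append((token, f"{prefix}-{label}"))
    return tagged


def bio_to_chunks(tagged):
    """Inverse of chunks_to_bio: rebuild chunks from BIO-tagged tokens."""
    chunks = []
    for token, tag in tagged:
        prefix, label = tag.split("-", 1)
        if prefix == "B" or not chunks or chunks[-1][0] != label:
            chunks.append((label, [token]))
        else:
            chunks[-1][1].append(token)
    return chunks


# Chunking output from the slide: NP [anh ấy] VP [đang ăn cơm]
slide_chunks = [("NP", ["anh", "ấy"]), ("VP", ["đang", "ăn", "cơm"])]

for token, tag in chunks_to_bio(slide_chunks):
    print(token, tag)
```

A trained model would predict the tag column from the tokens; `bio_to_chunks` then recovers the bracketed output.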
SP 8.5: HPSG grammar for syntax analysis
- HPSG (Head-Driven Phrase Structure Grammar) allows us to represent the relations between words and the constraints between syntax and semantics
- Diagram components: text, word segmentation module, word feature dictionary, analysis module (improved Earley algorithm), rules with constraints, attribute elimination, syntax tree
SP 3: Machine translation and EVSMT 1.0
- SMT: statistical machine translation (slides 31-32 adapted from the tutorial on SMT by K. Knight and P. Koehn)
- Vietnamese-English bilingual text → statistical analysis; English text → statistical analysis
- Vietnamese input: "Ông già đi nhanh quá"
- Broken English candidates: "Died the old man too fast", "The old man too fast died", "The old man died too fast", "Old man died the too fast"
- English output: "The old man died too fast"
SP 3: Machine translation and EVSMT 1.0
- SMT: statistical machine translation
- Vietnamese-English bilingual text → statistical analysis → translation model, P(v|e)
- English text → statistical analysis → language model, P(e)
- Vietnamese → translation model → broken English → language model → English
- Decoding algorithm: argmax over e of P(v|e) x P(e)
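As a toy illustration of argmax P(v|e) x P(e), the sketch below reranks the four "broken English" candidates for "Ông già đi nhanh quá". Everything in it is an assumption made for the example: the three-sentence English corpus, the add-one smoothed bigram language model, and a translation model that simply returns the same score for every candidate (they all contain the same words), so the language model alone decides the word order. It is not how EVSMT or MOSES is implemented.

```python
import math
from collections import Counter

# Hypothetical monolingual English corpus for the language model.
CORPUS = [
    "the old man walks too fast",
    "my father died too fast",
    "the old man is here",
]
CANDIDATES = [
    "died the old man too fast",
    "the old man too fast died",
    "the old man died too fast",
    "old man died the too fast",
]


def tokens(sentence):
    return ["<s>"] + sentence.lower().split() + ["</s>"]


bigrams, contexts, vocab = Counter(), Counter(), set()
for line in CORPUS:
    toks = tokens(line)
    vocab.update(toks)
    for prev, cur in zip(toks, toks[1:]):
        bigrams[(prev, cur)] += 1
        contexts[prev] += 1


def lm_logprob(english):
    """log P(e) under an add-one smoothed bigram model."""
    toks = tokens(english)
    return sum(
        math.log((bigrams[(p, c)] + 1) / (contexts[p] + len(vocab)))
        for p, c in zip(toks, toks[1:])
    )


def tm_logprob(vietnamese, english):
    """log P(v|e); a constant placeholder standing in for a real model."""
    return math.log(0.25)


def decode(vietnamese, candidates):
    """Pick the candidate maximizing log P(v|e) + log P(e)."""
    return max(
        candidates, key=lambda e: tm_logprob(vietnamese, e) + lm_logprob(e)
    )


print(decode("Ông già đi nhanh quá", CANDIDATES))
```

A real decoder does not receive a ready-made candidate list; it searches the space of translations, which is why the slide calls decoding an algorithm of its own.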
SP 3: Machine translation and EVSMT 1.0
- SMT core: English sentence → decoder (search problem; MOSES) → Vietnamese sentence
  - Translation model (phrase-based)
  - Language model
  - Tools: SRILM, GIZA++, MOSES, MERT
- Pre-processing of the Vietnamese-English parallel corpus and the Vietnamese corpus
  - Standardization
  - Word segmentation (VNsegmenter)
  - POS tagger (CRF Postagger, VnQtag)
  - Morphological analyser (morpha)
- SMT resource processing: corpus collecting and building
  - Raw materials (documents, books, ...)
  - Automatic extraction of parallel text from the Web; Web crawler
  - Pre-processing (sentence splitter, tokenizer, etc.)
  - Sentence alignment tools
- Issues in Vietnamese SMT: corpus building, pre-processing, language modeling, translation model, decoder, others
Google: English-Vietnamese translation, 26.9.08 (translate.google.com, 35 languages)
Machine translation issues and challenges
- SMT major difficulties: word choice, word order, tense and aspect, pronouns, idioms
- Target: improve phrase-based SMT in two aspects, word order and word choice
- Combination of tree-to-string SMT and phrase-based SMT (N. P. Thai et al., Machine Translation, Vol. 20, No. 3 (2006); IJCPOL, Vol. 20, No. 2 (2007))
- Focus on translating long and complex sentences by introducing CRF-based clause splitting and chunking parsing (N. V. Vinh, IJCPOL, 2009)
IREST: Support for exploiting the Internet (Information Retrieval, Extraction, Summarization, Translation)
- Input: different types of query entries
- Search the Internet for Web pages having information related to the query
  - Output: list of websites in English
  - Translate the list of retrieved Web pages into Vietnamese: list of websites in Vietnamese
- Check each website
  - Selected website in English
  - Translate the selected website into Vietnamese: Web page translated into Vietnamese
- Extract news and information related to the query
  - Output: text related to the query
- Summarize the text for its gist
  - Output: summarized text in English
  - Translate the gist into Vietnamese: summary translated into Vietnamese
Vietnamese named entities on the web (Ho Chi Minh University of Technology)
Sentence reduction by SVM (Minh et al., COLING 2004, J. CPOL'05, ACM Trans. ALP'05, IEICE'06)
- Training: input corpus → parsing all sentences → tree set → generating training data (set of contexts and actions) → SVM learning → rules list
- Reduction: long sentence → parsing → large parsed tree → transforming (CSTACK, RSTACK) → small parsed tree → generating → short sentence
- Actions: SHIFT, REDUCE, DROP, ASSIGN TYPE, RESTORE
- Transforming a tree is a sequence of actions
- A rule shows the relation between a context and an action
- Example: {a, b, c, d, e} → {b, e, a}
Emerging trend detection (Le Minh Hoang, KSS journal, 2006)
- ETD: detecting topics that are growing in interest and utility over time from a corpus
- Model: M = (D, E, T, TR, TI, TV, f, g)
- Topic representation: which features are necessary to characterize topics (interest and utility over time)?
- Topic identification: how to extract these features from the corpus for each topic?
- Topic verification: how to define interest and utility functions and evaluate their increase over time?
ETD: Topic representation
- ETD: detecting topics that are growing in interest and utility over time from a corpus
- Topic representation: which features are necessary to characterize topics (interest and utility over time)?
- Define 6 types of citation
- [Figure: neural network]
ETD: Topic identification
- ETD: detecting topics that are growing in interest and utility over time from a corpus
- Topic identification: how to extract these features from the corpus for each topic?
  - Build 6 models corresponding to the 6 types of citation
  - Use HMM, MEMM, and CRF to extract features
ETD: Topic verification
- ETD: detecting topics that are growing in interest and utility over time from a corpus
- Topic verification: how to define interest and utility functions and evaluate their increase over time?
- Measures: the speed of growth at x = k and the acceleration of growth at x = k
- Criterion shown: speed > 0, acceleration > 0
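The formulas for speed and acceleration did not survive on this slide. One natural reading, offered only as an assumption, is to take discrete first and second differences of a topic's interest score over time periods. The sketch below uses backward differences on a hypothetical list of per-period counts and applies the criterion speed > 0 and acceleration > 0.

```python
def speed(counts, k):
    """First backward difference at period k (assumed definition)."""
    return counts[k] - counts[k - 1]


def acceleration(counts, k):
    """Second backward difference at period k (assumed definition)."""
    return counts[k] - 2 * counts[k - 1] + counts[k - 2]


def is_emerging(counts, k):
    """Criterion from the slide: speed > 0 and acceleration > 0 at x = k."""
    if k < 2 or k >= len(counts):
        raise ValueError("k needs two earlier periods and must be in range")
    return speed(counts, k) > 0 and acceleration(counts, k) > 0


# Hypothetical per-period interest scores for two topics.
accelerating_topic = [2, 3, 5, 9]
saturating_topic = [2, 5, 7, 8]

print(is_emerging(accelerating_topic, 3), is_emerging(saturating_topic, 3))
```

The second topic is still growing, but its growth is slowing down, so it fails the acceleration test.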
ETD: Evaluation
- ETD: detecting topics that are growing in interest and utility over time from a corpus
Conclusion
- Complete the first phase of the VLSP infrastructure
- Apply advanced technologies and experience from the processing of other languages, especially statistical learning from large corpora
- Work in collaboration and sharing
- Look for investment from the government and industry for the next phase, and for collaboration
Acknowledgements
- The national project KC01.05/06-10
- Project members: Luong Chi Mai, Ngo Cao Son, Ho Bao Quoc, Dinh Dien, Cao Hoang Tru, Nguyen Thi Minh Huyen, Vu Luong, Le Thanh Huong, Nguyen Phuong Thai, Nguyen Le Minh, Le Minh Hoang, Phan Xuan Hieu, Pham Ngoc Khanh, Ha Thanh Le, Nguyen Phuong Thao, Nguyen Viet Cuong, VLSP forum, among others
- [Photo: VLSP meeting, 21-25 Nov. 2005, JAIST]
Korean and Arabic
- Korean: 오늘 우리는 여기에 모여서 베트남어와 발언처리에 대하여 의론하겠습니다.
  - Transliteration: Ônưr wulinân iôkiê môiôshô Vietnamơwa balântsơriê têhaiô ưi rônhagếtsưnnita
- Arabic: ﺃﻨﻨﺎ ﻧﺠﺘﻤﻊ ﻫﻨﺎ ﺍﻟﻴﻮﻡ ﻟﻨﺘﺤﺪﺙ ﻋﻦ ﺍﻟﻠﻐﺔ ﺍﻟﻔﻴﺘﻨﺎﻣﻴﺔ ﻭ ﻟﻐﺔ ﺍﻟﺨﻄﺎﺏ
  - Transliteration: enna ngtma hena alyom lenthds an alloga alvitnamya wa logh alkhytab
  - Transliteration: Inana nagtama huna alyom linatahades an: alloga alvitnemâyơ ôe loga alkhytab
- Japanese (transliteration): Kyou, ware wa kokoni atsumari, Betonamu-go to speech shori ni tsuite Giron shimasu
Search for parallel documents
- Observation: many Vietnamese news articles are translated from English sources on other web sites
- Methodology
  - Make use of a search engine to find English candidates
  - Queries are created from:
    - Posted date
    - News source's URL
    - Translation-independent data (TID): text data unchanged during the translation process, e.g., terms, named entities, numbers
Framework
- Queries are generated and executed from high ranks to low ranks
- Filtering
  - Length-based
  - TID-based
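The two filters can be sketched as follows. This is a rough illustration under stated assumptions, not the project's implementation: TID is approximated by a crude regular expression (numbers and ASCII-capitalized tokens), and the length-ratio bounds and overlap threshold are hypothetical values chosen for the example.

```python
import re

# Crude TID approximation (assumption): numbers and capitalized ASCII tokens.
TID_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|[A-Z][A-Za-z0-9]+")


def extract_tid(text):
    """Collect tokens likely to survive translation unchanged."""
    return set(TID_PATTERN.findall(text))


def length_ok(vi_text, en_text, low=0.5, high=2.0):
    """Length-based filter: word-count ratio must fall within [low, high]."""
    ratio = len(vi_text.split()) / max(1, len(en_text.split()))
    return low <= ratio <= high


def tid_overlap(vi_text, en_text):
    """TID-based score: Jaccard overlap of the two TID sets."""
    vi, en = extract_tid(vi_text), extract_tid(en_text)
    return len(vi & en) / max(1, len(vi | en))


def filter_candidates(vi_text, candidates, min_overlap=0.3):
    """Keep English candidates that pass both filters."""
    return [
        c
        for c in candidates
        if length_ok(vi_text, c) and tid_overlap(vi_text, c) >= min_overlap
    ]


vi_news = "Google ra mắt Chrome OS vào năm 2010"
en_candidates = [
    "Google will launch Chrome OS in 2010",
    "Microsoft reports earnings for 2009",
]
print(filter_candidates(vi_news, en_candidates))
```

Sentence-initial English words such as "The" would also match this pattern, so a practical system would need a more careful notion of terms and named entities.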