


Introduction to Computational Linguistics
  Dipti Misra Sharma
  IIIT, Hyderabad <dipti@iiit.ac.in>
  IASNLP, 05-07-2012

Outline
  Background
  What is Computational Linguistics (CL)?
  What do computational linguists do?
  What are the issues in processing natural languages?
  What can we do with CL?
  Approaches in CL

Background
  Language is a means of communication
  Therefore, one can say it encodes what is communicated
  We apply the processes of
    Analysis (decoding) for understanding
    Synthesis (encoding) for expression (speaking)

What do we communicate?
  Information (Spain delivered a football masterclass at Euro 2012)
  Intention
  Emphasis/focus (Euro 2012 won by Spain / Spain bags Euro 2012)
    Introduces variation

How do we communicate?
  We use linguistic elements such as
    Words (country, park, the, is, Bandipur, of, as, and, considered, National, a, spot, beautiful, tourist, life, in, best, wild, sanctuaries, the, one)
    Arrangement of the words (sentences)
      Words are related to each other to provide the composite meaning
      (Bandipur National Park is a beautiful tourist spot and considered as one of the best wildlife sanctuaries in the country)

How do we communicate?
  Arrangement of sentences (discourse)
    Sentences or parts of sentences are related to each other to provide a cohesive meaning
    *(Considered as one of the best wildlife sanctuaries in the country. It is a national park covering an area of about 874 sq km. Bandipur National Park is a beautiful tourist spot.)
    (Bandipur National Park is a beautiful tourist spot and considered as one of the best wildlife sanctuaries in the country. It is a national park covering an area of about 874 sq km.)
  Languages differ in the way they organise information in these entities
  All of these interact in the organisation of information

What is Computational Linguistics?
  Computational linguistics is the scientific study of language from a computational perspective.

What does it mean?
  Scientific
    Provides explanation for a linguistic or psycholinguistic phenomenon
  Computational
    Develops computational models/techniques for linguistic phenomena
  Human language is the subject of study

In other words
  Computational linguistics is the application of linguistic theories and computational techniques to problems of natural language processing.
  http://www.ba.umist.ac.uk/public/departments/registrars/academic office/uga/lang.htm

What do computational linguists do?
  Linguistic research
  Develop language models for processing natural languages
  Develop language resources for NLP research/applications
  Understand and develop models for analysis and generation of natural languages by computers

So, a computational linguist needs to understand
  How language works
  What information is available in the language
  How languages encode information
  How this knowledge/information can be represented for computational processing

Information in Language (1/4)
  Languages encode information
    cuuhe maarate haiN kutte
    'rats' 'kill' 'dogs'
  The Hindi sentence is ambiguous
    Possible interpretations
      Dogs kill rats
      Rats kill dogs
  However, the English sentence is not ambiguous

Information in Language (2/4)
  The ambiguity in Hindi is resolved if an accusative marker is added
    cuuhe maarate haiN kuttoN ko
    'rats' 'kill' 'dogs' 'acc'
  English encodes information in positions, Hindi in morphemes
  Languages encode information differently

Information in Language (3/4)
  Another example: This chair has been sat on
    – The chair has been used for sitting
    – X sat on this chair, and it is known
    – The sentence does not mention X
  Languages encode information partially

Information in Language (4/4)
  English pronouns: he, she, it
  Hindi pronoun: vaha
    He is going to Delhi ==> vaha dillii jaa rahaa hai
    She is going to Delhi ==> vaha dillii jaa rahii hai
    It broke ==> vaha TuuTa ??
  Information does not always map fully from one language into another
  Conceptual worlds may be different

Differences?
  Words
    English         Hindi                                Telugu
    boys <n, pl>    laDake/laDakoN
    he/she/it       vaha                                 atanu/aame/adi
    is/am/are       hai/huuN/haiN/ho
    is going        jaa rahaa hai/rahii hai/rahe haiN

Indian Languages
  Relatively flexible word order
    1. a) baccaa phala khaataa hai
          'child' 'fruit' 'eat+hab' 'pres'
          The child eats fruits
       b) phala baccaa khaataa hai
       c) phala khaataa hai baccaa
       d) baccaa khaataa hai phala

Some structural differences
  English
    Declarative: Ravi is coming today
    Interrogative: Is Ravi coming today?
    A change in the position of 'is' brings the change in meaning
  Hindi
    Declarative: ravi aaj aa rahaa hai
    Interrogative: kyaa ravi aaj aa rahaa hai?
    The word 'kyaa' encodes the question information
    Alternatively, the more natural spoken forms in Hindi
      ravi aaj aa rahaa hai? (with appropriate intonation)
      OR ravi aaj aa rahaa hai kyaa?

Post nominal modification
  'ing' clauses
    I know [the man playing guitar]
  Hindi, on the other hand
    maiN [giTaar bajaa rahe vyakti ko] jaanataa huuN

Clauses having 'un-' negative constructions
  English
    Unless you reach there the job will not be done
  Hindi
    jab tak tum vahaaN nahiiN pahuNcate, kaam nahiiN hogaa

Languages
  Different languages have different mechanisms/devices to encode information
  Some devices are common across certain languages and some are different
  There are alternative ways of expressing the same meaning within the same language
  Languages show preferences for one device over the others
    English exploits 'position' for encoding information
    Hindi uses 'words' more effectively
  Thus, differences in grammatical structures

Ambiguity in Natural Language (1/2)
  Look at the word 'plot' in the following examples
    (a) The plot having rocks and boulders is not good.
    (b) The plot having twists and turns is interesting.
  'plot' in (a) means 'a piece of land' and in (b) 'an outline of the events in a story'
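The contrast between (a) and (b) can be sketched as a context-overlap heuristic: pick the sense whose cue words appear in the sentence. This is only an illustration, not a method from the slides; the cue-word lists in `SENSE_CUES` and the function name `disambiguate_plot` are assumptions made up for the example.

```python
# Hypothetical cue words for each sense of 'plot' (illustrative only).
SENSE_CUES = {
    "piece of land": {"rocks", "boulders", "fence", "soil"},
    "outline of a story": {"twists", "turns", "novel", "character"},
}

def disambiguate_plot(sentence, sense_cues=SENSE_CUES):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word from any sense occurs in the sentence.
    """
    words = {w.strip(".,").lower() for w in sentence.split()}
    scores = {sense: len(words & cues) for sense, cues in sense_cues.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(disambiguate_plot("The plot having rocks and boulders is not good."))
print(disambiguate_plot("The plot having twists and turns is interesting."))
```

A real system would draw the cue words from a lexicon or an annotated corpus rather than a hand-written table.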

Ambiguity in Natural Language (2/2)
  Lexical level
  Sentence level
  Structural differences between the source language (SL) and the target language (TL) in a Machine Translation system

Lexical ambiguity
  Can be both for
    Content words – nouns, verbs etc
    Function words – prepositions, TAMs etc
  Content words' ambiguity is of two types
    Homonymy
    Polysemy

Homonymy
  A word has two or more unrelated senses
  Example:
    I was walking on the bank (river bank)
    I deposited the money in the bank (money bank)

Polysemy
  A word having two or more related senses
  Example: English word 'issue', noun
    1. The issue is under discussion (muddaa)
    2. The latest issue of the journal is out (aNka)
    3. He buys stamps on the day of the issue (vimocan)
    4. The couple has no issue even after five years of marriage (saNtaan)

Information Flow and Ambiguity
  1. He scratched a figure on the rock (engrave)
  2. She scratched the figure on the rock (scrape)
  Other words in the context make a difference
  Change of 'a' (in 1) to 'the' (in 2) changes the meaning of 'scratched'

Function words can also pose problems (1/4)
  Function words can also be ambiguous
  For example, the English preposition 'in'
    (a) I met him in the garden
        maiN usase bagiice meiN milaa
    (b) I met him in the morning
        maiN usase subaha 0 milaa
  'in' corresponds to 'meiN' in (a) and to a null marker (0) in (b)
  'Ambiguity' here refers to the 'appropriate correspondence' in the target language.

Function words can also pose problems (2/4)
  1. He bought a shirt with tiny collars.
     usane chote kaular vaalii kamiiz khariidii
     'he tiny collars with shirt bought'
     'with' gets translated as 'vaalii' in Hindi
  2. He washed a shirt with soap.
     usane saabun se kamiiz dhoii
     'he soap with shirt washed'
     'with' gets translated as 'se'

Function words can also pose problems (3/4)
  TAM markers mark tense, aspect and modality
    – Consist of inflections and/or auxiliary verbs in Hindi
    – An important source of information
    – Narrow down the meaning of a verb (e.g. lied, lay)

Function words can also pose problems (4/4)
  English simple past vs habitual
    1a. He stayed in the guest house during his visit to our University in Jan (rahaa)
    1b. He stayed in the guest house whenever he visited us (rahataa thaa)
    2a. He went to the school just now (gayaa)
    2b. He went to the school everyday (jaataa thaa)

Sentence level ambiguity
  I met the girl in the store
    Possible readings
      a) I met the girl who works in the store
      b) I met the girl while I was in the store
  Time flies like an arrow.
    Possible parses
      a) Time flies like an arrow (N V Prep Det N)
      b) Time flies like an arrow (N N V Det N)
      c) Time flies like an arrow (V N Prep Det N) (flies are like an arrow)
      d) Time flies like an arrow (V N Prep Det N) (manner of timing)
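To see why such sentences are hard for a machine, one can enumerate every tag sequence that a per-word lexicon allows. The sketch below assumes a toy lexicon, `POSSIBLE_TAGS`, built only from the tags that appear in the parses above; the name and the function are illustrative, not part of any tool named in the slides.

```python
from itertools import product

# Tags each word can take, collected from the parses listed on the slide.
POSSIBLE_TAGS = {
    "Time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["Prep", "V"],
    "an": ["Det"],
    "arrow": ["N"],
}

def candidate_tag_sequences(words, lexicon=POSSIBLE_TAGS):
    """Return every combination of per-word tags as a list of tuples."""
    return [tuple(seq) for seq in product(*(lexicon[w] for w in words))]

sentence = ["Time", "flies", "like", "an", "arrow"]
candidates = candidate_tag_sequences(sentence)
print(len(candidates))  # 2 * 2 * 2 * 1 * 1 candidate sequences
for seq in candidates:
    print(" ".join(seq))
```

Only some of the candidates correspond to grammatical parses; narrowing them down is the job of the taggers and parsers discussed later.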

Thus,
  Languages encode information differently
  Languages code information only partially
  Tension between BREVITY and PRECISION
  Brevity wins, leading to inherent ambiguity at different levels

Human beings use
  World knowledge
  Context (both linguistic and extra-linguistic)
  Cultural knowledge and
  Language conventions
  to resolve ambiguities
  Can all this knowledge be provided to the machine?
  Computational Linguistics aims for this.

How to provide this knowledge? (1/2)
  Analyse language at various levels (word, phrase, sentence etc)
  Build tools for analysing the natural language at various levels in a text
    POS tagger (category marking)
    Morphological analysers (analysis of a word)
    Morphological generators (word generators)
    Chunkers (shallow parsers)
    Parsers (syntactic analysis)
    Filters (markers for special expressions)
    Sense disambiguation algorithms
    Etc
  The tools need linguistic knowledge

How to provide this knowledge? (2/2)
  Build language resources
    Machine readable lexicon
    Rules for various levels of linguistic analysis
    Computational grammars
    Mapping rules for the concerned language pair for an MT system
    Sense disambiguation rules
    Annotated corpora
    Etc

POS Tagger
  What is a POS?
    Take the following English sentence
      My old friend Ram recently bought a book on Indian snakes for his cousin from London from the new bookshop.
    Each word in the above sentence belongs to a word class (also called a Part Of Speech (POS))
  The class to which a word may belong is based on its morphological and syntactic behavior
    Morphological
      Kind of affixes a word takes, for example, boy, boys; girl, girls; book, books (noun class)
    Syntactic
      How it is distributed in a sentence
        He chairs the next session (verb)
        The chairs are new (noun)

Why is POS relevant in CL/NLP? (1/2)
  Word class information of a given word in a sentence helps to predict its neighbour
  WSD (word sense disambiguation)
    He runs a mile every day (verb)
    Their team made 250 runs (noun)
    Time flies like an arrow (n v prep det n)
  Helps in further processing – chunking, morph pruning, sentence parsing
  IR (information retrieval)

POS tagged sentence
  My        possessive pronoun
  old       adjective
  friend    noun
  Ram       proper noun
  recently  adverb
  bought    verb
  a         determiner
  book      noun
  on        preposition
  Indian    adjective
  snakes    noun
  for       preposition
  his       possessive pronoun
  cousin    noun
  from      preposition
  London    proper noun
  ,         punctuation
  from      preposition
  the       determiner
  new       adjective
  bookshop  noun

POS Tagging Approaches
  Rule based
  Statistical
  Transformation based

Rule Based POS Tagging
  Two staged architecture algorithms (Harris, 1962; Klein and Simmons, 1963; Green and Rubin, 1971)
    Stage 1: assign POS by referring to the dictionary
      E.g. dictionary entry for the English word 'that': Conj, Adv, Pronoun
    Stage 2: disambiguate, using manually crafted rules
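The two stages can be sketched roughly as follows. Only the dictionary entry for 'that' comes from the slide; every other lexicon entry, the three context rules, and the names `LEXICON`, `lookup` and `apply_rules` are assumptions invented for this illustration and do not reproduce any of the cited systems.

```python
# Stage 1 resource: toy dictionary listing every possible tag for a word.
# The entry for 'that' follows the slide; the rest are assumed.
LEXICON = {
    "that": ["Conj", "Adv", "Pronoun"],
    "i": ["Pronoun"], "he": ["Pronoun"],
    "know": ["Verb"], "came": ["Verb"], "is": ["Verb"],
    "good": ["Adj"], "not": ["Adv"],
}

def lookup(tokens):
    """Stage 1: attach all dictionary tags to each token."""
    return [(t, list(LEXICON.get(t.lower(), ["Unknown"]))) for t in tokens]

def apply_rules(candidates):
    """Stage 2: pick one tag per token with hand-written context rules."""
    tagged = []
    for i, (tok, tags) in enumerate(candidates):
        if len(tags) == 1:
            tagged.append((tok, tags[0]))
            continue
        nxt = candidates[i + 1][1] if i + 1 < len(candidates) else []
        if "Adj" in nxt and "Adv" in tags:
            choice = "Adv"       # 'that good': degree adverb before an adjective
        elif "Pronoun" in nxt and "Conj" in tags:
            choice = "Conj"      # 'know that he ...': introduces a clause
        elif "Pronoun" in tags:
            choice = "Pronoun"   # fallback, as in 'that is good'
        else:
            choice = tags[0]
        tagged.append((tok, choice))
    return tagged

print(apply_rules(lookup("I know that he came".split())))
print(apply_rules(lookup("That is good".split())))
```

The rules here look only one word ahead; hand-built taggers typically need many more rules, which is where the manual effort goes.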

Statistical Taggers
  Use probabilities for tagging
  The tagger picks the most likely tag for a given word in a context
  HMM based algorithms are most commonly used for the POS tagging task
  Requires a manually tagged corpus
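A minimal sketch of an HMM-style tagger, assuming a four-sentence hand-tagged corpus invented for this example: counts from the corpus give transition and emission estimates (with add-one smoothing, an assumption of this sketch), and Viterbi search picks the most likely tag sequence. The names `CORPUS`, `train`, `log_prob` and `viterbi` are illustrative.

```python
import math
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration.
CORPUS = [
    [("he", "PRON"), ("chairs", "VERB"), ("the", "DET"), ("session", "NOUN")],
    [("the", "DET"), ("chairs", "NOUN"), ("are", "VERB"), ("new", "ADJ")],
    [("he", "PRON"), ("runs", "VERB"), ("a", "DET"), ("mile", "NOUN")],
    [("the", "DET"), ("runs", "NOUN"), ("are", "VERB"), ("new", "ADJ")],
]

def train(corpus):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans, emit = defaultdict(Counter), defaultdict(Counter)
    for sent in corpus:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    return trans, emit

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of `key` under `counter`."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit):
    """Return the most likely tag sequence for a non-empty word list."""
    tags = sorted(emit)
    n_words = len({w for c in emit.values() for w in c}) + 1  # +1 for unseen words
    best = [{t: (log_prob(trans["<s>"], t, len(tags))
                 + log_prob(emit[t], words[0], n_words), None) for t in tags}]
    for w in words[1:]:
        col = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + log_prob(trans[p], t, len(tags)), p) for p in tags)
            col[t] = (score + log_prob(emit[t], w, n_words), prev)
        best.append(col)
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for col in reversed(best[1:]):   # follow back-pointers to the start
        tag = col[tag][1]
        path.append(tag)
    return path[::-1]

trans, emit = train(CORPUS)
print(viterbi("the chairs are new".split(), trans, emit))
print(viterbi("he chairs the session".split(), trans, emit))
```

With this toy corpus the tagger labels 'chairs' as a noun after 'the' and as a verb after 'he', which is the context sensitivity the slide describes.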

Annotating Corpus for POS
  Annotated corpora are useful for developing statistical POS taggers
  Tagging scheme
    Set of POS tags
    Guidelines for the annotators
  The tagged corpora should be
    High quality (in terms of tagging accuracy)
    Consistent

POS Tags for English
  Penn Treebank – 45 tags
  C5 (Lancaster) – 61 tags – used in CLAWS; basic tagset used for BNC
    http://view.byu.edu/bnc_tags.htm
  C7 – 147 tags – Leech
    http://www.comp.lancs.ac.uk/ucrel/claws7tags.html

Penn Treebank Tags
  My/PP$ old/JJ friend/NN Ram/NNP recently/RB bought/VBD a/DT book/NN on/IN Indian/JJ snakes/NNS
  for/IN his/PP$ cousin/NN from/IN London/NNP ,/, from/IN the/DT new/JJ bookshop/NN in/IN town/NN

POS Tags for Indian Languages
  Objective
    To arrive at a standard POS and chunk tagging scheme for all Indian languages
  Assumption
    Commonality in Indian languages

Issues in Tag Set Design (1/2)
  Linguistic knowledge: coarse vs fine
  Syntactic function vs lexical category (for POS tags)
  New tags vs tags close to existing English tags
  Should be comprehensive/complete

Issues in Tag Set Design (2/2)
  Simple
  Less effort in manual tagging
  Number of tags
  Common for all Indian languages

Linguistic Knowledge: Fine vs Coarse (1/2)
  Example
    Only noun (NN)
      laDakA, laDake, laDakoM, laDakI, laDakiyAM, laDakiyoM
    OR noun with gender, number, case information
      (NNM) laDakA, laDake, laDakoM
      (NNMS) laDakA, laDake
      (NNMP) laDake, laDakoM
      (NNMSD) laDakA
      (NNMSO) laDake
      (NNMPD) laDake
      (NNMPO) laDakoM
  The decision has implications for the size of corpora and machine learning

Linguistic Knowledge: Fine vs Coarse (2/2)
  Alternatives
    Coarse: NN (advantages/disadvantages)
    Fine: NNMSD (advantages/disadvantages)
    Hierarchical
      Example: NN_m_sg_d
  A hierarchical tag set provides the possibility for underspecification
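Underspecification with a hierarchical tag such as NN_m_sg_d can be sketched as prefix-style matching: a tag that omits trailing features matches any tag that agrees on the features it does specify. The field order (category, gender, number, case) is read off the slide's example; the function names `parse_tag` and `matches` are assumptions for this illustration.

```python
FIELDS = ("category", "gender", "number", "case")

def parse_tag(tag):
    """Split 'NN_m_sg_d' into named fields; unspecified fields are None."""
    parts = tag.split("_")
    parts += [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

def matches(pattern, tag):
    """True if every feature specified in `pattern` agrees with `tag`."""
    p, t = parse_tag(pattern), parse_tag(tag)
    return all(p[f] is None or p[f] == t[f] for f in FIELDS)

print(parse_tag("NN_m_sg_d"))
print(matches("NN", "NN_m_sg_d"))       # coarse pattern covers the fine tag
print(matches("NN_m_pl", "NN_m_sg_d"))  # number disagrees
```

An annotator or a tagger can thus stop at whatever depth it is confident about, and coarse and fine annotations remain comparable.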

Considerations
  A POS tagger is NOT a replacement for a morph analyzer
  Coarse analysis to begin with
    Expandable if needed
  If the information can be obtained from elsewhere, it need not be included in the POS tag

Syntactic function vs lexical category
  Example
    harijana bAlaka
    'harijan' 'child'
  Decision: lexical category
  Helps achieve
    Consistency in annotation
    Better learning

New tags vs tags close to existing English tags
  New tags
    Noun, Pron, Adj, Adv
  Familiar tags (Penn Treebank tags)
    NN, PRP, JJ, RB
  Decision
    Penn tags for common lexical types
    New tags for certain IL-specific types

Comprehensive/Complete
  All the lexical items occurring in a sentence should be marked for their POS, including punctuation.
  If the language has some special cases, these should also be captured
    – Reduplications in ILs

Simple
  Why simple?
    The tags are designed for some manual annotation
    Ease of learning
    Consistency in annotation

Less Effort in Manual Tagging
  The annotators should not have to
    Write too much
    Take too many steps in annotating a lexical item

Number of Tags
  The number of tags makes a difference both for the human annotator and the machine
    For the annotator in decision making
    For the machine in learning for automatic tagging

Common for All Indian Languages
  Indian languages
    Belong to various language families
    Share linguistic features
  However, there are differences
    Some languages have quotatives, some don't
    Some have classifiers, some don't

Chunking
  What forms a chunk?
    Non-recursive phrase ((det adj noun))
    Partial structure without distorting the dependencies
    Include inflections (postposition/auxiliaries) with a lexical category
  Example: ((mere choTe bhaaii ne))_NP ((jaa rahaa hai))_VG

Chunker
  A chunker automatically groups words in a sentence as chunks and labels them
    ((My old friend Ram))_NP ((recently bought))_VG ((a book))_NP on ((Indian snakes))_NP for ((his cousin))_NP from ((London))_NP from ((the new bookshop))_NP.
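A rough sketch of such grouping, assuming the input is already POS tagged: consecutive words whose tags map to the same chunk type are bracketed together, and words with no chunk type (prepositions here) are left outside. The tag-to-chunk table `CHUNK_OF` and the function `chunk` are assumptions for this example, not the chunker the slides refer to.

```python
from itertools import groupby

# Assumed mapping from POS tags to the chunk type they can belong to.
CHUNK_OF = {
    "PP$": "NP", "DT": "NP", "JJ": "NP", "NN": "NP", "NNS": "NP", "NNP": "NP",
    "RB": "VG", "VBD": "VG",
}

def chunk(tagged):
    """Bracket maximal runs of words that share a chunk type."""
    out = []
    for label, group in groupby(tagged, key=lambda wt: CHUNK_OF.get(wt[1])):
        words = " ".join(word for word, _ in group)
        out.append(f"(({words}))_{label}" if label else words)
    return " ".join(out)

tagged = [("My", "PP$"), ("old", "JJ"), ("friend", "NN"), ("Ram", "NNP"),
          ("recently", "RB"), ("bought", "VBD"), ("a", "DT"), ("book", "NN"),
          ("on", "IN"), ("Indian", "JJ"), ("snakes", "NNS"),
          ("for", "IN"), ("his", "PP$"), ("cousin", "NN")]
print(chunk(tagged))
```

Because it merges any adjacent words of the same type, this sketch would wrongly join two neighbouring noun chunks (as in 'gave Ram a book'); a real chunker needs boundary rules or a trained model.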

IL Chunk Tags (1/2)
  NP    noun chunk            bahut acchii kitaab
  JJP   adjective chunk       bahut sundar sii
  RBP   adverb chunk          dhiire-dhiire
  NEGP  chunk for negatives   nahiiN
  CCP   conjunct chunks       raam aur shyaam
  BLK   miscellaneous         interjections etc

IL Chunk Tags (2/2)
  VGF    finite verb chunk                    jaa rahaa hai
  VGNF   non-finite verb chunk                jaate hue
  VGINF  infinitive verb chunk                jaanaa
  VGNN   gerunds                              jaanaa
  FRAGP  discontiguous fragments of a chunk   raama (meraa bhaaii) ne

Some Issues
  How to chunk the following?
    Adverbs: within a verb chunk or separately
      E.g. ((recently bought)) or ((recently)) ((bought))
    Punctuation
    Particles: hii (only), to, bhii (also) etc

Current approach
  Punctuation: chunk it with the preceding chunk
  Adverbs: chunk them separately
  Particles: chunk them with the chunk to which they belong
    ((raam ne bhii)) ((jaa hii rahaa thaa))
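The punctuation and particle rules can be sketched as a single pass that attaches "floating" tokens to the chunk on their left. This assumes each word already carries a provisional chunk type, with None marking particles and punctuation; that input format and the function name `attach_floating` are assumptions of the sketch, and attaching a particle to the left is a simplification of "the chunk to which it belongs".

```python
def attach_floating(tokens):
    """Group (word, chunk_type) pairs into chunks.

    chunk_type is None for particles and punctuation, which are attached
    to the preceding chunk. Returns a list of (chunk_type, [words]).
    """
    chunks = []
    for word, ctype in tokens:
        if chunks and (ctype is None or ctype == chunks[-1][0]):
            chunks[-1][1].append(word)      # extend the preceding chunk
        else:
            chunks.append((ctype, [word]))  # start a new chunk
    return chunks

tokens = [("raam", "NP"), ("ne", "NP"), ("bhii", None),
          ("jaa", "VGF"), ("hii", None), ("rahaa", "VGF"), ("thaa", "VGF"),
          (".", None)]
for label, words in attach_floating(tokens):
    print(f"(({' '.join(words)}))_{label}")
```

On the slide's example this yields one noun chunk containing 'bhii' and one verb chunk containing 'hii' and the final punctuation mark.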

Issues
  Verb negation
    1. nahiiN jaa rahaa 'not going'
    2. kahaa hii nahiiN 'just did not mention'
    3. kaha to nahiiN rahaa thaa 'was not saying' (emphatic)
    4. binaa yaha baata kahe 'without saying this'
    5. yahii nahiiN, balki likhita ruup meiN bhii yah miltaa hai 'Not only this, in fact, this is also found in writing'

Current approach
  For cases 1 to 3, chunk NEG with the verb group
  For 4, chunk the NEG separately in a chunk
  For 5, also a separate NEGP chunk will work
  Noun negation ???

Chunking Co-ordinate Constructions
  1. word1 CC word2
     raam aur shyaam
     ((raam))_NP ((aur))_CCP ((shyaam))_NP
  2. phrase CC phrase
     meraa bhaaii shyaam aur tumhaaraa bhaaii mohan
     ((meraa bhaaii shyaam))_NP ((aur))_CCP ((tumhaaraa bhaaii mohan))_NP
  3. clause CC clause

Discontiguous Phrases
  What about cases such as 'X (Y) Z'?
    where X = noun, Y = a phrase, Z = postposition
    raam (meraa dillii vaalaa bhaaii) ne
    OR isa 'upanyaas-samraaT' shabda kaa
  FRAGP

Chunking Conjunct Verbs
  Conjunct verbs
    A verb composed of a noun/adj and a verb (sviikaar karnaa 'accept')
  Should the conjunct verbs be tagged as a single chunk or two chunks?
    'prawIkSA karanA' ('to wait'), 'kSamA karanA' ('to forgive') etc

What about genitives?
  raam kaa betaa 'son of Ram'
  usakaa betaa 'his/her son'
  mere bhaaii raam kaa betaa 'my brother Ram's son'
  iske pahale 'before this'
  mez ke uupar 'above/on the table'
  ravi ke saath 'with Ravi'

Chunking Numbers/Quantifiers (1/2)
  Numerals and quantifiers may occur as follows
    a) ek laDakaa 'one boy'
    b) 1 laDakaa '1 boy'
    c) pahalaa laDakaa 'first boy'
    d) karoDoN log 'crores (tens of millions) of people'
    e) 1962 meiN 'in 1962'

Chunking Numbers/Quantifiers (2/2)
  The POS tags for numerals and quantifiers are QC (numerals) and QF (other quantifiers) in the IL POS tagset
  Examples (d) and (e) in the previous slide show cases where the quantifier is behaving like a noun
  The issue: should the quantifiers in cases such as (d) and (e) be tagged as Q* or as NN, since the chunk itself is a noun chunk?

Summary
  For annotating POS and chunks, a scheme needs to be designed
  While doing so, the following issues need to be considered
    Definition of 'chunk'
    Elements which together can form a chunk type
    Whether to include postpositions, punctuation etc inside a chunk or treat them as independent chunks
    POS/chunk tag labels

Approaches in Computational Linguistics (for Tools)
  Two major approaches
    Rule based
      Requires manually crafted rules
      Explicit linguistic knowledge
      Needs manual time and effort
      Trained manpower
      High precision
      Less robust

Approaches in Computational Linguistics (for Tools)
  Data driven approach
    Uses statistical methods or machine learning
    Requires less human effort
    Often requires large scale data sources (manually annotated corpora, lexicons etc)
    Linguistic knowledge is implicit
    More adaptive to noisy text
    More robust

Computational Linguistics Application Areas
  Is useful for communication between
    Man – machine
      Question answering systems, interactive railway reservation
      Text summarization
      Web applications
      Intelligent search engines
      Cross lingual search
    Man – man
      Machine translation