
  • Number of slides: 106

Natural Language Processing >> Morphology << winter / fall 2012/2013 41.4268
Prof. Dr. Bettina Harriehausen-Mühlbauer, Univ. of Applied Science, Darmstadt, Germany
https://www.fbi.h-da.de/organisation/personen/harriehausen-muehlbauer-bettina.html
Bettina.Harriehausen@h-da.de
WS 2012/2013 - Natural Language Systems Harriehausen

content
1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)


definition Morphemes
morpheme = smallest possible item in a language that carries meaning
• lexeme (man, house, dog, ...)
• inflectional affixes (dog-s, want-ed, ...)
• other affixes (pre-/in-/suff-): unwanted, atypical, antipathetic, ...
esp. in technical language (-itis = "inflammation", gastro = stomach ... gastroenteritis)


morphemes
free morphemes: stand-alone, carry lexical and morphological meaning (e.g. house = singular, neuter, nominative; i.e. number/gender/case)
bound morphemes: form a legal wordform only in combination with another morpheme (not stand-alone), carry lexical and morphological meaning.
Various combinations exist:
• bound + free: e.g. un-happy
• all bound: e.g. gastro-enter-itis

morphemes
inflectional morphemes: create wordforms of the same lexeme and carry morphological meaning (e.g. dogs, laughed, going)
derivational morphemes: create new words (lexemes) and carry morphological meaning (happily, intellectually, instruction, instructor, insulator, the pounding, limpness, blindness, ...)
Question: which string (~morpheme) do we include in our dictionary?
• full form dictionary vs.
• base form dictionary (lemmas)
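The difference between the two dictionary designs can be sketched in Python. This is a minimal illustration only: the word lists, the suffix list and the function names are assumptions, not part of the lecture material, and the suffix stripping ignores irregular forms and spelling changes.

```python
# Hypothetical toy data for illustration only.
FULL_FORMS = {"dog": "dog", "dogs": "dog", "laughed": "laugh", "going": "go"}
LEMMAS = {"dog", "laugh", "go"}
INFLECTIONAL_SUFFIXES = ["ing", "ed", "s"]


def lookup_full_form(word):
    """Full form dictionary: every wordform is stored explicitly."""
    return FULL_FORMS.get(word)


def lookup_base_form(word):
    """Base form dictionary: store lemmas only and strip inflectional suffixes."""
    if word in LEMMAS:
        return word
    for suffix in INFLECTIONAL_SUFFIXES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in LEMMAS:
                return stem
    return None  # unknown or irregular form (e.g. "went")


print(lookup_full_form("dogs"))
print(lookup_base_form("laughed"))
print(lookup_base_form("went"))
```

The full form dictionary grows with every inflected form, while the base form dictionary stays small but needs morphological rules, and irregular forms such as "went" need extra entries or rules.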

content
1 morphemes
2 compounds / concatenation / decompounding
3 idiomatic phrases
4 multiple word entries (MWE)
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)

compounds / concatenation
Definition: a compound is a lexeme that consists of more than one stem. Compounding (or composition) is the word-formation process that creates compound lexemes (= compounds).
There is no clear upper limit on the number of roots allowed in English compounds. A compound usually doesn't exceed 3 morphemes, but this is clearly a stylistic issue.
• Some compounds are written as one word: blackbird.
• Some are written with hyphens: mother-in-law.
• Most are written as separate words: smoke screen.
Question: What do we put into our dictionary?
Typically not spelling, but stress and word-internal sound rules distinguish compounds from non-compounds: compare white house with White House.

compounds / concatenation
Compounding follows rules, e.g. for chemical compounds (http://www.chem.qmul.ac.uk/iupac/).
Substitutive nomenclature: this naming method generally follows established IUPAC organic nomenclature. E.g.: hydrides of the main group elements (groups 13–17) are given -ane base names, e.g. borane (BH3), oxidane (H2O), phosphane (PH3). The compound PCl3 would be named substitutively as trichlorophosphane.
Additive nomenclature: this naming method has been developed principally for coordination compounds. An example of its application is: [CoCl(NH3)5]Cl2 pentaamminechloridocobalt(III) chloride

Example of a chemical compound
Components of Phane Parent Names: bicyclo[8.6.0]hexadecaphane
• The prefix "bicyclo" indicates that there are two rings (bi-cyclo).
• The bridge descriptor describes the ring structure in terms of a sixteen-membered main ring [8 + 6 + 2 (the bridgehead nodes)] with a bridge consisting of a bond, i.e., zero nodes, which divides the main ring into an eight-membered and a ten-membered ring.
• The numerical term "hexadeca" denotes the presence of sixteen skeletal nodes.
• The term "phane" indicates that at least one node represents a multiatomic (cyclic) structural unit.
[http://www.chem.qmul.ac.uk/iupac/phane/PhI2.html]


Example of a medical compound
Medical compounds are usually composed of a prefix + root + suffix, where none of the components can be used stand-alone.
nephritis:        inflammation of the kidney
supra-renal:      situated above the kidneys
nephrologist:     a kidney doctor
gastroenteritis:  inflammation of stomach and intestines
Components:
nephr-    = kidney; 2 roots: Greek (νεφρός nephr(os)), Latin (ren(es))
gastr-    = stomach, belly; ancient Greek γαστήρ (gastēr), γαστρ-
-o-       linking 2 body parts (linguistically)
enter-    = intestine; ancient Greek ἔντερον (énteron)
-itis     = inflammation
supra-    = above
-ologist  = person studying a certain body part
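The component table above lends itself to a simple lookup. The following Python sketch glosses an already segmented medical term morpheme by morpheme; the dictionary layout, the hyphen-separated input format and the function name gloss are assumptions made for illustration.

```python
# Glosses taken from the slide; the data layout is a hypothetical choice.
MORPHEME_GLOSSES = {
    "nephr": "kidney",
    "gastr": "stomach, belly",
    "o": "(linking element)",
    "enter": "intestine",
    "itis": "inflammation",
    "supra": "above",
    "ologist": "person studying a certain body part",
}


def gloss(segmented_term):
    """Return (morpheme, gloss) pairs for a hyphen-segmented term."""
    return [
        (morpheme, MORPHEME_GLOSSES.get(morpheme, "?"))
        for morpheme in segmented_term.split("-")
    ]


for morpheme, meaning in gloss("gastr-o-enter-itis"):
    print(f"{morpheme:8} {meaning}")
```

Unknown morphemes are glossed as "?" rather than raising an error, since technical vocabularies are open-ended.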


compounds / concatenation
formation of compounds: synthesis and agglutination
Compound formation rules vary widely across language types. Examples of formation processes (usually linked to the language type):
• synthesis (typically with synthetic languages, i.e. languages with a high morpheme-per-word ratio), e.g.
German: Kapitänspatent = Kapitän (sea captain) + Patent (license), joined by an -s- (originally a genitive case suffix); "patent of a sea captain"
Latin: paterfamilias = pater (father) + familias (genitive of the lexeme familia (family)); "father of a family"

compounds / concatenation
formation of compounds: it can get more difficult (German -> English):
Aufsichtsratsmitgliederversammlung =>
Auf        = on
sicht + s  = view + "Fuge-s"
rat + s    = council + "genitive-s"
mit        = with
glied + er = link + "plural"
ver        = "completion"
samml (stem of sammeln) = collect
ung        = "noun"
Notice: "with" and "link" form a derivation that is the German word for "member"; "completion", "collect" and "noun" form a derivation that means "meeting".
On-view-council-with-link-collect ????? = "meeting of members of the supervisory board"

compounds / concatenation
formation of compounds: synthesis and agglutination
• agglutination (usually with agglutinative languages, which tend to create very long words with derivational morphemes), e.g.
German:
Farbfernsehgerät = color television set
Funkfernbedienung = radio remote control
Donaudampfschifffahrtsgesellschaftskapitänsmütze = Danube steamboat shipping company captain's hat
Finnish:
hätä-uloskäytävä = emergency exit
Lentokone-suihku-turbiini-moottori-apu-mekaanikko-aliupseeri-oppilas = Airplane jet turbine engine auxiliary mechanic non-commissioned officer student
Swedish:
rörelseuppskattningssökintervallsinställningar = Motion estimation search range settings

Samples for long compounds in German
• die Armbrust
• die Mehrzweckhalle
• das Mehrzweckkirschentkerngerät
• die Gemeindegrundsteuerveranlagung
• die Nummernschildbedruckungsmaschine
• der Mehrkornroggenvollkornbrotmehlzulieferer
• der Schifffahrtskapitänsmützenmaterialhersteller
• die Verkehrsinfrastrukturfinanzierungsgesellschaft
• die Feuerwehrrettungshubschraubernotlandeplatzaufseherin
• der Oberpostdirektionsbriefmarkenstempelautomatenmechaniker
• das Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
• die Donaudampfschifffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
Wolkenkratzer 'skyscraper': wolken 'clouds' + kratzer 'scraper'
Eisenbahn 'railway': Eisen 'iron' + bahn 'track'
Kraftfahrzeug 'automobile': Kraft 'power' + fahren/fahr 'drive' + zeug 'machinery'
Stacheldraht 'barbed wire': stachel 'barb/barbed' + draht 'wire'
Rinderkennzeichnungs- und Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz: literally, cattle marking and beef labeling supervision duties delegation law

Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)
Chinese (Cantonese Jyutping):
學生 'student': 學 learn + 生 grow
太空 'universe': 太 great + 空 emptiness
摩天樓 'skyscraper': 摩 touch + 天 sky + 樓 building (with more than 1 storey)
打印機 'printer': 打 strike + 印 stamp/print + 機 machine
百科全書 'encyclopaedia': 百 100 + 科 (branch of) study + 全 entire/complete + 書 book
Dutch:
Arbeidsongeschiktheidsverzekering 'disability insurance': arbeid 'labour' + ongeschiktheid 'inaptitude' + verzekering 'insurance'
Rioolwaterzuiveringsinstallatie 'wastewater treatment plant': riool 'sewer' + water 'water' + zuivering 'cleaning' + installatie 'installation'
Verjaardagskalender 'birthday calendar': verjaardag 'birthday' + kalender 'calendar'
Klantenservicemedewerker 'customer service representative': klanten 'customers' + service 'service' + medewerker 'worker'
Universiteitsbibliotheek 'university library': universiteit 'university' + bibliotheek 'library'
Doorgroeimogelijkheden 'possibilities for advancement': door 'through' + groei 'grow' + mogelijkheden 'possibilities'

Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)
Finnish:
sanakirja 'dictionary': sana 'word' + kirja 'book'
tietokone 'computer': tieto 'knowledge, data' + kone 'machine'
keskiviikko 'Wednesday': keski 'middle' + viikko 'week'
maailma 'world': maa 'land' + ilma 'air'
rautatieasema 'railway station': rauta 'iron' + tie 'road' + asema 'station'
suihkuturbiiniapumekaanikkoaliupseerioppilas: 'jet engine assistant mechanic NCO student'
atomiydinenergiareaktorigeneraattorilauhduttajaturbiiniratasvaihde: some part of a nuclear plant
Korean:
안팎 anpak 'inside and outside': 안 an 'inside' + 밖 bak 'outside'
Spanish:
Ciempiés 'centipede': cien 'hundred' + pies 'feet'
Ferrocarril 'railway': ferro 'iron' + carril 'lane'
Paraguas 'umbrella': para 'to stop, stops' + aguas '(the) water'

Samples for long compounds in different languages (see: http://en.wikipedia.org/wiki/Compound_%28linguistics%29)
Icelandic:
járnbraut 'railway': járn 'iron' + braut 'path' or 'way'
farartæki 'vehicle': farar 'journey' + tæki 'apparatus'
alfræðiorðabók 'encyclopædia': al 'everything' + fræði 'study' or 'knowledge' + orða 'words' + bók 'book'
símtal 'telephone conversation': sím 'telephone' + tal 'dialogue'
Italian:
Millepiedi 'centipede': mille 'thousand' + piedi 'feet'
Ferrovia 'railway': ferro 'iron' + via 'way'
Tergicristallo 'windscreen wiper': tergere 'to wash' + cristallo 'crystal, glass'
Japanese:
目覚まし(時計) mezamashi(dokei) 'alarm clock': 目 me 'eye' + 覚まし samashi (-zamashi) 'awakening (someone)' (+ 時計 tokei (-dokei) clock)
お好み焼き okonomiyaki: お好み okonomi 'preference' + 焼き yaki 'cooking'
日帰り higaeri 'day trip': 日 hi 'day' + 帰り kaeri (-gaeri) 'returning (home)'
国会議事堂 kokkaigijidō 'national diet building': 国会 kokkai 'national diet' + 議事 giji 'proceedings' + 堂 dō 'hall'

compounds / concatenation
formation of compounds and their structure:
Most compounds are 2-root compounds, but they come with a number of different structures: Nouns - Adjectives - Verbs
A. Nouns
Noun-Noun       Adjective-Noun   Preposition-Noun   Verb-Noun
apron string    high school      overdose           swearword
hubcap          smallpox         underdog           whetstone
bedroom         poorhouse        uptown             scrubwoman
schoolteacher   bluebird         afterthought       rattlesnake
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.

compounds / concatenation
formation of compounds and their structure:
In each of these cases, the syntactic class* of the compound is the same as the syntactic class of the final element of the compound.
* syntactic class = part-of-speech, such as noun, verb, adjective, …

compounds / concatenation
formation of compounds and their structure:
Noun-Noun       Adjective-Noun   Preposition-Noun   Verb-Noun
schoolteacher   bluebird         afterthought       rattlesnake
In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.
Rule:
• Germanic languages (e.g. English, German) are left-branching (the modifiers come before the head).
schoolteacher = teacher of a school, bluebird = bird of blue color
• Romance languages (e.g. French, Spanish) are usually right-branching; i.e. they are often formed by left-hand heads with prepositional components inserted before the modifier:
chemin-de-fer = railway (lit. 'road of iron')
moulin à vent = windmill (lit. 'mill (that works)-by-means-of wind')

compounds / concatenation
formation of compounds and their structure:
B. Adjectives
Noun-Adjective   Adjective-Adjective   Preposition-Adjective
headstrong       white-hot             overwide
skin-deep        widespread            ingrown
nationwide       bittersweet           underripe
earthbound       hardworking           above-mentioned
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.

compounds / concatenation
formation of compounds and their structure:
B. Adjectives: hardworking
The internal structure may be complex:
hard + work + ing -> hardwork + ing OR hard + working
-ing is typically the aspect suffix that gets added to the verb (root): e.g. play-ing, laugh-ing, ask-ing, …
As a rule, we can form other wordforms (inflections for different tenses) from those roots, following the same inflectional pattern, i.e. verbal root + tense-marking suffix, or insertion of a modal verb:
Simple Present:  He play-s.      He laugh-s.      He ask-s.
Simple Past:     They play-ed.   They laugh-ed.   They ask-ed.
Simple Future:   I will play.    I will laugh.    I will ask.

compounds / concatenation
formation of compounds and their structure:
B. Adjectives: hardworking
The internal structure may be complex:
hard + work + ing -> hardwork + ing OR hard + working
* He hardworks.
* They hardworked.
* I will hardwork.
-> the analysis hardwork + ing is ruled out, i.e. hardwork is not a verb by itself
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

compounds / concatenation
formation of compounds and their structure:
B. Adjectives: hardworking
Since * He hardworks, * They hardworked and * I will hardwork are not well-formed, the structure is hard + working rather than hardwork + ing:
[Adj [Adv hard] [Adj [verb work] [suffix ing]]]
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)

compounds / concatenation
formation of compounds and their structure:
C. Verbs
Noun-Verb     Adjective-Verb   Preposition-Verb   Verb-Verb
spoonfeed     dry-clean        outlive            sleepwalk
aircondition  whitewash        overdo
window-shop   broadcast        uproot
(see: http://public.wsu.edu/~gordonl/S05/256/compounds.htm)
In each of these cases, the syntactic class of the compound is the same as the syntactic class of the final element of the compound.
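The observation that an English compound inherits the syntactic class of its final element can be sketched in Python. The tiny part-of-speech lexicon and the function name compound_class are assumptions made for illustration only; the rule as coded applies to head-final (Germanic) compounds, not to the left-headed Romance pattern.

```python
# Hypothetical toy part-of-speech lexicon for the components.
POS = {
    "school": "noun", "teacher": "noun",
    "blue": "adjective", "bird": "noun",
    "head": "noun", "strong": "adjective",
    "dry": "adjective", "clean": "verb",
    "over": "preposition", "do": "verb",
}


def compound_class(parts):
    """Head-final rule: the compound takes the class of its last element."""
    return POS[parts[-1]]


for parts in (["school", "teacher"], ["head", "strong"], ["dry", "clean"]):
    print("".join(parts), "->", compound_class(parts))
```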

semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric (description: A+B denotes a special kind of B)
• exocentric
• copulative
• appositional
Endocentric compounds consist of a head and modifiers, which restrict the meaning of the head. Endocentric compounds tend to be of the same part of speech (word class) as their head.
Examples:
- doghouse, where house is the head and dog is the modifier; i.e. a house intended for a dog
- darkroom, where dark modifies room; i.e. a type of room (usually used in photography)

semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric (description: (one) whose B is A)
• copulative
• appositional
Exocentric compounds have an unexpressed semantic head (e.g. a person, a plant, an animal, ...), and their meaning is often not transparent from their constituent parts.
Examples:
- white-collar is neither a kind of collar nor a white thing, but the collar's colour is a metaphor for socioeconomic status
- red-neck only indirectly refers to a neck, but refers to a working person (e.g. farmer)
- skinhead may refer to a bald head but also refers to a certain group of people
- paleface: native American Indians call the White Man a paleface

semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative (description: A+B denotes 'the sum' of what A and B denote)
• appositional
Copulative compounds are compounds which have two semantic heads.
Examples:
- bittersweet; having both tastes
- sleepwalk; sleeping while walking OR walking in your sleep

semantics of compounds
Semantic classification: it is common to classify compounds into 4 types:
• endocentric
• exocentric
• copulative
• appositional (description: A and B provide different descriptions for the same referent; the meaning can be characterized as 'A AS WELL AS B')
Appositional compounds refer to lexemes that have two (contrary) attributes which classify the compound.
Examples:
- actor-director; an actor who also plays the role of the director
- maidservant; a maid who is also a servant OR a servant who is also a maid
- player-coach; someone who is a player as well as a coach

semantics of compounds (ambiguities)
When, in Germanic languages (e.g. German, English), compound words are formed by prepending a descriptive word to the main word, the relationship between the components may be ambiguous. This is a problem for decompounding or translation.
-> the orange bowl problem

semantics of compounds (ambiguities)
Can you please bring me the orange bowl?
• bowl filled with oranges ???
• bowl having the shape of an orange ??
• bowl with an orange pattern
• bowl of orange colour
• bowl that was formerly / usually filled with oranges

compounding - decompounding -> follows rules
principles / rules:
FANO rule: "the analysis is unambiguous, when a morpheme is not the beginning of another morpheme" (= principle of longest match), e.g. but / butter
(Orthographic) ambiguities in segmentation:
horseshoe: horses-hoe (?) vs. horse-shoe (the FANO rule would lead to the incorrect/unlikely segmentation)
Segmentation has to be done recursively in order to find all possibilities:

compounding - decompounding
English:
petshopping: pet-shopping vs. pets-hopping
egg roll: Chinese food vs. rolling egg
a green ´house vs. a ´greenhouse
The white ´house vs. The ´White House

compounding - decompounding
German:
Staubecken: Stau-becken = a reservoir; Staub-ecken = dusty corners
Wachstube: Wach-stube = the room of a guard; Wachs-tube = a tube filled with wax
Gelbrand: Gelb-rand = a yellow border; Gel-brand = burning of a gel
Tonerkennung: Toner-kennung = the identifier of a toner; Ton-erkennung = the identification of tones
Lachen: Lache-n = multiple puddles of water; Lachen = laughter
Druckerzeugnis: Druck-erzeugnis = printed matter; Drucker-zeugnis = certificate for a printer
beinhalten: bein-halten vs. be-inhalten (imagine: Beinhalten…)
Abteilungen: Abtei-lungen vs. Abteil-ungen
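The contrast between longest-match segmentation and recursive segmentation that finds all possibilities can be sketched in Python. This is a minimal illustration under stated assumptions: the lexicon is a hypothetical toy word list, linking elements such as the German -s- are ignored, and the function names are invented for this example.

```python
# Hypothetical toy lexicon; a real system would use a full morpheme dictionary.
LEXICON = {
    "horse", "horses", "shoe", "hoe",
    "stau", "staub", "becken", "ecken",
    "wach", "wachs", "stube", "tube",
}


def longest_match(word, lexicon=LEXICON):
    """Greedy segmentation: always take the longest known prefix."""
    word = word.lower()
    parts, pos = [], 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            if word[pos:end] in lexicon:
                parts.append(word[pos:end])
                pos = end
                break
        else:
            return None  # no known morpheme starts at this position
    return parts


def all_segmentations(word, lexicon=LEXICON):
    """Recursive segmentation: return every way to split word into lexicon entries."""
    word = word.lower()
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in lexicon:
            for rest in all_segmentations(word[end:], lexicon):
                results.append([prefix] + rest)
    return results


print(longest_match("horseshoe"))
print(all_segmentations("horseshoe"))
print(all_segmentations("Staubecken"))
```

With this lexicon the greedy strategy commits to horses-hoe, while the recursive version returns both candidates and leaves the choice to context or stress information.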

compounding - decompounding
context or stress (in spoken language) is needed for disambiguation

(problems with) concatenation
Summary: structural as well as semantic challenges with compounds:
• ambiguities in meaning (orange bowl)
• ambiguities in hyphenation points (Staubecken)
• not all morphemes can form a compound (sheepchops)


compounds -> MWE -> idiomatic phrases
= increasing the idiomatic rigidity, increasing the formal complexity
In addition to the compounds that have one of the four descriptions (endocentric, exocentric, copulative, appositional), i.e. that stick to the original lexical meaning of at least one of their components, we need to consider "multiple morpheme strings / multi word expressions (MWE)" (fixed phrases) that have "lost" the original lexical meaning of their components. Those MWE are called idiomatic phrases or idioms.
• compounding: combination of lexical meanings: carseat, houseboat, cellar door, ...
• compounding: not a combination of the lexical meanings: starfish, paperback, ladybug, ...
• depending on the context: bite the dust, lose face, kick the bucket, ...


idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/englisch)
• Out of the blue
• To be on Cloud Nine
• A leopard cannot change its spots
• Head over heels
• Fair Play
• As cool as a cucumber
• The early bird catches the worm
• As fit as a fiddle
• Beat about the bush
• The Big Apple
• The apple of my eye
• Wet behind the ears
• A bird in the hand is worth two in the bush
• It's raining cats and dogs

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Wie bei Hempels unterm Sofa
• Schmetterlinge im Bauch
• Jemanden übers Ohr hauen
• Ein Bäuerchen machen
• Mit jemandem durch dick und dünn gehen
• Seine Pappenheimer kennen
• Jemandem die Würmer aus der Nase ziehen
• Die Arschkarte ziehen
• Mit jemandem Pferde stehlen können
• Sich aus dem Staub machen
• Hummeln im Hintern haben
• Im siebten Himmel sein
• Viele Wege führen nach Rom
• Mit einem lachenden und einem weinenden Auge
• Nah am Wasser gebaut haben
• Da ist der Bär los
• Nachtigall, ick hör dir trapsen
• Mein lieber Scholli!

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Jemandem einen Denkzettel verpassen
• Sich auf den Schlips getreten fühlen
• Alles für die Katz
• Wo drückt denn der Schuh?
• Gegen den Strich gehen
• Den Faden verlieren
• Etwas ausbaden müssen
• Einen Stein im Brett haben
• Bahnhof verstehen
• Der springende Punkt
• Der Sündenbock sein
• Einen Ohrwurm haben
• Das ist doch zum Mäusemelken!
• Schmiere stehen
• Den Teufel an die Wand malen
• Auf dem Holzweg sein
• Eselsbrücke
• In der Kreide stehen

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Die Ohren steif halten
• Auf Vordermann bringen
• Um die Ecke bringen
• Hals- und Beinbruch
• Auf dem Kerbholz haben
• Eine Schlappe einstecken
• Frosch im Hals
• Es zieht wie Hechtsuppe
• Jemandem einen Bärendienst erweisen
• Damoklesschwert
• Tomaten auf den Augen haben
• Jemandem raucht der Kopf
• Für 'n Appel und 'n Ei
• Etwas an die große Glocke hängen
• Das ist Jacke wie Hose
• Etwas aus dem Ärmel schütteln
• Ein X für ein U vormachen
• Jemandem nicht das Wasser reichen können

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Alles im grünen Bereich
• Die Hand ins Feuer legen
• Das kann kein Schwein lesen!
• Auf Draht sein
• Sein blaues Wunder erleben
• Der hat es faustdick hinter den Ohren
• Mein Name ist Hase, ich weiß von nichts
• Aus dem Stegreif
• Der Groschen ist gefallen
• Einen Vogel haben
• Den Kürzeren ziehen
• Bis in die Puppen
• Etwas hinter die Ohren schreiben
• Ins Fettnäpfchen treten
• Beleidigte Leberwurst
• Jemanden auf dem Kieker haben
• Ich verstehe immer nur Bahnhof!
• Die Katze im Sack kaufen

idiomatic phrases (http://www.geo.de/GEOlino/mensch/redewendungen/deutsch)
• Bekannt wie ein bunter Hund
• Den Kopf in den Sand stecken
• Mit dem ist nicht gut Kirschen essen
• Aller guten Dinge sind drei
• Lampenfieber
• Das kommt mir spanisch vor
• Schwein haben
• Das hast du dir selbst eingebrockt
• Seinen Senf dazugeben
• Jemandem ist eine Laus über die Leber gelaufen
• Kalte Füße bekommen
• Im Stich lassen
• Schwedische Gardinen
• Alles in Butter
• Geld auf den Kopf hauen
• Das Handtuch werfen
• Sich mit fremden Federn schmücken

idiomatic phrases – and their morpho-syntax
Idiomatic expressions are extremely rigid, in that morpho-syntactic modifications are not allowed (without a change in meaning):
GERMAN
Singular - Plural:
• Bekannt wie ein bunter Hund (lit. "known like a colourful dog")
• ??? Bekannt wie bunte Hunde.
• * Bekannt wie 2 bunte Hunde.
Adjectival modification:
• Den Kopf in den Sand stecken. (lit. "to stick one's head in the sand")
• Den Kopf in den weichen Sand stecken.

idiomatic phrases – and their morpho-syntax
Idiomatic expressions are extremely rigid, in that morpho-syntactic modifications are not allowed (without a change in meaning):
ENGLISH
Adjectival modification:
• to be on cloud nine -> * to be on cloud eight
Singular - Plural:
• The early bird gets the worm. -> ? The early birds get the worm.
• It's raining cats and dogs. -> * It's raining 2 cats and 3 dogs.
Neither adjectival modification nor change of subject:
• He kicked the bucket.
• * He kicked the green bucket.
• * It kicked the bucket.
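Because idioms tolerate so little modification, one simple way to recognise them is to store each idiom as a fixed token sequence and look for exact contiguous matches. The Python sketch below illustrates this idea; the idiom list, the whitespace tokenisation and the function name find_idioms are assumptions for illustration, and a purely formal match like this cannot detect the subject restriction (* It kicked the bucket).

```python
# Hypothetical toy idiom lexicon: each idiom is a fixed sequence of surface forms.
IDIOMS = [
    ("on", "cloud", "nine"),
    ("raining", "cats", "and", "dogs"),
    ("kicked", "the", "bucket"),
]


def find_idioms(sentence, idioms=IDIOMS):
    """Return the idioms that occur as exact contiguous token sequences."""
    tokens = sentence.lower().rstrip(".!?").split()
    found = []
    for idiom in idioms:
        n = len(idiom)
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == idiom:
                found.append(" ".join(idiom))
    return found


print(find_idioms("He kicked the bucket."))
print(find_idioms("He kicked the green bucket."))
print(find_idioms("It's raining 2 cats and 3 dogs."))
```

The modified variants are not matched, so they would be analysed literally, which mirrors the observation that modification changes the meaning.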

content
1 morphemes
2 compounds / concatenation
3 idiomatic phrases
4 multiple word entries (MWE) – and their relationship
5 spell aid
6 regular expressions
7 Finite State Automata (FSA)

multiple word entries (MWE)
We have already looked at the semantics / meaning of compounds and idioms. But what about the relationships within the MWE?

multiple word entries (MWE)
Problems: the relationships among the components change (the „Schnitzel“ problem)
• Schweineschnitzel / -steak
• Pfefferschnitzel / -steak
• Wienerschnitzel
• Soyaschnitzel
• Rückensteak, Lendensteak, Ribeyesteak
• Minutenschnitzel / -steak
• Jäger Schnitzel
• Zigeuner Schnitzel
• Tiefkühl-Schnitzel

multiple word entries (MWE)
Problems: the relationships among the components change (the „Schnitzel“ problem)
• Schweineschnitzel / -steak: made of pork
• Pfefferschnitzel / -steak: garnished / spiced with pepper
• Wienerschnitzel: a certain recipe
• Soyaschnitzel: made of soy
• Rückensteak, Lendensteak, Ribeyesteak: body part
• Minutenschnitzel / -steak: time / length of cooking
• Jäger Schnitzel: a certain recipe
• Zigeuner Schnitzel: a certain recipe
• Tiefkühl-Schnitzel: status (frozen)

multiple word entries (MWE)
Problems: the relationships among the components change (the „Schnitzel“ problem)
Even though the individual lexical meanings remain untouched in the compound, the relationships between the components vary tremendously!

multiple word entries (MWE)
The 3 main relationships (default?) between parts of a compound word (the role of global knowledge in decompounding):
1 is-a / is-part-of / genitive
• doorknob: knob of the door
• carseat: seat of the car
2 made from / material
• glassdoor: door made of glass
• nutbread: ≠ bread of the nut
3 used for
• waterglass: glass filled with water
• oiltruck: truck that carries oil (≠ truck made of oil)


spell aid
In NLP, decompounding algorithms are essential for spell checking / spell aid.
How do we define a lexical error in NLP terms?
An error is a string that cannot be matched with a dictionary entry. It is not necessarily an incorrect word (esp. neologisms).

spell aid
Neologism (definition): A neologism is a new term, word or phrase that may or may not be in the process of entering common use, but has not yet been accepted into mainstream language, i.e. it has NOT entered written dictionaries (yet).
For a long time neologisms were mainly seen as pathological or deviant: Webster’s Third New International Dictionary (1966) describes a neologism as „a meaningless word coined by a psychotic“.
Examples (http://www.neologisms.us/): a-er, aagram string, aangram, Aazymurgy, abasure, abberateur, abbrantcooty, abbrhyme, abched, abilliant, abomasum, abrabro, abrickity, abthurt

spell aid - neologisms
http://www.wortwarte.de/
New words of 25.9.2011. Today we serve you 23 new words:
• Alles-Apparat, der
• Ampelorgie, die
• ärzteloyal (adjective)
• Distanzmanöver, das
• Drivingcenter, das
• E-Ball-Match, das
• Ego-Archäologe, der
• Full-Flat, die
• Gefällt-mir-Klick, der
• Geschmacksfarbe, die
• HD-Livestream, der
• Inlineskater-Marathon, der
• Leerheitsanalyse, die
• mitnahmefähig (adjective)
• nachkochsicher (adjective)
• Nerdpartei, die
• Neutrino-Witz, der
• Panda-Umarmer, der
• Radfahrlinksabbiegerspur, die
• Schwungrad-Technologie, die
• Sugar-Stick, der
• Zahnspangen-Dichte, die
• Zeiterfassungschip, der

spell aid - neologisms
• AIDS
• to xerox
• googling / to google
• photoshopping
• Kleenex
• to pamper
• texting / to text
• ...
• l.o.l. OR lol
• LG
• HDGDL
• LOL - laut herauslachen (laughing out loud)

spell aid – chat language (acronyms)
AFAIK -- As Far As I Know
AFK -- Away From Keyboard
ASAP -- As Soon As Possible
BAS -- Big A** Smile
BBL -- Be Back Later
BBN -- Bye Bye Now
BBS -- Be Back Soon
BEG -- Big Evil Grin
BF -- Boyfriend
BIBO -- Beer In, Beer Out
BRB -- Be Right Back
BTW -- By The Way
BWL -- Bursting With Laughter
C&G -- Chuckle and Grin
CICO -- Coffee In, Coffee Out
CID -- Crying In Disgrace
CP -- Chat Post (a chat message)
CRBT -- Crying Real Big Tears
CSG -- Chuckle Snicker Grin
CYA -- See You (Seeya)
CYAL8R -- See You Later (Seeyalata)
DLTBBB -- Don't Let The Bed Bugs Bite
EG -- Evil Grin
EMSG -- Email Message
FC -- Fingers Crossed
FTBOMH -- From The Bottom Of My Heart
FYI -- For Your Information
See: http://www.chatdefinitions.com/

spell aid – chat language (symbols)
:-9 -- Delicious, Yummy
:-> -- Devilish
;-> -- Devilish Wink
:P -- Disgusted (sticking out tongue)
:*) -- Drunk
:-6 -- Exhausted, Wiped Out
:( -- Frown
~/ -- Full Glass
_/ -- Glass (drink)
^5 -- High Five
:-| -- Ambivalent
o:-) -- Angelic
>:-( -- Angry
|-I -- Asleep
(::()::) -- Bandaid
:-{} -- Blowing a Kiss
-o -- Bored
:-c -- Bummed Out
|C| -- Can of Coke
|P| -- Can of Pepsi
:( ) -- Can't Stop Talking
:*) -- Clowning
:' -- Crying
:'-) -- Crying with Joy
:'-( -- Crying Sadly
See: http://www.chatdefinitions.com/

spell aid
Spell checking algorithms are based on the following types of mistakes and repairs (statistics!):
• phonetic similarities (ph – f: telephone – telefone)
• deletion of repeated letters (mouuse – mouse)
• wrong order (from – form; mouse – muose)
• substitution of neighbouring letters on the keyboard (miuse – mouse)
• insertion of missing letters (vowels in between consonants ...) (telephne)
• typos occur towards the end of a word (assumption: the first letter is correct)
• segmentation / decomposition into substrings (horses‘hoe – horse‘shoe)
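Several of the repair types listed above can be sketched as a generator of correction candidates that are then filtered against the dictionary. This is only a minimal illustration: the word list `DICTIONARY` and the partial keyboard map `KEY_NEIGHBOURS` are hypothetical, and phonetic matching and segmentation are left out.

```python
import string

# Hypothetical toy dictionary; a real spell aid would load a full word list.
DICTIONARY = {"mouse", "from", "form", "telephone", "horseshoe"}

# Assumed, partial QWERTY neighbourhood map (only the keys needed here).
KEY_NEIGHBOURS = {"i": "uo", "o": "ip", "u": "yi"}


def candidates(word):
    """Return dictionary words reachable from `word` by one simple repair."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Delete one letter: mouuse -> mouse
    deletes = {a + b[1:] for a, b in splits if b}
    # Swap two adjacent letters: muose -> mouse
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    # Replace a letter by a keyboard neighbour: miuse -> mouse
    subs = {a + c + b[1:] for a, b in splits if b
            for c in KEY_NEIGHBOURS.get(b[0], "")}
    # Insert one missing letter: telephne -> telephone
    inserts = {a + c + b for a, b in splits for c in string.ascii_lowercase}
    return (deletes | swaps | subs | inserts) & DICTIONARY


print(candidates("muose"))
```

A statistical spell aid would additionally rank the candidates, for example by how frequent each repair type is.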

spell aid
• phonetic similarities (ph – f: telephone – telefone)
• deletion of repeated letters (mouuse – mouse)
• wrong order (from – form; mouse – muose)
• substitution of neighbouring letters on the keyboard (miuse – mouse)
• insertion of missing letters (vowels in between consonants ...) (telephne)
• typos occur towards the end of a word (assumption: the first letter is correct)
• segmentation / decomposition into substrings (horeshoe – horseshoe)

spell aid
• include missing letters: www.dositey.com/language/spelling/Mislet3.htm

spell aid
How does spell checking work (w.r.t. grammar checking)? Various degrees of „intelligence“:
• System A: no match found in the dictionary -> mark the entry as incorrect.
• System B: no match found in the dictionary. Initiate a rudimentary parse (left-to-right search). Try to identify the word class, i.e. limit the possibilities, and continue the sentential analysis, e.g. the ... man (statistics: DET + ADJ + NOUN); n-gram.
• System C: no match found in the dictionary. Initiate a segmentation of the word to identify the word class, e.g. look for typical endings (-ly = adverb / capital letters = proper noun, ...). This way newly coined words can be identified (e.g. any word ending in -ness = noun); n-gram.
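The System C idea can be sketched as a fallback that guesses a word class from the shape of an unknown word. The rule list `SUFFIX_RULES`, the tag names and the function name are assumptions made for this illustration; they only cover the endings mentioned above.

```python
# Hypothetical suffix rules for illustration only; checked in order.
SUFFIX_RULES = [("ness", "NOUN"), ("ly", "ADVERB")]


def guess_word_class(token, dictionary):
    """Look the token up; if unknown, guess its word class from its form."""
    lowered = token.lower()
    if lowered in dictionary:
        return dictionary[lowered]
    for suffix, word_class in SUFFIX_RULES:
        # Require a non-empty stem in front of the suffix.
        if lowered.endswith(suffix) and len(lowered) > len(suffix):
            return word_class
    if token[:1].isupper():
        # Capitalisation as a weak hint for a proper noun.
        return "PROPER_NOUN"
    return "UNKNOWN"


print(guess_word_class("googleness", {"the": "DET"}))
```

The capitalisation test is deliberately naive: sentence-initial words are capitalised too, so a real system would combine it with position information.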

n-grams / language models (statistical language processing)
An n-gram is a contiguous sequence of n items from a given string.
A complete string of words: $w_1 \dots w_n$ or $w_1^n$
In NLP, the items in question can be phonemes, syllables, letters, words or any substring, depending on the application.
An n-gram of size 1 is a "unigram", size 2 a "bigram", size 3 a "trigram", ..., size n an "n-gram".

n-grams / language models (statistical language processing)
Example: "he reads a book"
For a sequence of words, the trigrams would be: "# he reads", "he reads a", "reads a book", and "a book #".
For sequences of characters, the trigrams that can be generated from "hello world" are "hel", "ell", "llo", "lo ", "o w", " wo", "wor" etc.
In practice, we often
• collapse whitespace to a single space
• remove punctuation
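The two trigram examples can be reproduced with a small helper. The function name `ngrams` and its `pad` parameter are choices made for this sketch; a single boundary marker "#" is added on each side, as in the word example above.

```python
def ngrams(items, n, pad=None):
    """Return all contiguous n-grams of `items` as tuples.

    If `pad` is given, one boundary marker is added on each side first.
    """
    items = list(items)
    if pad is not None:
        items = [pad] + items + [pad]
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word trigrams with boundary markers.
word_trigrams = [" ".join(g) for g in ngrams("he reads a book".split(), 3, pad="#")]
# Character trigrams without padding.
char_trigrams = ["".join(g) for g in ngrams("hello world", 3)]

print(word_trigrams)
print(char_trigrams)
```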

n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
File sizes: approx. 24 GB compressed (gzip'ed) text files
Number of sentences: 95,119,665,584
Number of unigrams: 13,588,391
Number of bigrams: 314,843,401
Number of trigrams: 977,069,902
Number of fourgrams: 1,313,818,354
Number of fivegrams: 1,176,470,663

n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
trigrams:
ceramics collectables collectibles 55
ceramics collectables fine 130
ceramics collected by 52
ceramics collectible pottery 50
ceramics collectibles cooking 45
ceramics collection , 144
ceramics collection . 247

n-grams / language models (statistical language processing)
Example of an n-gram count from the GOOGLE n-gram corpus:
(http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html#!/2006/08/all-our-ngram-are-belong-to-you.html)
fourgrams:
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40

n-grams / language models (statistical language processing)
In an n-gram analysis, we compute the probability of the occurrence of x (e.g. a letter or word) AFTER a certain sequence, i.e. the conditional probability of x is always given on the basis of the PREVIOUS words/characters.
Example: for ex_ in English, the probabilities for the next letter are a = 0.4, b = 0.00001, c = 0, ... (all probabilities sum to 1).
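Such a conditional distribution can be estimated by counting which character follows a given history in a text. The sketch below uses relative frequencies on a tiny made-up string; the function name and the sample text are assumptions, and the resulting numbers are not the English probabilities quoted above.

```python
from collections import Counter


def next_char_distribution(text, history):
    """Estimate P(next character | history) by relative frequency in `text`."""
    h = len(history)
    counts = Counter(
        text[i + h]
        for i in range(len(text) - h)
        if text[i:i + h] == history
    )
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


# Toy sample text (hypothetical); a real model would be trained on a large corpus.
dist = next_char_distribution("example exam exit extra", "ex")
print(dist)
```

Characters that never follow the history in the sample simply get no entry, which corresponds to a probability of 0.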

n-grams / language models (statistical language processing)
The theory behind it: A statistical language model assigns a probability $P(w_1, \dots, w_m)$ to a sequence of m words by means of a probability distribution. Each word (or character) is assumed to depend only on the last n-1 words.
More concisely, an n-gram model predicts $x_i$ based on $x_{i-(n-1)}, \dots, x_{i-1}$. In probability terms, this is $P(x_i \mid x_{i-(n-1)}, \dots, x_{i-1})$.
This is also called an (n-1)-order Markov model.
In speech recognition, sequences of phonemes are often modeled using an n-gram distribution.

n-grams / language models (statistical language processing)
In an n-gram model, the probability $P(w_1, \dots, w_m)$ of observing the sentence $w_1, \dots, w_m$ can be approximated:
$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \dots, w_{i-1})$$
It is assumed that the probability of observing the i-th word $w_i$ in the context history of the preceding i-1 words can be approximated by the probability of observing it in the shortened context history of the preceding n-1 words.
In a bigram (n=2) language model, the probability of the sentence "I saw the red house" is approximated as:
$$P(\text{I, saw, the, red, house}) \approx P(\text{I})\,P(\text{saw} \mid \text{I})\,P(\text{the} \mid \text{saw})\,P(\text{red} \mid \text{the})\,P(\text{house} \mid \text{red})$$
Whereas in a trigram (n=3) language model, the approximation is:
$$P(\text{I, saw, the, red, house}) \approx P(\text{I})\,P(\text{saw} \mid \text{I})\,P(\text{the} \mid \text{I, saw})\,P(\text{red} \mid \text{saw, the})\,P(\text{house} \mid \text{the, red})$$
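The bigram approximation, P(w1) multiplied by P(wi | wi-1) for each following word, can be sketched with plain relative-frequency counts. The three training sentences are invented for this example, and the estimate uses no smoothing and no sentence-boundary markers, so any unseen bigram gives a probability of 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """P(w1) * product of P(w_i | w_{i-1}), estimated by relative frequency."""
    words = sentence.split()
    p = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p


# Hypothetical toy corpus.
corpus = ["I saw the red house", "I saw the dog", "the red house is old"]
unigrams, bigrams = train_bigram(corpus)
p = sentence_probability("I saw the red house", unigrams, bigrams)
print(p)
```

On this corpus the only factors below 1 are P(I) = 2/14 and P(red | the) = 2/3, so the product is 2/21.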

single characters (German) (statistical language processing)
source: http://de.wikipedia.org/wiki/Buchstabenh%C3%A4ufigkeit



regular expressions (Jurafsky, section 2.1)
• In order to figure out whether something is an incorrect word, the machine has to match the string (= a sequence of symbols, i.e. any sequence of alphanumeric characters: letters, numbers, spaces, tabs, punctuation) to an entry in the dictionary.
• Other matches: e.g. information retrieval in WWW search engines (Google, AltaVista, …).
• The standard notation for characterizing text sequences is the regular expression.
• Regular expressions are supported by languages and tools such as Perl and grep (Global Regular Expression Print).
• Formally, regular expressions are algebraic notations for characterizing a set of strings.
• Regular expression search requires a pattern that we want to search for and a corpus of text to search through (text mining!).

regular expressions (Jurafsky, section 2.1)
Example: Search for the pattern “linguistics”.
• You also want to find documents with “Linguistics” and “LINGUISTICS”. (Remember: the computer does EXACTLY what you tell it to do …)
• The regular expression /linguistics/ matches any string in any document containing exactly the substring “linguistics”.
• Regular expressions are case sensitive.
• Samples (Jurafsky, p. 23):
regular expression    example pattern matched
/woodchucks/          “interesting links to woodchucks and lemurs”
/a/                   “Mary Ann stopped by Mona’s”
/Claire says,/        “Dagmar, my gift please,” Claire says,”
/song/                “all our pretty songs”
/!/                   “You’ve left the burglar behind again!” said Nori

regular expressions (Jurafsky, section 2.1)
linguistics - Linguistics: to search for alternative characters “l” and/or “L” we use square brackets: [lL]
regular expression    match                         sample pattern
/[lL]inguistics/      Linguistics or linguistics    “computational linguistics is fun”
/[1234567890]/        any digit                     “this is Linguistics 5981”

regular expressions (Jurafsky, section 2.1)
To search for a character in a range we use the dash: [-]
regular expression    match                   sample pattern
/[A-Z]/               any uppercase letter    “this is Linguistics 5981”
/[0-9]/               any single digit        “this is Linguistics 5981”
/[0-9]/ is equivalent to /[1234567890]/.

regular expressions (Jurafsky, section 2.1)
To search for negation, i.e. a character that I do NOT want to find, we use the caret: [^]
regular expression    match                     sample pattern
/[^A-Z]/              not an uppercase letter   “this is Linguistics 5981”
/[^Ll]/               neither L nor l           “this is Linguistics 5981”
/[^\.]/               not a period              “this is Linguistics 5981”
Special characters (escaped with a backslash):
\*    an asterisk         “L*I*N*G*U*I*S*T*I*C*S”
\.    a period            “Dr. Doolittle”
\?    a question mark     “Is this Linguistics 5981?”
\n    a newline
\t    a tab
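The bracket, range, negation and escape notations can be tried out directly. The slides write patterns in Perl-style /.../ notation; the sketch below uses Python's `re` module instead, where the same patterns are written as raw strings without the slashes.

```python
import re

text = "this is Linguistics 5981"

# [lL] matches either case of the first letter.
either_case = re.findall(r"[lL]inguistics", text)
# [0-9] matches one digit per match.
digits = re.findall(r"[0-9]", text)
# [A-Z] matches one uppercase letter.
uppercase = re.findall(r"[A-Z]", text)
# [^A-Z] matches the first character that is NOT an uppercase letter.
first_non_upper = re.search(r"[^A-Z]", text).group()
# \. matches a literal period instead of acting as a wildcard.
periods = re.findall(r"\.", "Dr. Doolittle")

print(either_case, digits, uppercase, first_non_upper, periods)
```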

regular expressions (Jurafsky, section 2.1)
To search for optional characters we use the question mark: ?
regular expression    match              sample pattern
/colou?r/             colour or color    “beautiful colour”
To search for any number of a certain character we use the Kleene star: *
regular expression    match
/a*/                  any string of zero or more “a”s
/aa*/                 at least one “a”, followed by any number of “a”s

regular expressions (Jurafsky, section 2.1)
To look for at least one character of a type we use the Kleene “+”:
regular expression    match
/[0-9]+/              a sequence of digits
Any combination is possible:
regular expression    match
/[ab]*/               zero or more “a”s or “b”s
/[0-9][0-9]*/         any integer (= a non-empty string of digits)
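The difference between the three quantifiers is easiest to see on whole-string matches. A short sketch with Python's `re` module follows; the sample words and the sample sentence are made up for illustration.

```python
import re

# "?" makes the preceding character optional.
optional_u = [w for w in ["color", "colour", "colouur"]
              if re.fullmatch(r"colou?r", w)]

# "*" allows zero occurrences, so /a*/ also matches the empty string ...
star_matches_empty = re.fullmatch(r"a*", "") is not None
# ... while /aa*/ (or /a+/) needs at least one "a".
aa_star_matches_empty = re.fullmatch(r"aa*", "") is not None
plus_matches_empty = re.fullmatch(r"a+", "") is not None

# "+" on a digit class finds maximal runs of digits.
numbers = re.findall(r"[0-9]+", "room 12, floor 3")

print(optional_u, star_matches_empty, aa_star_matches_empty, numbers)
```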

regular expressions (Jurafsky, section 2.1)
The “.” is a very special character, the so-called wildcard.
regular expression    match                                        sample pattern
/b.ll/                any single character between b and ll        ball, bell, bull, bill
Will the search find “Bill”?

regular expressions (Jurafsky, section 2.1)
Anchors (start of line: “^”, end of line: “$”)
regular expression    match                                      sample pattern
/^Linguistics/        “Linguistics” at the beginning of a line   “Linguistics is fun.”
/linguistics\.$/      “linguistics.” at the end of a line        “We like linguistics.”
Anchors (word boundary: “\b”, non-boundary: “\B”)
regular expression    match                          sample pattern
/\bthe\b/             “the” alone                    “This is the place.”
/\Bthe\B/             “the” inside another word      “This is my mother.”
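The wildcard and the anchors can be checked with Python's `re` module, used here as a stand-in for the Perl-style notation. The first line also answers the question about “Bill”: because matching is case sensitive, the lowercase pattern does not find it.

```python
import re

# "." matches exactly one arbitrary character; "Bill" is not found (capital B).
wildcard = re.findall(r"b.ll", "ball bell bull bill Bill")

# "^" and "$" anchor the match to the start and end of the line.
starts = re.search(r"^Linguistics", "Linguistics is fun.") is not None
ends = re.search(r"linguistics\.$", "We like linguistics.") is not None

# "\b" requires a word boundary, "\B" forbids one.
alone = re.findall(r"\bthe\b", "This is the place.")
alone_in_mother = re.findall(r"\bthe\b", "This is my mother.")
inside = re.findall(r"\Bthe\B", "This is my mother.")

print(wildcard, starts, ends, alone, alone_in_mother, inside)
```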

regular expressions (Jurafsky, section 2.1)
More on alternative characters: the pipe symbol “|” (disjunction)
regular expression    match                    sample pattern
/colou?r/             colour or color          “beautiful colour”
/progra(m|mme)/       program or programme     “linguistics program”

regular expressions (Jurafsky, section 2.1)
What does the following expression match?
/student [0-9]+ */
Will it match “student 1 student 2 student 3”?
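One way to explore this question is to run the pattern. In the sketch below (Python's `re` module), the pattern finds each "student N" item separately, but it cannot cover the whole string as a single match. The grouped variant with parentheses and an outer star is an addition for illustration and does not appear on the slide.

```python
import re

text = "student 1 student 2 student 3"

# The pattern matches one "student <number>" item plus trailing spaces.
single_items = re.findall(r"student [0-9]+ *", text)

# As a single match it does not cover the whole string ...
whole_ungrouped = re.fullmatch(r"student [0-9]+ *", text) is not None

# ... unless the item is grouped and the group itself is repeated.
whole_grouped = re.fullmatch(r"(student [0-9]+ *)*", text) is not None

print(single_items, whole_ungrouped, whole_grouped)
```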

regular expressions (Jurafsky, section 2.1)
Perl expressions are also used for string substitution (used in ELIZA):
s/man/men/    man -> men
Perl expressions are also used for string repetition via memory (the number operator):
s/(linguistics)/wonderful \1/    linguistics -> wonderful linguistics
ELIZA:
s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* YOU ARE (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1?/
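The substitution and memory mechanisms carry over to Python's `re.sub`, where `\1` also refers to the first parenthesised group. The two-step `eliza_reply` function is a simplified sketch made up for this example: it uppercases the input, rewrites "I am" / "I'm" to "YOU ARE", and then applies one response rule with word boundaries instead of the literal spaces used in the Perl patterns.

```python
import re

# Plain substitution and substitution with memory (\1 = first group).
plural = re.sub(r"man", "men", "man")
praised = re.sub(r"(linguistics)", r"wonderful \1", "linguistics")


def eliza_reply(utterance):
    """Very small ELIZA-style sketch: one rewrite step, one response rule."""
    text = utterance.upper()
    # Step 1: turn the speaker's first person into second person.
    text = re.sub(r"\bI'M\b|\bI AM\b", "YOU ARE", text)
    # Step 2: respond, reusing the matched adjective via \1.
    return re.sub(r".*\bYOU ARE (DEPRESSED|SAD)\b.*",
                  r"I AM SORRY TO HEAR YOU ARE \1", text)


print(eliza_reply("Well, I am sad today."))
```

Input that does not trigger the rule is returned unchanged apart from the uppercasing; a fuller ELIZA would have a ranked list of such rules.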


Finite State Automata (FSA)
The regular expression is more than just a convenient metalanguage for text searching.
• First, a regular expression is one way of describing a finite-state automaton (FSA). Finite-state automata are the theoretical foundation of a good deal of the computational work we will look at in this lecture. Any regular expression can be implemented as a finite-state automaton*. Symmetrically, any finite-state automaton can be described with a regular expression.
• Second, a regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. The relation among these three theoretical constructions is sketched out in the following figure.
* Except regular expressions that use the memory feature – more on that later

Finite State Automata (FSA)
[Figure: regular expressions, finite automata and regular languages, each linked to the other two]
The relationship between finite-state automata, regular expressions, and regular languages*
* as suggested by Martin Kay in: Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the ACL (EACL-87), Copenhagen, Denmark, pp. 2-10. ACL.

Finite State Automata (FSA)
Examples:
• Introduction to finite-state automata for regular expressions
• Mapping from regular expressions to automata

Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
"After a while, with the parrot‘s help, the Doctor got to learn the language of the animals so well that he could talk to them himself and understand everything they said."
Hugh Lofting, The Story of Doctor Dolittle

Finite State Automata (FSA)
Using an FSA to recognize sheeptalk
Sheep language can be defined as any string from the following (infinite) set:
baa! baaaa! baaaaaa! ...

Finite State Automata (FSA)
baa! baaaa! baaaaaa! ...
The regular expression for this kind of sheeptalk is /baa+!/
All regular expressions can be represented as finite-state automata (FSA):

Finite State Automata (FSA)
[Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q3; q0 is the start state, q4 the final / accepting state]
A finite-state automaton (FSA) for the regular expression /baa+!/

Finite State Automata (FSA)
[Figure: a tape with cells ... a b a ! b ..., with the machine in state q0 at the first cell]
Example of a non-final state = rejection of the input

Finite State Automata (FSA)
         Input
State    b    a    !
0        1    ∅    ∅
1        ∅    2    ∅
2        ∅    3    ∅
3        ∅    3    4
4:       ∅    ∅    ∅
(∅ = null, i.e. no transition; the colon marks the final state)
The state-transition table for the previous FSA

Finite State Automata (FSA)
An algorithm for deterministic recognition of FSAs

function D-RECOGNIZE(tape, machine) returns accept or reject
  index ← Beginning of tape
  current-state ← Initial state of machine
  loop
    if End of input has been reached then
      if current-state is an accept state then
        return accept
      else
        return reject
    elsif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state ← transition-table[current-state, tape[index]]
      index ← index + 1
  end
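The D-RECOGNIZE pseudocode can be transcribed into a few lines of Python. In this sketch the state-transition table of the sheeptalk automaton for /baa+!/ is stored as a dictionary in which missing keys play the role of empty table cells; the names `TRANSITIONS`, `ACCEPT_STATES` and `d_recognize` are choices made here.

```python
# Transition table for /baa+!/: states 0..4, state 4 is the accept state.
# A missing (state, symbol) key means "no transition" (an empty cell).
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}
ACCEPT_STATES = {4}


def d_recognize(tape, transitions=TRANSITIONS,
                accept_states=ACCEPT_STATES, start=0):
    """Deterministic recognition: follow one transition per input symbol."""
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:
            return False  # empty table cell: reject
        state = transitions[(state, symbol)]
    # End of input: accept only in an accept state.
    return state in accept_states


print(d_recognize("baaa!"), d_recognize("ba!"))
```

Iterating over the string replaces the explicit index variable of the pseudocode; the accept/reject logic is otherwise the same.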

Finite State Automata (FSA)
states:  q0 q1 q2 q3 q3 q4
tape:    ... b a a a ! ...
Tracing the execution of the FSA on some sheeptalk

Finite State Automata (FSA)
Regular expressions can be represented as FSAs: adding a fail state
[Figure: the sheeptalk FSA q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4 (with an a-loop on q3), extended with a fail state qf that is reached from every state on any input symbol without a regular transition]

Finite State Automata (FSA)
[Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an a-loop on q2, so that in q2 the input a can either stay in q2 or move on to q3]
A non-deterministic finite-state automaton for talking sheep

Finite State Automata (FSA)
[Figure: q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, with an ε-transition from q3 back to q2]
A non-deterministic finite-state automaton (NFSA) for the sheep language, with an ε-transition
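A non-deterministic automaton with an ε-transition can be recognised by tracking the set of all states the machine could be in. This is a sketch of that state-set simulation, not the agenda-based search usually shown for non-deterministic recognition. It assumes the ε-arc leads from q3 back to q2, and it encodes ε as the empty string.

```python
EPSILON = ""  # marker for transitions that consume no input

# Assumed NFSA for the sheep language: 0 -b-> 1 -a-> 2 -a-> 3 -!-> 4, 3 -ε-> 2.
NFSA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {3},
    (3, "!"): {4},
    (3, EPSILON): {2},
}


def epsilon_closure(states, transitions):
    """All states reachable from `states` using only ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in transitions.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def nd_recognize(tape, transitions=NFSA, accept_states=frozenset({4}), start=0):
    """Accept if at least one path through the automaton ends in an accept state."""
    current = epsilon_closure({start}, transitions)
    for symbol in tape:
        moved = set()
        for state in current:
            moved |= transitions.get((state, symbol), set())
        current = epsilon_closure(moved, transitions)
        if not current:
            return False  # no path can consume this symbol
    return bool(current & accept_states)


print(nd_recognize("baaa!"), nd_recognize("ba!"))
```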