


Automatic term extraction from domain corpora
Piek Vossen, Irion Technologies/Vrije Universiteit Amsterdam
Guest lecture (Gastcollege), Corpus-based Methods, Universiteit Nijmegen, 26 November 2007

Overview
  • Corpus versus Domain-based text collections
  • Customer-case
  • Term-extraction
  • Demo

Corpus versus Domain-based text collections
  • Corpus to study linguistic phenomena:
      • INL corpus: NRC-handelsblad
      • Corpus geschreven Nederlands
      • British National Corpus
      • Brown corpus -> SemCor
  • Domain corpora:
      • portals
      • Wikipedia
  • Customer corpora:
      • web sites
      • manuals

Customer-case
  • Connect suppliers and buyers and create traffic and advertisement
  • B2B: companies with specialized products and services
      • terminology driven
      • branch driven
  • C2B: consumers looking for products and services
      • general language
      • terminology: -> folksonomy
      • bottom-up

  • Product name in ontology of 150,000 products: "kleppen, vlinder, pomp, hoge druk" (valves, butterfly, pump, high pressure)
  • Product name on company website: "Wij zijn gespecialiseerd in: pompen en pomponderdelen zoals kleppen" (We are specialized in: pumps and components such as valves)
  • User query searching for products or services: "vlinderkleppen voor een hoge drukpomp" (butterfly valves for high pressure pumps)
  • Subscription for product names
  • Companies in database
  • 1.5 million websites


Term-extraction
  • morpho-syntactic analysis
  • statistical analysis
  • conceptual analysis
  • contextual analysis

Term-extraction: morpho-syntactic analysis
  • Tokenization, tagging and NP-chunking: “een gele kaart voor de vleugelaanvaller” (a yellow card for the wing-player)
  • Term candidates:
      • Syntactic head of NPs: kaart (card); vleugelaanvaller (wing-player).
      • Word combinations including syntactic head: gele kaart (yellow card); kaart voor vleugelaanvaller (card for wing-player).
      • Head of compounds: aanvaller (attacker-player).
  • Term is a concept:
      • Normalized form (plural-singular variants, synonyms)
      • Hypernym based on the syntactic head
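The candidate types listed on this slide can be sketched in a few lines of Python. This is only an illustration, not the system shown in the lecture: it assumes the text has already been tagged and chunked, that the last noun of an NP chunk is its head, and that compound heads can be found by suffix lookup in a noun list. The function names, the coarse tag names (`NOUN`, `ADJ`, `DET`) and the `known_nouns` lexicon are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, POS tag); the coarse tag names are an assumption


def head_of(np_chunk: List[Token]) -> Optional[str]:
    """Return the last noun of the chunk, taken here as the syntactic head."""
    nouns = [word for word, tag in np_chunk if tag == "NOUN"]
    return nouns[-1] if nouns else None


def np_candidates(np_chunk: List[Token]) -> List[str]:
    """Head of the NP, plus the adjective(s)+head combination if there is one."""
    head = head_of(np_chunk)
    if head is None:
        return []
    candidates = [head]
    adjectives = [word for word, tag in np_chunk if tag == "ADJ"]
    if adjectives:
        candidates.append(" ".join(adjectives + [head]))
    return candidates


def pp_candidate(left: List[Token], preposition: str,
                 right: List[Token]) -> Optional[str]:
    """Combine the heads of two NPs joined by a preposition."""
    left_head, right_head = head_of(left), head_of(right)
    if left_head is None or right_head is None:
        return None
    return f"{left_head} {preposition} {right_head}"


def compound_head(word: str, known_nouns: Set[str]) -> Optional[str]:
    """Longest proper suffix of `word` found in a (hypothetical) noun lexicon."""
    for start in range(1, len(word)):
        if word[start:] in known_nouns:
            return word[start:]
    return None


np1 = [("een", "DET"), ("gele", "ADJ"), ("kaart", "NOUN")]
np2 = [("de", "DET"), ("vleugelaanvaller", "NOUN")]
print(np_candidates(np1))                               # ['kaart', 'gele kaart']
print(pp_candidate(np1, "voor", np2))                   # kaart voor vleugelaanvaller
print(compound_head("vleugelaanvaller", {"aanvaller"}))  # aanvaller
```

Normalization to a preferred form and hypernym assignment are left out of the sketch; they would need a lemmatizer and a wordnet-like resource.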


Term-extraction: statistical analysis
  • Reference corpus based on 500 websites of a diverse range of companies
  • Salience = normFreq * normRef
  • normFreq = normalized frequency of terms on the website
      • normFreq = nTermFrequency nWords / nPages
  • normRef = normalized number of websites on which the term occurs in the reference corpus
      • multiwords: normRef = 1 - ((nWebsites nWords) / (referenceCorpusSize))
      • singlewords: normRef = 1 - ((nWebsites) / (referenceCorpusSize))
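As a rough illustration of the salience formula, the sketch below computes it for one term. One point is an assumption and not taken from the slide: the slide text places nTermFrequency next to nWords (and nWebsites next to nWords) without a visible operator, and the code reads this as multiplication. Under that reading the single-word formula is simply the case nWords = 1. The function and parameter names are illustrative.

```python
def norm_freq(n_term_frequency: int, n_words: int, n_pages: int) -> float:
    """Normalized frequency of a term on the website.

    ASSUMPTION: nTermFrequency and nWords are multiplied; the operator
    between them is not recoverable from the slide text.
    """
    return n_term_frequency * n_words / n_pages


def norm_ref(n_websites: int, reference_corpus_size: int, n_words: int = 1) -> float:
    """1 minus the share of reference websites on which the term occurs.

    For single words n_words is 1; for multiwords nWebsites is scaled by
    nWords (again ASSUMING multiplication).
    """
    return 1 - (n_websites * n_words) / reference_corpus_size


def salience(n_term_frequency: int, n_words: int, n_pages: int,
             n_websites: int, reference_corpus_size: int = 500) -> float:
    """Salience = normFreq * normRef, with the 500-website reference corpus as default."""
    return (norm_freq(n_term_frequency, n_words, n_pages)
            * norm_ref(n_websites, reference_corpus_size, n_words))


# A single-word term seen 6 times on 2 pages, found on 50 of the 500 reference sites.
print(salience(n_term_frequency=6, n_words=1, n_pages=2, n_websites=50))
```

The effect is the familiar one: a term scores high when it is frequent on the customer site and rare across the reference websites.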

Preferred term           nTokens  nPages  Salience
Klicken (click)                9       9    0.0010
Weitere (further)              5       2    0.0011
Wahl (choose)                  1       1    0.0011
Verkauf (sell)                 1       1    0.0011
Service (service)             37       3    0.0011
Radio (radio)                  1       1    0.0011
Promotionen (promote)          1       1    0.0011
Optionen (options)             6       2    0.0011
Netzwerk (network)             4       2    0.0011
Medias (media)                 1       1    0.0011
Kauf (buy)                     1       1    0.0011
Html                           1       1    0.0011
Gewerbe (commercial)           1       1    0.0011
Fax                           16      16    0.0011
Büro (office)                  1       1    0.0011

Term-extraction: conceptual analysis
  • Structural properties of the term hierarchy
  • Poor hierarchies:
      • many tops
      • few levels
      • diverse branches
  • Each branch is a concept:
      • number of descendants and levels
      • cumulated frequency of descendants
  • Branch profiling:
      • Domain classification of the hierarchy
      • Domain classification of each branch
      • Minimal overlap in domain
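The structural properties of a branch (number of descendants, number of levels, cumulated frequency of descendants) could be computed along these lines. The sketch assumes the term hierarchy is stored as a parent-to-children dictionary plus a frequency table; both data structures, the function name and the example terms other than kaart and gele kaart are made up for illustration.

```python
from typing import Dict, List, Tuple


def branch_stats(term: str,
                 children: Dict[str, List[str]],
                 frequency: Dict[str, int]) -> Tuple[int, int, int]:
    """Return (n_descendants, n_levels, cumulated_frequency_of_descendants)
    for the branch rooted at `term`. The root's own frequency is not counted."""
    n_descendants = 0
    n_levels = 0
    cumulated = 0
    for child in children.get(term, []):
        sub_desc, sub_levels, sub_freq = branch_stats(child, children, frequency)
        n_descendants += 1 + sub_desc
        n_levels = max(n_levels, 1 + sub_levels)
        cumulated += frequency.get(child, 0) + sub_freq
    return n_descendants, n_levels, cumulated


# Hypothetical mini-hierarchy built on the head "kaart" (card).
children = {
    "kaart": ["gele kaart", "rode kaart"],
    "gele kaart": ["tweede gele kaart"],
}
frequency = {"kaart": 5, "gele kaart": 3, "rode kaart": 2, "tweede gele kaart": 1}

print(branch_stats("kaart", children, frequency))  # (3, 2, 6)
```

A branch with many descendants, several levels and a high cumulated frequency is a stronger concept candidate than an isolated top node, which is the intuition behind the "poor hierarchies" bullet.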


Wordnet: Domain information
[Figure: words from the vocabularies of languages (violin, violist, string senses 1 and 2) point to concept records; the concepts are linked by relations and assigned to domains.]
  • Concepts:
      • rec: 12345 - financial institute
      • rec: 54321 - river side bank
      • rec: 9876 - small string instrument
      • rec: 65438 - musician playing a violin
      • rec: 42654 - musician
      • rec: 35576 - string of an instrument
      • rec: 29551 - underwear
      • rec: 25876 - string instrument
  • Relations: type-of, part-of
  • Domains: Clothing, Ball, Winter sports, Culture, Sport, Finance, Music

Term-extraction: contextual analysis
  • Anything can be a product or service: there are no intrinsic properties to define products
  • Contextual features:
      • context patterns for products
      • product pages
      • special marking in HTML

Term-extraction: contextual analysis
  • Context patterns for products: 144 patterns in English and 288 patterns in German
      • [we supply]
      • [we deliver]
      • [we provide]
      • [our products are]
      • [we are one of the leading, producers on the market for]
      • [we are, leading, producers on the market for]
      • [is one of the leading, producers on the market for]
      • [is, leading, producer on the market for]
      • [we develop, products for]
      • [we design, products for]
      • [we produce, products for]
      • [Our most common products]
  • Each term is scored for a product context in terms of the strength of the pattern and the distance
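The slide says that a term is scored by pattern strength and distance but does not give the scoring function. The sketch below is one plausible reading and nothing more: it assumes each pattern carries a numeric strength and that the score decays with the number of tokens between the end of the pattern and the term, as strength / (1 + distance). The strength values, the decay formula and the function name are all hypothetical; only the pattern strings come from the slide.

```python
# Pattern strings from the slide; the strength values are invented for illustration.
PATTERNS = {
    "we supply": 1.0,
    "we deliver": 1.0,
    "we provide": 1.0,
    "our products are": 0.8,
}


def product_context_score(text: str, term: str) -> float:
    """Best strength / (1 + token distance) over all patterns that end
    before the first occurrence of `term`; 0.0 if none applies."""
    lowered = text.lower()
    term_pos = lowered.find(term.lower())
    if term_pos < 0:
        return 0.0
    best = 0.0
    for pattern, strength in PATTERNS.items():
        start = lowered.find(pattern)
        while start >= 0:
            end = start + len(pattern)
            if end <= term_pos:
                # Distance = number of tokens between pattern and term.
                distance = len(lowered[end:term_pos].split())
                best = max(best, strength / (1 + distance))
            start = lowered.find(pattern, start + 1)
    return best


sentence = "We supply butterfly valves and high pressure pumps."
print(product_context_score(sentence, "butterfly valves"))  # 1.0
print(product_context_score(sentence, "pumps"))             # lower: 5 tokens away
```

Only the first occurrence of the term is considered here, and the multi-part patterns with commas (such as [we develop, products for]) are not handled; both would need extending for real use.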

Term-extraction: contextual analysis
  • Product pages:
      • landing page: index.html
      • files with product names: product, service, solution
      • html files referred to by these pages
      • html files referred to by menus with such names
  • Special marking in HTML:
      • meta keywords
      • headings and titles
      • menus
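Collecting the specially marked text can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the extractor used in the lecture: it gathers meta keywords plus the text of titles and headings, and it leaves menus out because recognizing them depends on site-specific markup. The class name, the set of heading tags and the sample page are assumptions.

```python
from html.parser import HTMLParser


class MarkedTextCollector(HTMLParser):
    """Collect meta keywords and the text of <title>/<h1>-<h3> elements."""

    HEADING_TAGS = {"title", "h1", "h2", "h3"}  # choice of tags is an assumption

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.headings = []
        self._open_tag = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            self.keywords += [k.strip() for k in content.split(",") if k.strip()]
        elif tag in self.HEADING_TAGS:
            self._open_tag = tag

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            self._open_tag = None

    def handle_data(self, data):
        if self._open_tag and data.strip():
            self.headings.append(data.strip())


# Made-up sample page.
page = (
    '<html><head><title>Pompen BV</title>'
    '<meta name="keywords" content="pompen, kleppen"></head>'
    '<body><h1>Producten</h1><p>Wij leveren pompen.</p></body></html>'
)
collector = MarkedTextCollector()
collector.feed(page)
print(collector.keywords)  # ['pompen', 'kleppen']
print(collector.headings)  # ['Pompen BV', 'Producten']
```

Terms found in these positions can then be flagged with their source (meta, heading, title), which is the kind of feature the evaluation slides break down by "Term Source".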

Product terms with feature bundles

<class>
  <name><![CDATA[arabica-kaffee gemahlene]]></name>
  <id>48</id>
  <pos>1</pos>
  <preferred_form><![CDATA[Arabica-Kaffee Gemahlener]]></preferred_form>
  <parent_form><![CDATA[Gemahlener]]></parent_form>
  <documents>1</documents>
  <frequency>1</frequency>
  <salience>0.0523</salience>
  <connectivity>10</connectivity>
  <modifiers>
    arabica kaffee arabica-kaffee -1 1 1 1 RIGHT kaffee 1.0
  </modifiers>
</class>

Evaluation of French product extraction
Nr. of URLs                                        29
Nr. of evaluated URLs                              27
Total good terms                                   95
Total bad terms                                    53
Total new terms                                    54
Total terms                                       202
Average nTerms per URL                              6
Total precision, % (good/(good+bad))               64
Average precision, % (nPrecision/Evaluated Urls)   68
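The two precision figures differ in how they aggregate: total precision pools all judged terms, while average precision first computes a precision per URL and then averages over the evaluated URLs. A small sketch of both follows. The pooled counts (95 good, 53 bad) come from the slide; the per-URL counts in the example are invented, because the slide does not list them, and skipping URLs without any judged term is an assumption.

```python
from typing import List, Tuple


def precision(good: int, bad: int) -> float:
    """Precision in percent: good / (good + bad)."""
    total = good + bad
    return 100.0 * good / total if total else 0.0


def average_precision(per_url: List[Tuple[int, int]]) -> float:
    """Mean of the per-URL precisions over URLs with at least one judged term."""
    values = [precision(good, bad) for good, bad in per_url if good + bad > 0]
    return sum(values) / len(values) if values else 0.0


# Pooled counts from the slide: 95 good and 53 bad terms.
print(int(precision(95, 53)))  # 64

# Hypothetical per-URL (good, bad) counts; the URL with (0, 0) is skipped.
print(average_precision([(1, 1), (3, 1), (0, 0)]))  # 62.5
```

Average precision can exceed total precision, as it does on the slide (68 versus 64), when URLs with few terms tend to have cleaner results than URLs with many terms.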

Evaluation of French product extraction
Feature          Bin        Good  Bad  New  Precision (%)
nTokens          low(2)       67   35   18             65
                 med(5)       12    7   19             63
                 high(>5)     16   11   17             59
getNrDocs        low(2)       71   38   31             65
                 med(5)       13    5   11             72
                 high(>5)     11   10   12             52
nSalience        low(0.05)     0    0
                 med(0.1)      0    0
                 high(0.5)     4    3    3             57
                 top(>0.5)    91   50   51             64
nSiblings        low(2)       84   51   48             62
                 med(5)       11    2    6             84
                 high(>5)      0    0
nCumFreqParent   low(2)       68   43   21             61
                 med(5)       19    5   11             79
                 high(>5)      8    5   22             61

Evaluation of French product extraction
Term Source    Good  Bad  New  Precision (%)
meta             64   29   11             68
product          26   21   38             55
service           5    3    5             62
solution          0    0
index             0    0
other             0    0

ProfileMatch   Good  Bad  New  Precision (%)
-1               12    6   15             66
0                67   40   30             62
low(0.1)          0    0
med(0.5)          2    7    9             22
high(>0.5)        5    0    0            100
top(>0.7)         9    0    0            100