
Language and Information
Handout #5
November 30, 2000
Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 305A, West Hall
• Phone: (734) 615-5225
• Office hours: TTh 3-4
• Course page: http://www.si.umich.edu/~radev/760
• Class meets on Thursdays, 5-8 PM in 311 West Hall
Clustering (Cont’d)
Using similarity in visualization
• Dendrograms (see Figure 14.1 of M&S, page 496)
Types of clustering
• Hierarchical: agglomerative, divisive
• Soft & hard
• Similarity functions (see the sketch after this list):
  – Single link: most similar members
  – Complete link: least similar members
  – Group-average: average similarity
• Applications:
  – improving language models
  – etc.
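As a concrete illustration of the three similarity functions, here is a minimal sketch using SciPy's agglomerative clustering; the word labels and vectors are invented toy data, not material from the lecture. The resulting tree is what a dendrogram (as in M&S Figure 14.1) visualizes.

```python
# Minimal sketch: hierarchical agglomerative clustering under the three
# linkage criteria named above. The five "word" vectors are toy data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

words = ["stock", "bond", "share", "run", "walk"]
vectors = np.array([
    [0.90, 0.10, 0.05],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.80],
    [0.05, 0.80, 0.90],
])

for method in ("single", "complete", "average"):
    # "average" is SciPy's name for group-average linkage
    Z = linkage(vectors, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree at 2 clusters
    print(method, dict(zip(words, labels)))

# dendrogram(Z) draws the cluster tree for visual inspection
```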
HITS-type algorithms
Hyperlinks and resource communities
• Jon Kleinberg (Almaden, Cornell)
• authoritative sources
• www.harvard.edu --> Harvard
• conferring authority via links
• global properties of authority pages
Hubs and authorities
[Diagram: hub pages link to authority pages; unrelated pages stand apart from the structure.]
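The slides name hubs and authorities but not the update rule; what follows is a standard sketch of Kleinberg's HITS iteration on a made-up link graph (the adjacency matrix is illustrative, not from the lecture).

```python
# Sketch of the HITS mutual-reinforcement iteration on a toy link graph.
# A[i, j] = 1 means page i links to page j; the graph is invented.
import numpy as np

A = np.array([
    [0, 1, 1, 0],   # page 0 links to pages 1 and 2 (a hub)
    [0, 0, 1, 0],
    [0, 0, 0, 0],   # page 2 is only linked to (an authority)
    [0, 1, 1, 0],   # page 3 is another hub
], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):
    auths = A.T @ hubs              # good authorities are pointed to by good hubs
    hubs = A @ auths                # good hubs point to good authorities
    auths /= np.linalg.norm(auths)  # normalize so the scores converge
    hubs /= np.linalg.norm(hubs)

print("authority scores:", auths.round(3))
print("hub scores:      ", hubs.round(3))
```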
Authorities
• Java: www.gamelan.com, java.sun.com, sunsite.unc.edu/javafaq.html
• Censorship: www.eff.org, www.eff.org/blueribbon.html, www.aclu.org
• Search engines: www.yahoo.com, www.excite.com, www.mckinley.com, www.lycos.com
Related pages
• www.honda.com: www.ford.com, www.toyota.com, www.yahoo.com
• www.nyse.com: www.amex.com, www.liffe.com, update.wsj.com
Collocations
Collocations
• Idioms
• Free word combinations
• Know a word by the company that it keeps (Firth)
• Common use
• No general syntactic or semantic rules
• Important for non-native speakers
Examples

Idioms               Collocations            Free-word combinations
to kick the bucket   to trade actively       to take the bus
dead end             table of contents       the end of the road
to catch up          orthogonal projection   to buy a house
Uses
• Disambiguation (e.g., “bank” disambiguated by its collocates “loan” vs. “river”)
• Translation
• Generation
Properties
• Arbitrariness
• Language- and dialect-specific
• Common in technical language
• Recurrent in context (see Smadja 83)
Arbitrariness
• Make an effort vs. *make an exertion
• Running commentary vs. *running discussion
• Commit treason vs. *commit treachery
Cross-lingual properties
• Régler la circulation = direct traffic
• Russian, German, Serbo-Croatian: direct translation is used
• AE: set the table, make a decision
• BE: lay the table, take a decision
• “semer le désarroi” - “to sow disarray” - “to wreak havoc”
Types of collocations
• Grammatical: come to, put on; afraid that, fond of, by accident, witness to
• Semantic (only certain synonyms)
• Flexible: find/discover/notice by chance
Base/collocator pairs

Base        Collocator    Example
noun        verb          set the table
noun        adjective     warm greetings
verb        adverb        struggle desperately
adjective   adverb        sound asleep
verb        preposition   put on
Extracting collocations
• Mutual information:

  I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}

• What if I(x; y) = 0?
• What if I(x; y) < 0?
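A small worked example of the formula, with invented counts. The two questions on the slide have standard answers: I = 0 means x and y are statistically independent, and I < 0 means they co-occur less often than chance would predict.

```python
# Worked example of the mutual information formula; the counts are invented.
import math

N = 1_000_000            # total number of bigrams observed
c_xy = 150               # count of the pair (x, y)
c_x, c_y = 2_000, 3_000  # marginal counts of x and of y

p_xy = c_xy / N
p_x, p_y = c_x / N, c_y / N
pmi = math.log2(p_xy / (p_x * p_y))
print(f"I(x; y) = {pmi:.2f} bits")  # about 4.64 here

# I(x; y) = 0 when P(x, y) = P(x)P(y), i.e. x and y are independent;
# I(x; y) < 0 when the pair co-occurs less often than chance predicts.
```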
Yule’s coefficient
• A - frequency of lemma pairs involving both Li and Lj
• B - frequency of pairs involving Li only
• C - frequency of pairs involving Lj only
• D - frequency of pairs involving neither

  YUL = \frac{AD - BC}{AD + BC},  with  -1 <= YUL <= 1
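A minimal sketch of the coefficient as defined above, with made-up counts. Note that AD = BC (independence) gives YUL = 0, and the bounds hold because |AD - BC| <= AD + BC for non-negative counts.

```python
# Yule's coefficient from the four frequencies defined above; counts invented.
def yule(A, B, C, D):
    """(AD - BC) / (AD + BC); always between -1 and 1 for non-negative counts."""
    return (A * D - B * C) / (A * D + B * C)

print(yule(A=120, B=10, C=15, D=855))    # ~0.997: strongly associated pair
print(yule(A=30, B=300, C=100, D=1000))  # 0.0: AD = BC, independent pair
```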
Specific mutual information
• Used in extracting bilingual collocations:

  I(e, f) = \frac{p(e, f)}{p(e)\,p(f)}

• p(e, f) - probability of finding both e and f in aligned sentences
• p(e), p(f) - probabilities of finding the word in one of the languages
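A sketch of how the three probabilities might be estimated from sentence-aligned text; the aligned pairs and the candidate word pair below are toy examples, not Hansard data.

```python
# Sketch: estimating p(e, f), p(e), p(f) from sentence-aligned text.
# The aligned pairs are invented, not from the Hansard corpus.
aligned = [
    ("late spring budget talks", "le budget de la fin du printemps"),
    ("late spring is mild", "la fin du printemps est douce"),
    ("the budget passed", "le budget est adopté"),
]

def p_side(word, side):
    # fraction of sentence pairs whose given side contains the word
    return sum(word in pair[side].split() for pair in aligned) / len(aligned)

def p_joint(e, f):
    # fraction of pairs with e in the English half and f in the French half
    return sum(e in en.split() and f in fr.split() for en, fr in aligned) / len(aligned)

e, f = "spring", "printemps"
score = p_joint(e, f) / (p_side(e, 0) * p_side(f, 1))
print(score)  # 1.5 here; values above 1 suggest e and f are translation candidates
```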
Example from the Hansard corpus (Brown, Lai, and Mercer)
Flexible and rigid collocations
• Example (from Smadja): “free” and “trade”
Xtract (Smadja)
• The Dow Jones Industrial Average
• The NYSE’s composite index of all its listed common stocks fell *NUMBER* to *NUMBER*
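The second bullet is a rigid template with number slots. Below is a tiny sketch of the masking idea behind such templates; the corpus lines are invented, and this is not Smadja's actual algorithm, only an illustration of the principle.

```python
# Sketch of the number-masking idea behind rigid templates like the one above:
# replace numbers with *NUMBER* and count recurring sentence frames.
import re
from collections import Counter

lines = [  # invented examples in the style of financial newswire
    "The composite index fell 12.40 to 3261.18",
    "The composite index fell 5.10 to 3255.08",
    "The composite index rose 2.25 to 3257.33",
]
frames = Counter(re.sub(r"\d+(?:\.\d+)?", "*NUMBER*", line) for line in lines)
print(frames.most_common(1))
# [('The composite index fell *NUMBER* to *NUMBER*', 2)]
```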
Translating Collocations
• Brush up a lesson = repasser une leçon
• Bring about = осуществлять
• Hansards: late spring = fin du printemps; Atlantic Canada Opportunities Agency = Agence de promotion économique du Canada atlantique
The e.Sse.NSe system
[Diagram: offline processing groups the cached documents into clusters 1 … n, each with a cached summary (summary 1 … summary n); online processing matches a user query against these and returns a hitlist and a summary.]
Sample summary

Sentences with scores (score in parentheses, kept as printed where the source is unclear):
• The idea behind data mining then is the non-trivial process of identifying valid novel potentially useful and ultimately understandable patterns in data (18 2)
• The term knowledge discovery in databases KDD was formalized in 1989 in reference to the general concept of being broad and 'high level' in the pursuit of seeking knowledge from data (494.92)
• The term data mining is then this high-level application techniques / tools used to present and analyze data for decision makers (509.11)
• This term data mining has been used by statisticians data analyst and the MIS management information systems community whereas KDD has been mostly used by artificial intelligence and machine learning researchers (487.92)
• These are : - the untapped value in large databases consolidation of database records tending towards a single customer view concept of an information or data warehouse from the consolidation of databases dramatic drop in the cost/performance ratio of hardware systems - for data storage and processing (576.60)
• Intense competition in an increasing saturated marketplace the ability to custom manufacture market and advertise to small market segments and individuals 4 and the market for data mining products is estimated at about 500 million in early 1994 (12)
• Data mining technologies are characterized by intensive computations on large volumes of data (486.92)
• Data mining versus traditional database queries Traditional database queries contrasts with data mining since these are typified by the simple question such as what were the sales of orange juice in January 1995 for the Boston area (520.53)
• Data mining on the other hand through the use of specific algorithms or search engines attempts to source out discernable patterns and trends in the data and infers rules from these patterns (500.80)
Cross-language information access
[Diagram sequence, built up across six slides: cross-language information access configurations linking English and Chinese queries (QE, QC), documents (DE, DC), summaries (SE, SC), and the components labeled CE and CC; the final slide adds Chinese summaries translated into English (SC->E).]
Objectives
• Produce summaries using multiple algorithms
• Evaluate summarization and translation separately
• Intrinsic and extrinsic language-independent evaluation metrics
• Establish correlation between evaluation metrics
• Build a parallel Chinese-English document+summary corpus: 9K docs (Hong Kong news)
Participants
• Full time:
  – K.-L. Kwok, Queens College
  – Dragomir Radev, U. Michigan
  – Wai Lam, Chinese University of HK
  – Simone Teufel, Columbia
• “Consultants”:
  – Chin-Yew Lin, ISI
  – Tomek Strzalkowski, Albany
  – Jade Goldstein, CMU
  – Jian-Yun Nie, U. Montréal
• Supporters:
  – TIDES roadmap group: Ed Hovy, Daniel Marcu, Kathy McKeown
Techniques and parameters
• Summarization: position, TF*IDF, centroids, longest common subsequence, keywords (a scoring sketch follows below)
• Evaluation:
  – intrinsic: percent agreement, relative utility, precision/recall
  – extrinsic: document rank, question answering
• Length of documents/summaries
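Of the listed techniques, TF*IDF sentence scoring is the easiest to sketch. The version below is deliberately simplified (real systems tokenize more carefully and compute IDF over a large background corpus rather than over a handful of sentences); the input sentences are invented.

```python
# Simplified sketch of TF*IDF sentence scoring for extractive summarization:
# score each sentence by the average TF*IDF weight of its words.
# With few sentences, rare words dominate the IDF term; real systems
# use a large background corpus for IDF. Input is invented.
import math
from collections import Counter

def tfidf_scores(sentences):
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # sentences containing w
    return [sum(Counter(d)[w] * math.log(n / df[w]) for w in set(d)) / len(d)
            for d in docs]

sents = [
    "Data mining finds useful patterns in large volumes of data.",
    "The market for data mining products grew quickly.",
    "Data mining infers rules from patterns in the data.",
]
# rank sentences by score; a summarizer would keep the top k
for sc, s in sorted(zip(tfidf_scores(sents), sents), reverse=True):
    print(round(sc, 2), s)
```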
The parallel corpus
• English and Chinese (Hong Kong News)
• Already there:
  – 9000 documents and their translations
  – list of 300 queries in English and their translations
• We will create before the workshop:
  – document relevance judgements
    • 50 queries, 5 hrs/query, $10/hr -> $2,500
  – sentence relevance judgements
    • 4 doc/hr, need 4000 rel. judgements -> $10,000
  – optional: manual abstracts
Creating the judgements
• For each query:
  – submit to IR engine
  – discard unless it has 5-20 hits
  – get exhaustive document relevance judgements
  – consider top 100 documents
• Get sentence relevance judgements for:
  – all relevant documents
  – top 50 documents (including irrelevant ones!)
Experiments
• Experiment 1 (validation): compare preservation of ranking with other measures: judgement overlap, relative utility (a rank-correlation sketch follows below)
• Experiments 2 & 3:
  – use preservation of ranking
  – only possible due to the new, parallel experimental design
  – factor out the effects of:
    • query translation
    • summarization
    • monolingual IR
• Baseline:
  – leading-sentence summaries vs. documents
  – other summarization methods vs. documents
  – (ideal: manual summaries vs. documents)
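Preservation of ranking can be measured by correlating the ranking an IR engine produces from full documents with the ranking it produces from their summaries. The sketch below uses Kendall's tau as one plausible statistic; the slides do not specify which correlation was intended, and the rank lists are invented.

```python
# Sketch of "preservation of ranking": correlate document ranks obtained from
# full texts with ranks obtained from summaries. Kendall's tau is one plausible
# statistic; the slides do not say which was used. Rank lists are invented.
from scipy.stats import kendalltau

ranks_full = [1, 2, 3, 4, 5]   # ranks of five documents, full-text retrieval
ranks_summ = [2, 1, 3, 5, 4]   # ranks of the same documents, summary retrieval

tau, p_value = kendalltau(ranks_full, ranks_summ)
print(f"tau = {tau:.2f}")      # 1.0 would mean the ranking is fully preserved
```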
Experiments
• Monolingual experiments:
  – Effect of summarization:
    • English query -> English doc (ranks)
    • English query -> English summary (ranks)
    • Chinese query -> Chinese doc (ranks)
    • Chinese query -> Chinese summary (ranks)
  – Baseline:
    • leading-sentence summary vs. document
    • ideal: manual summary vs. document
  – Effect of language on IR:
    • English query -> English doc
    • Chinese query -> Chinese doc
• Experiment 2: crosslingual
  – Effect of query translation:
    • English query -> English doc
    • English query -> Chinese doc
Timeline
• Pre-workshop: build corpus
• Sentence segmenter, Chinese tokenizer, machine translation, IR system, e.Sse.NSe summarizer
• Workshop: system integration, build toolkit, summarization, evaluation, correlation, system refinement, final evaluation
Workshop milestones (weeks W1-W6)
• Set up experimental testbed
• Evaluation plan laid out
• Selection of training/test sub-corpora
• Alpha version of CLIA system tested on a small number of queries
• Baseline experiment
• Run experiment one
• Run experiment two
• Compute query translation quality
• Run experiment three
• Feedback from first three experiments
• System improvements
• Improved CLIA system ready
• Evaluation using unseen test data
• Draft of final report
• Additional experiments
• Wrap-up
• Final version of CLIA system released
Merit criteria
• Novelty: never done before; integration of CLIR and summarization
• Collaboration: participants wouldn’t work together otherwise
• Scientific merit: much-needed evaluation metrics; techniques for multi-document, multilingual summarization; incorporate utility, redundancy, subsumption
• Feasibility: uses existing work, specific plan for new work
• Community building: corpora, evaluation techniques, and software (CLIR, evaluation, and summarization); builds on prior evaluations (TDT, TREC, SUMMAC, DUC)
• Funder interest: multilingual systems, large amounts of data
What was dropped
• Interactive clustering of documents
• Evaluation of the quality of translated summaries
• Document translation
• Effects of document genre, length
• Evolving summaries
More…
What we didn’t talk about
• Hidden Markov models
• Part-of-speech tagging
• Probabilistic parsing
• Information retrieval
• Text classification
• etc.
THE END?

http://perun.si.umich.edu/~radev/760/job/