Graph-based Analysis of Espresso-style Minimally-supervised Bootstrapping Algorithms Mamoru Komachi Nara Institute of Science and Technology Jan 15, 2010

Corpus-based extraction of semantic knowledge
[Figure: bootstrapping alternates step by step between instances and patterns extracted from the corpus. Columns: Input (instance), Extracted from corpus (pattern), Output (new instance). Entries shown: Singapore, Singapore visa, Singapore map; ___ visa, ___ history; Hong Kong, Australia, China, Egypt.]

Semantic drift is the central problem of bootstrapping
[Figure: columns Input (instance), Extracted from corpus (pattern), Output (new instance). Entries shown: Singapore, visa, Australia; ___ card; card, greeting, messages, words. Annotation: "Semantic category changed!"]
• Generic patterns: patterns co-occurring with many irrelevant instances
• Errors propagate to successive iterations

Main contributions of this work
1. Suggest a parallel between semantic drift in Espresso-style bootstrapping [Pantel and Pennacchiotti, 2006] and topic drift in HITS [Kleinberg, 1999]
2. Solve semantic drift using a “relatedness” measure (the regularized Laplacian) instead of the “importance” measure (HITS authority) used in the link analysis community

Table of contents
2. Overview of Bootstrapping Algorithms
3. Espresso-style Bootstrapping Algorithms
4. Graph-based Analysis of Espresso-style Bootstrapping Algorithms
5. Word Sense Disambiguation
6. Bilingual Dictionary Construction
7. Learning Semantic Categories

Espresso Algorithm [Pantel and Pennacchiotti, 2006]
• Repeat
  • Pattern extraction
  • Pattern ranking
  • Pattern selection
  • Instance extraction
  • Instance ranking
  • Instance selection
• Until a stopping criterion is met

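Pattern/instance ranking in Espresso
Score for pattern p: $r_\pi(p) = \frac{1}{|I|} \sum_{i \in I} \frac{\mathrm{pmi}(p, i)}{\max \mathrm{pmi}} \, r_\iota(i)$
Score for instance i: $r_\iota(i) = \frac{1}{|P|} \sum_{p \in P} \frac{\mathrm{pmi}(p, i)}{\max \mathrm{pmi}} \, r_\pi(p)$
p: pattern; i: instance; P: set of patterns; I: set of instances; pmi: pointwise mutual information; max pmi: maximum of pmi over all patterns and instances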

Espresso uses pattern-instance matrix A for ranking patterns and instances
A is a |P|×|I|-dimensional matrix holding the (normalized) pointwise mutual information (pmi) between patterns and instances. Rows are pattern indices 1, 2, ..., p, ..., |P|; columns are instance indices 1, 2, ..., i, ..., |I|.
$[A]_{p,i} = \mathrm{pmi}(p, i) / \max_{p,i} \mathrm{pmi}(p, i)$
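
The matrix definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `normalized_pmi_matrix`, the toy counts, and the choice to clip unseen pairs and negative PMI values to 0 are all assumptions that do not come from the slides.

```python
import numpy as np

def normalized_pmi_matrix(counts):
    """Return the |P| x |I| matrix A with A[p, i] = pmi(p, i) / max pmi.

    counts[p, i] is the number of times pattern p co-occurs with instance i.
    """
    counts = np.asarray(counts, dtype=float)
    joint = counts / counts.sum()                         # P(p, i)
    pattern_marginal = joint.sum(axis=1, keepdims=True)   # P(p)
    instance_marginal = joint.sum(axis=0, keepdims=True)  # P(i)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(joint / (pattern_marginal * instance_marginal))
        # Assumption (not from the slides): unseen pairs and negative PMI become 0.
        pmi = np.where(np.isfinite(pmi), np.maximum(pmi, 0.0), 0.0)
    max_pmi = pmi.max()
    return pmi / max_pmi if max_pmi > 0 else pmi

# Hypothetical counts: rows are patterns, columns are instances.
counts = [[4, 2, 0],
          [1, 0, 3]]
A = normalized_pmi_matrix(counts)
print(A)
```

With this normalization every entry of A lies in [0, 1], and the largest entry is exactly 1.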

Pattern/instance ranking in Espresso
Reliable instances are supported by reliable patterns, and vice versa
$\mathbf{p} \leftarrow \frac{1}{|I|} A \mathbf{i}$ ... pattern ranking
$\mathbf{i} \leftarrow \frac{1}{|P|} A^{T} \mathbf{p}$ ... instance ranking
p = pattern score vector; i = instance score vector; A = pattern-instance matrix; |P| = number of patterns; |I| = number of instances
1/|I| and 1/|P| are normalization factors that keep the score vectors from growing too large

Espresso Algorithm [Pantel and Pennacchiotti, 2006]
• Repeat
  • Pattern extraction
  • Pattern ranking
  • Pattern selection
  • Instance extraction
  • Instance ranking
  • Instance selection
• Until a stopping criterion is met
For graph-theoretic analysis, we will introduce 3 simplifications to Espresso

First simplification
Simplification 1: Remove the pattern/instance extraction steps. Instead, pre-compute all patterns and instances once at the beginning of the algorithm.
• Compute the pattern-instance matrix
• Repeat
  • Pattern extraction (removed)
  • Pattern ranking
  • Pattern selection
  • Instance extraction (removed)
  • Instance ranking
  • Instance selection
• Until a stopping criterion is met

Second simplification
Simplification 2: Remove the pattern/instance selection steps, which retain only the highest-scoring k patterns / m instances for the next iteration (i.e., reset the scores of all other items to 0). Instead, retain the scores of all patterns and instances.
• Compute the pattern-instance matrix
• Repeat
  • Pattern ranking
  • Pattern selection (removed)
  • Instance ranking
  • Instance selection (removed)
• Until a stopping criterion is met

Third simplification
Simplification 3: No early stopping, i.e., run until convergence.
• Compute the pattern-instance matrix
• Repeat
  • Pattern ranking
  • Instance ranking
• Until the score vectors p and i converge (replaces "Until a stopping criterion is met")

Simplified Espresso
Input
• Initial score vector of seed instances i
• Pattern-instance co-occurrence matrix A
Main loop
Repeat
  $\mathbf{p} \leftarrow \frac{1}{|I|} A \mathbf{i}$ ... pattern ranking
  $\mathbf{i} \leftarrow \frac{1}{|P|} A^{T} \mathbf{p}$ ... instance ranking
Until i and p converge
Output
Instance and pattern score vectors i and p
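
The main loop can be sketched as follows, assuming the usual Espresso-style updates p ← (1/|I|) A i and i ← (1/|P|) Aᵀ p. The function name, the toy matrix, and the extra rescaling of i to unit length (added here so that convergence can be tested on the direction of the vector) are assumptions for illustration only.

```python
import numpy as np

def simplified_espresso(A, seed, tol=1e-12, max_iter=10000):
    """Iterate pattern and instance ranking until the instance scores converge.

    A is the |P| x |I| pattern-instance matrix; seed is a non-zero initial
    instance score vector. Returns (instance_scores, pattern_scores).
    """
    A = np.asarray(A, dtype=float)
    n_patterns, n_instances = A.shape
    i = np.asarray(seed, dtype=float)
    i = i / np.linalg.norm(i)
    p = A @ i / n_instances
    for _ in range(max_iter):
        p = A @ i / n_instances        # pattern ranking
        new_i = A.T @ p / n_patterns   # instance ranking
        # Assumption: rescale to unit length so only the direction is compared.
        new_i = new_i / np.linalg.norm(new_i)
        converged = np.linalg.norm(new_i - i) < tol
        i = new_i
        if converged:
            break
    return i, p

# Hypothetical 3-pattern x 3-instance matrix.
A = np.array([[1.0, 0.8, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 0.6]])
scores_a, _ = simplified_espresso(A, [1, 0, 0])   # seed = instance 0
scores_b, _ = simplified_espresso(A, [0, 0, 1])   # seed = instance 2
print(np.argsort(-scores_a), np.argsort(-scores_b))
```

Running the loop from two different seeds on the same matrix is a quick way to see whether the final ranking depends on the seed at all.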

Simplified Espresso is HITS
Simplified Espresso = HITS in a bipartite graph whose adjacency matrix is A
Problem: as the iteration proceeds, the ranking vector i tends to the principal eigenvector of $A^{T}A$ regardless of the seed instances!
➔ No matter which seed you start with, the same instance is always ranked topmost
➔ Semantic drift (also called topic drift in HITS)
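
The seed-independence claim follows from a standard power-iteration argument. The derivation below is a sketch assuming the updates p ← (1/|I|) A i and i ← (1/|P|) Aᵀ p, and assuming that the largest eigenvalue of AᵀA is simple; the eigenpairs (λ_k, v_k) and the superscript notation i^(n) are introduced here for illustration.

```latex
% Substituting the pattern update into the instance update:
\mathbf{i}^{(n)} = \frac{1}{|P|\,|I|} A^{T} A \,\mathbf{i}^{(n-1)}
                 = \left(\frac{1}{|P|\,|I|}\right)^{n} (A^{T}A)^{n}\, \mathbf{i}^{(0)}

% Eigendecomposition of the symmetric matrix A^T A,
% with \lambda_1 > \lambda_2 \ge \dots \ge 0:
A^{T}A = \sum_{k} \lambda_k \mathbf{v}_k \mathbf{v}_k^{T}

% Hence
(A^{T}A)^{n}\, \mathbf{i}^{(0)}
  = \lambda_1^{n} \left[ (\mathbf{v}_1^{T}\mathbf{i}^{(0)})\, \mathbf{v}_1
    + \sum_{k \ge 2} \left(\frac{\lambda_k}{\lambda_1}\right)^{n}
      (\mathbf{v}_k^{T}\mathbf{i}^{(0)})\, \mathbf{v}_k \right]
```

Every term with k ≥ 2 vanishes as n grows, so the direction of i^(n) approaches v_1 for any seed vector with v_1ᵀ i^(0) ≠ 0. The seed affects only the scale of the result, not the ranking.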

How about Espresso?
Espresso has heuristics not present in Simplified Espresso
• Early stopping
• Pattern and instance selection
Do these heuristics really help reduce semantic drift?

Experiments on semantic drift
Do the heuristics in the original Espresso help reduce drift?

Word sense disambiguation task of Senseval-3 English Lexical Sample
Predict the sense of “bank”
• … the financial benefits of the bank (finance) 's employee package (cheap mortgages and pensions, etc.), bring this up to …
• In that same year I was posted to South Shields on the south bank (bank of the river) of the River Tyne and quickly became aware that I had an enormous burden
• Possibly aligned to water a sort of bank (???) by a rushing river.
Training instances are annotated with their sense
Predict the sense of the target word in the test set

Word sense disambiguation by Espresso
Seed instance = the instance whose sense is to be predicted
System output = k-nearest neighbor (k=3)
Heuristics of Espresso
• Pattern and instance selection
  • # of patterns to retain p=20 (increase p by 1 on each iteration)
  • # of instances to retain m=100 (increase m by 100 on each iteration)
• Early stopping

Convergence process of Espresso
[Figure: precision (0.5 to 1) against iteration (1 to 26) for Espresso, Simplified Espresso, and the most frequent sense baseline, which outputs the most frequent sense regardless of input]
• The heuristics in Espresso help reduce semantic drift (however, early stopping is required for optimal performance)
• Semantic drift occurs (always outputs the most frequent sense)

Learning curve of Espresso: per-sense breakdown
[Figure: precision (0.3 to 1) against iteration (1 to 26), plotted separately for the most frequent sense and for other senses]
• The # of most frequent sense predictions increases
• Precision for infrequent senses worsens even with the original Espresso

Summary: Espresso and semantic drift
Semantic drift happens because
• Espresso is designed like HITS
• HITS gives the same ranking list regardless of seeds
Some heuristics reduce semantic drift
• Early stopping is crucial for optimal performance
Still, these heuristics require
• many parameters to be calibrated
• but calibration is difficult

Main contributions of this work
1. Suggest a parallel between semantic drift in Espresso-like bootstrapping and topic drift in HITS (Kleinberg, 1999)
2. Solve semantic drift by graph kernels used in the link analysis community

Q. What caused drift in Espresso?
A. Espresso's resemblance to HITS
HITS is an importance computation method (it gives a single ranking list for any seeds)
Why not use another type of link analysis measure, one which takes seeds into account?
➔ A "relatedness" measure (it gives different rankings for different seeds)

The regularized Laplacian kernel
• A relatedness measure
• Takes higher-order relations into account
• Has only one parameter
Normalized graph Laplacian: $L = I - D^{-1/2} A D^{-1/2}$
A: adjacency matrix of the graph; D: (diagonal) degree matrix
Regularized Laplacian matrix: $R_\beta = \sum_{n=0}^{\infty} \beta^{n} (-L)^{n} = (I + \beta L)^{-1}$
β: parameter
Each column of $R_\beta$ gives the rankings relative to a node
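
As a minimal sketch of the kernel, assuming the standard definitions L = I − D^(−1/2) A D^(−1/2) and R_β = (I + βL)^(−1): the function name and the small example graph are hypothetical, and the graph is assumed to be undirected with no isolated nodes.

```python
import numpy as np

def regularized_laplacian(adjacency, beta):
    """Return R_beta = (I + beta * L)^-1 for the normalized Laplacian L."""
    W = np.asarray(adjacency, dtype=float)
    # Assumption: the graph is undirected and every node has non-zero degree.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    identity = np.eye(W.shape[0])
    L = identity - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    return np.linalg.inv(identity + beta * L)

# Hypothetical graph: a triangle (nodes 0, 1, 2) bridged to a pair (nodes 3, 4).
adjacency = np.array([[0, 1, 1, 0, 0],
                      [1, 0, 1, 0, 0],
                      [1, 1, 0, 1, 0],
                      [0, 0, 1, 0, 1],
                      [0, 0, 0, 1, 0]])
R = regularized_laplacian(adjacency, beta=0.01)
ranking_from_0 = np.argsort(-R[:, 0])   # nodes ranked relative to node 0
ranking_from_4 = np.argsort(-R[:, 4])   # nodes ranked relative to node 4
print(ranking_from_0, ranking_from_4)
```

Unlike the HITS-style iteration, the two columns give different rankings, because each column is computed relative to its own node.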

Label propagation algorithm [Zhou et al., 2004]
X is a set of instances, $x_i$ is an instance, and Y holds the initial labels
• Form the affinity matrix W defined by $W_{ij} = \exp(-\|x_i - x_j\|^{2} / 2\sigma^{2})$ if i ≠ j and $W_{ii} = 0$.
• Construct the matrix $S = D^{-1/2} W D^{-1/2}$, in which D is a diagonal matrix with its (i, i)-element equal to the sum of the i-th row of W.
• Iterate $F(t+1) = \alpha S F(t) + (1 - \alpha) Y$ until convergence, where α is a parameter in (0, 1).
• Let F* denote the limit of the sequence {F(t)}. Label each point $x_i$ as a label $y_i = \arg\max_j F^{*}_{ij}$.
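
The four steps can be sketched as below, assuming the Gaussian affinity and the update F(t+1) = αSF(t) + (1 − α)Y of Zhou et al. The function name, the one-dimensional toy points, and the values chosen for σ and α are hypothetical.

```python
import numpy as np

def label_propagation(X, Y, sigma=0.5, alpha=0.9, tol=1e-9, max_iter=10000):
    """Propagate the initial labels Y over the points X; return (labels, F)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Step 1: Gaussian affinity matrix with a zero diagonal.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: symmetric normalization S = D^-1/2 W D^-1/2.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # Step 3: iterate until the scores stop changing.
    F = Y.copy()
    for _ in range(max_iter):
        F_next = alpha * (S @ F) + (1 - alpha) * Y
        done = np.abs(F_next - F).max() < tol
        F = F_next
        if done:
            break
    # Step 4: each point takes the label with the highest score.
    return F.argmax(axis=1), F

# Hypothetical data: two well-separated groups, one labeled point in each.
X = [[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]]
Y = [[1, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 1]]
labels, F = label_propagation(X, Y)
print(labels)
```

Only the first and last points carry labels in Y; the remaining points receive their labels through the graph.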

Laplacian label propagation algorithm
• Form the affinity matrix W defined by $W = A^{T} A$, where A is a row-normalized instance-pattern co-occurrence matrix.
• Construct the normalized Laplacian matrix $L = I - D^{-1/2} W D^{-1/2}$.
• Iterate $F(t+1) = \alpha (-L) F(t) + (1 - \alpha) F(0)$ until convergence, where α is a parameter in (0, 1).
• Let F* denote the limit of the sequence {F(t)}. Label each point $x_i$ as a label $y_i = \arg\max_j F^{*}_{ij}$.

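The sequence {F(t)} converges to $F^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y$
Proof:
• Suppose F(0) = Y.
• By the iterative algorithm, $F(t) = (\alpha S)^{t} Y + (1 - \alpha) \sum_{k=0}^{t-1} (\alpha S)^{k} Y$.
• Since 0 < α < 1 and the eigenvalues of S (that is, of −L in the Laplacian variant) lie in [−1, 1], $\lim_{t \to \infty} (\alpha S)^{t} = 0$ and $\lim_{t \to \infty} \sum_{k=0}^{t-1} (\alpha S)^{k} = (I - \alpha S)^{-1}$.
• Hence $F^{*} = (1 - \alpha)(I - \alpha S)^{-1} Y$ for classification, which is equivalent to $F^{*} = (I - \alpha S)^{-1} Y = (I + \alpha L)^{-1} Y$ when S = −L.
This is also a form of the regularized Laplacian kernel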
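
The convergence claim can be checked numerically. The sketch below assumes the Laplacian variant of the update, F(t+1) = α(−L)F(t) + (1 − α)Y, and an affinity matrix W = AᵀA built from a matrix A indexed as patterns × instances; these choices, the toy matrix, and the function names are assumptions for illustration. With W positive semidefinite, the eigenvalues of −L lie in [−1, 0], so the iteration converges for any α in (0, 1).

```python
import numpy as np

def normalized_laplacian(W):
    """L = I - D^-1/2 W D^-1/2, with D the diagonal matrix of row sums of W."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

def propagate(L, Y, alpha, n_iter=2000):
    """Run F(t+1) = alpha * (-L) F(t) + (1 - alpha) * Y starting from F(0) = Y."""
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * ((-L) @ F) + (1 - alpha) * Y
    return F

# Hypothetical co-occurrence matrix: 3 patterns (rows) x 4 instances (columns).
A = np.array([[1.0, 0.5, 0.0, 0.0],
              [0.5, 1.0, 0.2, 0.0],
              [0.0, 0.1, 1.0, 0.8]])
A = A / A.sum(axis=1, keepdims=True)   # row-normalize
W = A.T @ A                            # instance-by-instance affinity
Y = np.array([[1.0, 0.0],              # instance 0 is a seed of class 0
              [0.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0]])             # instance 3 is a seed of class 1
alpha = 0.9

L = normalized_laplacian(W)
F_iter = propagate(L, Y, alpha)
F_closed = (1 - alpha) * np.linalg.solve(np.eye(len(W)) + alpha * L, Y)
print(np.abs(F_iter - F_closed).max())
```

The printed value is the largest gap between the iterated scores and the closed form (1 − α)(I + αL)^(−1) Y; it should be at the level of floating-point error.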

Word Sense Disambiguation Evaluation of regularized Laplacian against Espresso

Label prediction of “bank” (F measure)
Espresso suffers from semantic drift (unless stopped at optimal stage)

Algorithm                          Most frequent sense   Other senses
Simplified Espresso                100.0
Espresso (after convergence)       100.0                 30.2
Espresso (optimal stopping)         94.4                 67.4
Regularized Laplacian (β=10^-2)     92.1                 62.8

The regularized Laplacian keeps high recall for infrequent senses

WSD on all nouns in Senseval-3
Espresso needs optimal stopping to achieve an equivalent performance

Algorithm                          F measure
Most frequent sense (baseline)     54.5
HyperLex                           64.6
PageRank                           64.6
Simplified Espresso                44.1
Espresso (after convergence)       46.9
Espresso (optimal stopping)        66.5
Regularized Laplacian (β=10^-2)    67.1

Outperforms other graph-based methods

The regularized Laplacian is stable across parameter values

Learning Semantic Categories Evaluation of regularized Laplacian on large-scale web data

Graph-based approach to learn semantic knowledge
• We showed that some bootstrapping algorithms can be regarded as computing similarity over a graph
[Figure: bipartite graph linking instances (Singapore, Hong Kong, China, UFJ, ANA, ?) to patterns (___ history, ___ map, ___ visa, ___ airlines)]
• Never tried on large-scale data -> Does it scale?
• Tested on a word sense disambiguation task -> Is it task-independent?

Bootstrapping achieves high precision, but requires calibration in practice
Pros
• Dedicated to Japanese web search query logs
• Outperforms previous methods on this task
• Orders of magnitude faster than previous bootstrapping methods
Cons
• Needs to control many parameters and configurations
• Has little theoretical background
• Does not scale to large corpora

Graph-based methods are simple but effective on large-scale data such as web documents
Pros
• Can scale to massive amounts of raw data
• Firm mathematical background (can employ classical optimization methods)
Cons
• Computational efficiency (can be approximated)
• No obvious “good” graph
• Requires computational resources (CPU, disk, memory, etc.) and technical know-how

Quetchup algorithm (QUEry Term CHUnk Processor)
• Using clickthrough logs as source of information
• Semi-supervised learning with Laplacian Label Propagation
• Efficient computation over a graph

Learning semantic categories from clickthrough graph
• Query “Singapore” –(click)-> http://www.cikm2009.org/
[Figure: clickthrough graph linking queries (Singapore, Hong Kong, China, UFJ, ANA) to clicked URLs: http://www.cikm2009.org/, http://www.acl-ijcnlp-2009.org/, http://www.singaporair.com/hk.jsp, http://en.wikipedia.org/wiki/Hong_Kong, http://www.bk.mufg.jp/, http://www.china-airlines.co.jp/, http://www.ana.co.jp/]

Experimental setting
• Search logs: Japanese web search logs collected in August 2008; retained the top 10 million distinct queries
• Target categories: Travel and Finance [Komachi and Suzuki, 2008]
• Evaluation: precision at k; relative recall [Pantel and Ravichandran, 2004]
$R_{A|B} = \frac{R_A}{R_B} = \frac{C_A / C}{C_B / C} = \frac{C_A}{C_B} = \frac{P_A \times |A|}{P_B \times |B|}$
where $R_{A|B}$ is the relative recall of system A given B, $R_X$ is the recall of system X, $C_X$ is the number of correct outputs from system X, C is the true number of correct instances, $P_X$ is the precision of system X, and |X| is the number of input instances for system X
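
Relative recall needs only the precision and the instance count of each system, because the unknown true count C cancels. A minimal sketch, assuming the form R_A|B = (P_A × |A|) / (P_B × |B|); the function name and the numbers in the example are hypothetical and are not results from the talk.

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Relative recall of system A given system B.

    C_X = P_X * |X| estimates the number of correct instances of system X,
    so R_{A|B} = C_A / C_B without knowing the true total C.
    """
    correct_a = precision_a * size_a
    correct_b = precision_b * size_b
    return correct_a / correct_b

# Hypothetical systems: A has precision 0.8 over 1000 instances,
# B has precision 0.9 over 500 instances.
ratio = relative_recall(0.8, 1000, 0.9, 500)
print(ratio)
```

A value above 1 means system A recovers more correct instances than system B.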

Seed instances for each category
Category: Seed
Travel: jal (Japan Airlines), ana (All Nippon Airways), jr (Japan Railways), じゃらん (jalan: online travel guide site), his (H.I.S. Co., Ltd.: travel agency)
Finance: みずほ銀行 (Mizuho Bank), 三井住友銀行 (Sumitomo Mitsui Banking Corporation), jcb, 新生銀行 (Shinsei Bank), 野村證券 (Nomura Securities)

Precision of Travel domain: the regularized Laplacian gives best precision

Precision of Finance domain: again, the regularized Laplacian gives best precision

Relative recall of Travel domain: clickthrough logs not only achieve high precision but also high recall

Relative recall of Finance domain

Relative recall of Quetchup with varying click: query

Precision of Quetchup with varying click: query

Instances and patterns with the top 10 highest scores

System: Quetchup click
Instance: じゃらん宿泊 (jalan accommodation), じゃラン (jalan), ジャラン (jalan), jarann, jaran, じゃらんnet (jalan.net), jalan, じゅらん (julan), ana 予約 (ana reservation), ana.co.jp
Pattern: www.jalan.net/, www.ana.co.jp/, www.his-j.com/, www.jreast.co.jp/, www.jtb.co.jp/ace/, www.westjr.co.jp/, www.jtb.co.jp/kaigai/, nippon.his.co.jp/, www.jr.cyberstation.ne.jp/

System: Quetchup query
Instance: 中部発 (Traveling from Midland), his関西 (his Kansai), 伊平屋島 (Iheya Island), ホテルコンチネンタル横浜 (Hotel Continental Yokohama), げんじいの森 (Genjii-no-mori; spa), フジサファリパーク (Fuji Safari Park), ad-box, アダコミ (adacomi; offensive), スカイチーム (SkyTeam), ノースウェスト (Northwest)
Pattern: # 時刻表 (timetable), # 国内旅行 (domestic tour), # 宿泊 (accommodation), # 北海道 (Hokkaido), # 関西 (Kansai), # 九州 (Kyushu), # マイレージ (mileage), # 名古屋 (Nagoya), # 沖縄 (Okinawa), # 温泉 (spa)

System: Tchai
Instance: 静鉄バス (Seitetsu Bus), 相鉄バス (Sotetsu Bus), 函館バス (Hakodate Bus), 大阪地下鉄 (Osaka subway), 琴電 (Kotoden railways), 地下鉄御堂筋線 (Subway Midosuji Line), 芸陽バス (Geiyo Bus), 新京成バス (Shin-keisei Bus), jr
Pattern: # 時刻表, # 路線図 (route map), # 運賃 (fare), # 料金 (fare), # 定期 (season ticket), # 運行状況 (service situation), # 路線 (route), # 定期代 (season ticket fare), # 定期券 (season

Random samples of extracted instances
Type (Frequency): Example
Transportation (54): 広島新幹線 (Hiroshima Super Express), 東海道線 (Tokaido Line), jr飯田線 (JR Iida Line), jr博多 (JR Hakata), 京都新幹線 (Kyoto Super Express)
Accommodation (10): ホテルビーナス (Hotel Venus), リーガロイヤルホテル大阪 (Rihga Royal Hotel Osaka), www.route-inn.co.jp, ホテル京阪ユニバーサル・シティ (Hotel Keihan Universal City), 札幌全日空ホテル (ANA Hotel Sapporo)
Travel Information (10): 外務省安全 (Foreign Ministry safety), チケットショップ大阪 (ticket shop Osaka), 観光関西 (Sightseeing Kansai), 高山観光協会 (Takayama Tourism Association), グーグルナビ (Google navi)
Travel Agency (6): jr おでかけネット (Odekake net), 近畿ツー (Kinki Tourist), タビックス静岡 (Tabix Shizuoka), フレックスインターナショナル (Flex International), オリオンツアー (Orion Tour)
Others (2): プロテカ (Proteca; bag for travel), jal紀行倶楽部 (JAL Travel Club)
Not Related (20): 格安航空チケット海外 (discount flight ticket overseas), 新幹線予約状況 (Super express reservation situation), 新幹線時刻表 (Super express timetable), 温泉宿 (spa accommodation), 新幹線停車駅 (Super express stops), 虎 (tiger), youtubu海外ドラマ (overseas drama), 法務部採用 (legal department recruitment), おくりびと (Okuribito; film), 社会人野球 (amateur baseball)

Learning curve of Quetchup click: the more data, the higher the precision

Sensitivity of Quetchup click to parameter α
The clickthrough graph seems much denser than the query graph, and a large α (less weight on initial labels) yields better performance than a small α

Conclusions
• Semantic drift in Espresso is a parallel form of topic drift in HITS
• The regularized Laplacian reduces semantic drift in two tasks: word sense disambiguation and named entity extraction
  • It is inherently a relatedness measure (≠ importance measure)

Future work
• Investigate whether a similar analysis is applicable to a wider class of bootstrapping algorithms (including co-training)
• Investigate the influence of seed selection on bootstrapping algorithms and propose a way to select effective seed instances
• Explore the relation between Markov random walks and label propagation methods