Breaking through the syntax barrier: Searching with entities and relations
Soumen Chakrabarti, IIT Bombay
www.cse.iitb.ac.in/~soumen
ECML/PKDD Chakrabarti 2004

Wish upon a textbox, 1996
Your information need here

Wish upon a textbox, 1998
Your information need here
“A rising tide of data lifts all algorithms”

Wish upon a textbox, post-IPO
Your information need (still) here
§ Now indexing 4,285,199,774 pages
§ Same interface, therefore same 2-word queries
§ Mind-reading wizard-in-black-box saves the day

“If music had been invented ten years ago along with the Web, we would all be playing one-string instruments (and not making great music).”
Udi Manber, A9.com, plenary speech, WWW 2004

Examples of “great music”…
§ The commercial angle: they’ll call you
  • Want to buy X, find reviews and prices
  • Cheap tickets to Y
§ Noun-phrase + Pagerank saves the day
  • Find info about diabetes
  • Find the homepage of Tom Mitchell
§ Searching vertical portals: garbage control
  • Searching Citeseer or IMDB through Google
§ Someone out there had the same problem
  • +adaptec +aha2940uw +lilo +bios

… and not-so-great music
§ Which produces better responses?
  • Opera fails to connect to secure IMAP tunneled through SSH
  • opera connect imap ssh tunnel
§ Unable to express many details of information need
  • Opera the email client, not a kind of music
  • The problem is with Opera, not ssh, imap, applet
  • “Secure” is an attribute of imap, but the two words may not be juxtaposed in the text

Why telegraphic queries fail
§ Information need relates to entities and relationships in the real world
§ But the search engine gets only strings
§ Risk of over- or under-specified queries
  • Never know true recall
  • No time to deal with poor precision
§ Query word distribution dramatically different from corpus distribution
  • Query is inherently incomplete
  • Fix some info, look for other

[Figure: map of the search landscape, with regions numbered 1, 2 and 3.
One axis, structure in the query: bag-of-words, Bool, Prox, XQuery, SQL.
Other axis, structure in the corpus: “free-format text”, HTML, XML, entity+relations.
Annotations on the map:
§ Domain of IR; time to analyze queries much more deeply
§ So far, largely for power-users; can be generated by programs
§ No schema; either map to schema or support query on uninterpreted graphs
§ Broad, deep and ad-hoc schema and data integration: very difficult!
§ Defined many real problems away; “solved” apart from performance issues
§ Defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications]

Past the syntax barrier: early steps
§ (1) Taking the question apart
  • Question has known parts and unknown “slots”
  • Query-dependent information extraction
§ (2) Compiling relations from the Web
  • is-instance-of (is-a), is-subclass-of
  • is-part-of, has-attribute
§ (3) Graph models for textual data
  • Searching graphs with keywords and twigs
  • Global probabilistic graph labeling

Part-1: Working harder on the question

Atypes and ground constants
§ Specialize given domain to a token related to ground constants in the query
  • What animal is Winnie the Pooh?
    instance-of(“animal”) NEAR “Winnie the Pooh”
  • When was television invented?
    instance-of(“time”) NEAR “television” NEAR synonym(“invented”)
§ FIND x NEAR GroundConstants(question) WHERE x IS-A Atype(question)
  • Ground constants: Winnie the Pooh, television
  • Atypes: animal, time

Taking the question apart
§ Atype: the type of the entity that is an answer to the question
§ Problem: don’t want to compile a classification hierarchy of entities
  • Laborious, can’t keep up
  • Offline rather than question-driven
§ Instead
  • Set up a very large basis of features
  • “Project” question and corpus to basis

Scoring tokens for correct Atypes
§ FIND x “NEAR” GroundConstants(question) WHERE x IS-A Atype(question)
§ No fixed question or answer type system
§ Convert “x IS-A Atype(question)” to a soft match DoesAtypeMatch(x, question)
[Figure: the question yields a question feature vector; answer tokens in the passage pass through IE-style surface feature extractors, IS-A feature extractors and other extractors to yield a snippet feature vector; a joint distribution is learned over the two vectors.]

Features for Atype matching
§ Question features: 1-, 2- and 3-token sequences starting with standard wh-words
  • where, when, who, how_X, …
§ Passage surface features: hasCap, hasXx, isAbbrev, hasDigit, isAllDigit, lpos, rpos, …
§ Passage IS-A features: all generalizations of all noun senses of token
  • Use WordNet: horse → equid → ungulate, hoofed mammal → placental mammal → animal → … → entity
  • These are node IDs (“synsets”) in WordNet, not strings
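The passage surface features named above can be sketched as simple predicates on a token. The slides give only the feature names, so every definition below is an assumption made for illustration (for example, hasXx is read here as "one capital followed by lowercase letters"); lpos and rpos are omitted because their meaning is not stated.

```python
import re


def surface_features(token: str) -> dict:
    """Boolean surface features for one passage token.

    The feature names come from the slide; the definitions are guesses.
    """
    return {
        # At least one upper-case character anywhere in the token.
        "hasCap": any(ch.isupper() for ch in token),
        # Assumed meaning: capitalized word shape such as "Japan".
        "hasXx": re.fullmatch(r"[A-Z][a-z]+", token) is not None,
        # Assumed meaning: dotted abbreviation such as "U.S.".
        "isAbbrev": re.fullmatch(r"(?:[A-Za-z]\.)+", token) is not None,
        # At least one digit, e.g. "3,000".
        "hasDigit": any(ch.isdigit() for ch in token),
        # Digits only, e.g. "1996".
        "isAllDigit": token.isdigit(),
    }
```

Each token's feature dictionary would then be one part of the passage vector; the IS-A part would come from a WordNet lookup, which is not shown here.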

Supervised learning setup
§ Get top 300 passages from IR engine
  • “Promising but negative” instances
  • Crude approximation to active learning
§ For each token invoke feature extractors
§ Question vector xq, passage vector xp
  • How to represent combined vector x?
§ Label = 1 if token is in answer span, 0 otherwise
  • Questions and answers from logs

Joint feature-vector design
§ Obvious “linear” juxtaposition x = (xp, xq)
  • Does not expose pairwise dependencies
§ “Quadratic” form x = xq ⊗ xp
  • All pairwise products of elements
§ Model has a param for every pair, e.g. question features {how_far, when, what_city} × passage features {region#n#3, entity#n#1}
§ Can discount for redundancy in pair info
§ If xq (xp) is fixed, what xp (xq) will yield the largest Pr(Y=1|x)?
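A minimal sketch of the two joint-vector designs, assuming plain Python lists for the question and passage vectors. The function names and the toy feature ordering are illustrative and not from the talk.

```python
def linear_join(xq, xp):
    """'Linear' juxtaposition: concatenate passage and question features."""
    return list(xp) + list(xq)


def quadratic_join(xq, xp):
    """'Quadratic' form: one element per (question, passage) feature pair."""
    return [q * p for q in xq for p in xp]


# Hypothetical ordering: xq = (how_far, when, what_city),
# xp = (region#n#3, entity#n#1).
xq = [0, 0, 1]
xp = [1, 1]
joint = quadratic_join(xq, xp)  # 3 * 2 = 6 pair features
```

With the quadratic form a linear classifier gets one weight per pair (for example what_city paired with region#n#3), which the concatenated vector cannot express.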

Classification accuracy
§ Pairing more accurate than linear model
§ Are the estimated w parameters meaningful?
§ Given question, can return most favorable answer feature weights

Parameter anecdotes
§ Surface and WordNet features complement each other
§ General concepts get negative params: use in predictive annotation
§ Learning is symmetric (Q ↔ A)

Taking the question apart
✓ Atype: the type of the entity that is an answer to the question
§ Ground constants: Which question words are likely to appear (almost) unchanged in an answer passage?
§ Arises in Web search sessions too
  • Opera login fails
  • problem with login Opera email
  • Opera login accept password
  • Opera account authentication
  • …

Features to identify ground constants
§ Local and global features
  • POS of word, POS of adjacent words, case info, proximity to wh-word
§ Suppose word is associated with a set S of synsets
  • NumSense: size of S (is word very polysemous?)
  • NumLemma: average #lemmas describing s ∈ S (are there many aliases?)
§ Model as a sequential learning problem
  • Each token has local context (POS@-1, POS@0, POS@1) and global features
§ Label: does token appear near answer?

Ground constants: sample results
§ Global features (IDF, NumSense, NumLemma) essential for accuracy
  • Best F1 accuracy with local features alone: 71 to 73%
  • With local and global features: 81%
§ Decision trees better than logistic regression
  • F1 = 81% as against LR F1 = 75%
  • Intuitive decision branches

Summary of the Atype strategy
§ “Basis” of atypes A; a ∈ A could be a synset, surface pattern, or feature of a parse tree
§ Question q “projected” to vector (w_a : a ∈ A) in atype space via learning a conditional model
§ If q is “when…” or “how long…”, w_hasDigit and w_time_period#n#1 are large, w_region#n#1 is small
§ Each corpus token t has associated indicator features a(t) for every a
§ hasDigit(3,000) = is-a(region#n#1)(Japan) = 1
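One way to read the summary above is as a dot product between the question's weights in atype space and a token's indicator features. The sketch below assumes that reading; the weight values are invented for illustration and are not learned parameters from the talk.

```python
def atype_score(question_weights, token_atypes):
    """Sum the question's weight w_a over every atype a with a(t) = 1."""
    return sum(question_weights.get(a, 0.0) for a in token_atypes)


# Hypothetical projection of a "when..." question into atype space.
w_when = {"hasDigit": 2.0, "time_period#n#1": 1.5, "region#n#1": -1.0}

# Indicator features that fire for two corpus tokens (from the slide's example).
atypes_3000 = {"hasDigit"}
atypes_japan = {"region#n#1"}

score_3000 = atype_score(w_when, atypes_3000)
score_japan = atype_score(w_when, atypes_japan)
```

Under these made-up weights the token 3,000 outscores Japan for a "when" question, which matches the direction the slide describes.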

Reward proximity to ground constants
§ A token t is a candidate answer if [formula missing]; the formula’s annotated parts are:
  • Projection of question to “atype space”
  • Atype indicator features of the token
§ Hq(t): Reward tokens appearing “near” ground constants matched from question
  • Example: “…the armadillo, found in Texas, is covered with strong horny plates”
§ Order tokens by decreasing [formula missing]

Evaluation: Mean reciprocal rank (MRR)
§ nq = smallest rank among answer passages
§ MRR = (1/|Q|) Σ_{q ∈ Q} (1/nq)
  • Dropping passage from #1 to #2 is as bad as dropping it from #2 to not reporting it at all
Experiment setup:
§ 300 top IR score passages
§ If Pr(Y=1|token) < threshold ⇒ reject token
§ If all tokens rejected ⇒ reject passage
§ Points below diagonal are good
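The MRR definition above can be checked with a few lines of code. This sketch assumes each question is summarized by its best answer rank nq, with None standing for "no answer passage reported" (reciprocal rank 0); that encoding is a choice made here, not something the slides specify.

```python
def mean_reciprocal_rank(best_ranks):
    """MRR = (1/|Q|) * sum over questions of 1/n_q.

    best_ranks holds, per question, the smallest rank of an answer
    passage (1-based), or None when no answer passage was reported.
    """
    if not best_ranks:
        raise ValueError("need at least one question")
    total = sum(0.0 if n is None else 1.0 / n for n in best_ranks)
    return total / len(best_ranks)
```

For a single question, moving the answer from rank 1 to rank 2 costs 1 - 1/2 = 1/2, and moving it from rank 2 to unreported costs 1/2 - 0 = 1/2, which is the equivalence the slide points out.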

Sample results
§ Accept all tokens ⇒ IR baseline MRR
§ Moderate acceptance threshold ⇒ non-answer passages eliminated, improves answer ranks
§ High threshold ⇒ true answers eliminated
  • Another answer with poor rank, or rank = ∞
§ Additional benefits from proximity filtering

Part-2: Compiling fragments of soft schema

Who provides is-a info?
§ Compiled KBs: WordNet, CYC
§ Automatic “soft” compilations
  • Google sets
  • KnowItAll
  • BioText
§ Can use as evidence in scoring answers

Extracting is-instance-of info
§ Which researcher built the WHIRL system?
  • WordNet may not know Cohen IS-A researcher
§ Google has over 4.2 billion pages
  • “william cohen” on 86100 (p1 = 86.1k/4.2B)
  • researcher on 4.55M (p2 = 4.55M/4.2B)
  • +researcher +“william cohen” on 1730: 18.55x more frequent than expected if independent
§ Pointwise mutual information (PMI)
§ Can add high-precision, low-recall patterns
  • “cities such as New York” (26600 hits)
  • “professor Michael Jordan” (101 hits)
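The 18.55x figure follows directly from the hit counts on the slide. The sketch below recomputes it; taking the base-2 logarithm of the ratio as the PMI value is an assumption, since the slide does not show which base (if any) it uses.

```python
import math


def cooccurrence_ratio(n_xy, n_x, n_y, n_total):
    """Observed joint frequency divided by the frequency expected
    if the two terms occurred independently."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return p_xy / (p_x * p_y)


# Hit counts quoted on the slide.
ratio = cooccurrence_ratio(n_xy=1730, n_x=86100, n_y=4.55e6, n_total=4.2e9)
pmi_bits = math.log2(ratio)  # log base is an assumption
print(round(ratio, 2))  # 18.55
```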

Bootstrapping lists of instances
§ Hearst 1992, Brin 1997, Etzioni 2004
§ A “propose-validate” approach
  • Using existing patterns, generate queries
  • For each web page w returned
    - Extract potential fact e and assign confidence score
    - Add fact to database if it has high enough score
§ Example patterns
  • NP1 {,} {such as|and other|including} NPList2
  • NP1 is a NP2, NP1 is the NP2 of NP3
  • the NP1 of NP2 is NP3
§ Start with NP1 = researcher etc.
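The "propose" half of the loop above can be sketched with a regular expression for the "such as / including" pattern. This is a deliberately crude stand-in: noun phrases are approximated by runs of capitalized words rather than a real chunker, the "and other" variant is left out, and no confidence score or validation step is computed. The function name is hypothetical.

```python
import re

# One capitalized "noun phrase": e.g. "New York" or "Tokyo".
_NP = r"[A-Z]\w*(?: [A-Z]\w*)*"

# class noun, optional comma, cue phrase, then a list of capitalized NPs.
_PATTERN = re.compile(
    r"\b(?P<cls>[a-z]+),? (?:such as|including) "
    r"(?P<items>" + _NP + r"(?:(?:, and | and |, )" + _NP + r")*)"
)


def extract_instances(text):
    """Return (class_noun, instance) pairs proposed by the pattern."""
    facts = []
    for match in _PATTERN.finditer(text):
        for item in re.split(r", and | and |, ", match.group("items")):
            facts.append((match.group("cls"), item))
    return facts
```

For example, the sentence "He visited cities such as New York, Paris and Tokyo last year." proposes three candidate facts with class noun "cities"; a full system would then validate each candidate before adding it to the database.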

System details
§ The importance of shallow linguistics working together with statistical tests
  • China is a (country)NP in Asia
  • Garth Brooks is a ((country)ADJ (singer)N)NP
§ Unary relation example
  • NP1 such as NPList2
    & head(NP1) = plural(name(Class1))
    & properNoun(head(each(NPList2)))
    ⇒ instanceOf(Class1, head(each(NPList2)))
  • head(·) is the “head” of the phrase

Compilation performance
§ Recall-vs-precision exposes size and difficulty of domain
  • “US state” is easy
  • “Country” is difficult
§ To improve signal-to-noise (STN) ratio, stop when confidence score is lower than threshold
  • Substantially improves recall-vs-precision

Exploiting is-a info for ranking
[Figure: the question yields a question feature vector; answer tokens in the passage pass through IE-style surface feature extractors and WordNet IS-A feature extractors to yield a snippet feature vector; PMI scores from search engine probes are a further input; a joint distribution is learned.]
§ Use PMI scores as additional features
§ Challenge: make frugal use of expensive inverted index probes

[Figure: map of the search landscape, with regions numbered 1, 2 and 3; region 3 is split into 3a (schema-free search on labeled graphs) and 3b (labeling graphs).
One axis, structure in the query: bag-of-words, Bool, Prox, XQuery, SQL.
Other axis, structure in the corpus: “free-format text”, HTML, XML, entity+relations.
Annotations on the map:
§ Domain of IR; time to analyze queries much more deeply
§ So far, largely for power-users; can be generated by programs
§ No schema; either map to schema or support query on uninterpreted graphs
§ Broad, deep and ad-hoc schema and data integration: very difficult!
§ Defined many real problems away; “solved” apart from performance issues
§ Defining, labeling, extracting and ranking answers are the major issues]

Mining graphs: applications abound
[Figure: a small graph over personal data. Email1 (“…your ECML submission titled XYZ has been accepted…”) and Email2 (“Could I get a preprint of your recent ECML paper?”) have EmailTo and EmailDate edges; a PDF file titled XYZ (author A. U. Thor, “Abstract: …”) has a LastMod edge to time; the name variants A. Thor and A. U. Thor link to a canonical node; ECML is a shared node. The PDF is marked “Want to quickly find this file”.]
§ No clean schema, data changes rapidly
§ Lots of generic “graph proximity” information

Part-3a: Searching graphs

Some useful graph query styles
§ Find single node with specified type strongly activated by given nodes
  • paper NEAR { author=“A. U. Thor” conference=“ECML” time=“now” }
  • May not know exact types of activators
§ Find connected subgraph that best explains how/why two or more nodes are related
  • connect { “Faloutsos” “Roussopoulos” }
  • Reward short paths, parallel paths, few distractions
§ Query is a small graph with match clauses

Measuring activation of single nodes
movie (near “richard gere” “julia roberts”)
http://www.cse.iitb.ac.in/banks/

Reporting connecting subgraphs
http://www.cse.iitb.ac.in/banks/

Schema-free twig queries
§ Query is a small graph, very light on types
  • Node type (dblpauthor, vldbjournal)
  • Simple predicates to qualify nodes (string, number)
§ Match to data graph
  • Precisely defined matches for local predicates
  • Reward for proximity

Sample twig result
[Figure: matched nodes labeled “VLDB journal article, 2001”, “Gerhard Weikum’s home page” and “Max-Planck Institute home page”.]

Main issues
§ What style of queries to support?
  • No end-user will type XQuery directly
  • But bag-of-words is too weak
§ How to rank result nodes and subgraphs?
  • Reward for text match, graph proximity, …
  • “Iceberg queries”: report top few after a lot of computation?
§ Can ranking be learnt as with question answering?
  • How to reuse and retarget supervision?

Part-3b: Probabilistic graph labeling

Identifying node types
§ How do we know a node has a given type?
  • Find person near “Pagerank”
  • Find student near (homepage of Tom Mitchell)
  • Find {verb, ####, “television”} close together
§ Nodes in our graph have
  • Visible attributes: “hobbies”, “my advisor is”, “start-of-sentence”
  • Invisible labels: “student”, “verb”
  • Neighbors: “HREF”, “linear juxtaposition”
§ Goal: given some node labels, guess all others

Example: Hypertext classification
§ Want to assign labels to Web pages
§ Text on a single page may be too little or misleading
§ Page is not an isolated instance by itself
§ Problem setup
  • Web graph G = (V, E)
  • Node u ∈ V is a page having text uT
  • Edges in E are hyperlinks
  • Some nodes are labeled
  • Make collective guess at missing labels
§ Probabilistic model? Benefits?

Graph labeling model
§ Seek a labeling f of all unlabeled nodes so as to maximize [formula missing]
  • One term of the formula is annotated: for the moment we don’t need to worry about this
§ Let VK be the nodes with known labels and f(VK) their known label assignments
§ Let N(v) be the neighbors of v and NK(v) ⊆ N(v) be neighbors with known labels
§ Markov assumption: f(v) is conditionally independent of rest of G given f(N(v))

Markov graph labeling
[Formula missing; its annotations read:]
  • Probability of labeling specific node v, given edges, text, partial labeling
  • Sum over all possible labelings of unknown neighbors of v
  • Markov assumption: label of v does not depend on unknown labels outside the set NU(v)
§ Circularity between f(v) and f(NU(v))
§ Some form of iterative Gibbs sampling or MCMC

Iterative labeling in pictures
§ c = class, t = text, N = neighbors
§ Text-only model: Pr[c|t]
§ Using neighbors’ labels to update my estimates: Pr[c|t, c(N)]
§ Repeat until histograms stabilize
  • Local optima possible

Iterative labeling: formula
[Formula missing; its annotations read:]
  • Label estimates of v in the next iteration
  • Take the expectation of this term over NU(v) labelings
  • Joint distribution over neighbors approximated as product of marginals
  • By the Markov assumption, finally we need a distribution coupling f(v) and vT (the text on v) and f(N(v))
§ Sum over all possible NU(v) labelings still too expensive to compute
§ In practice, prune to most likely configurations
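The iterative scheme described above can be sketched as follows. The coupling model is an assumption made for illustration: each node's score for class c is its text-only estimate multiplied, per neighbor, by the expected value of a class-compatibility table compat[c][c'] under that neighbor's current label distribution. Treating neighbors independently is the "product of marginals" approximation; the exact model from the talk is not recoverable from the slides.

```python
def normalize(scores):
    """Scale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


def iterative_labeling(text_prior, neighbors, known, compat, rounds=20):
    """Relaxation-style label estimation on a graph.

    text_prior: node -> {class: Pr[c|t]} from a text-only model
    neighbors:  node -> list of adjacent nodes
    known:      node -> fixed class label (never updated)
    compat:     compat[c][c2], hypothetical affinity between a node of
                class c and a neighbor of class c2
    """
    classes = list(compat)
    est = {}
    for v in text_prior:
        if v in known:
            est[v] = {c: 1.0 if c == known[v] else 0.0 for c in classes}
        else:
            est[v] = normalize(text_prior[v])
    for _ in range(rounds):
        new_est = {}
        for v in est:
            if v in known:
                new_est[v] = est[v]
                continue
            scores = {}
            for c in classes:
                s = text_prior[v][c]
                for u in neighbors.get(v, ()):
                    # Expectation over the neighbor's current marginal.
                    s *= sum(compat[c][cu] * est[u][cu] for cu in classes)
                scores[c] = s
            new_est[v] = normalize(scores)
        est = new_est
    return est
```

A fixed number of rounds stands in for "repeat until histograms stabilize"; a convergence check on the change in estimates would be the natural refinement.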

Labeling results
§ 9600 patents from 12 classes marked by USPTO
§ Patents have text and cite other patents
§ Expand test patent to include neighborhood
§ ‘Forget’ fraction of neighbors’ classes
§ Good for other data sets as well

Undirected Markov networks
§ Clique c ∈ C(G): a set of completely connected nodes
§ Clique potential φc(Vc): a function over all possible configurations of nodes in Vc
§ x = vector of observed features at all nodes
§ y = vector of labels at all nodes (mostly unknown)
[Figure: network in which each instance has a local feature variable and a label variable; edges between label variables show label coupling.]

Conditional Markov networks
§ Decompose conditional label probabilities into product over clique potentials [formula missing]
  • The sum in this formula is often a mess, so we need to get back to Gibbs sampling/MCMC anyway
§ Parametric model for clique potentials [formula missing], built from params of model and feature functions
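The slide's formulas did not survive extraction. For orientation, the standard log-linear form of a conditional Markov network that matches the annotations (product over clique potentials, a normalizing sum, parameters and feature functions) is shown below. The symbols x_c, y_c, w_k and f_k are assumed notation, and this may differ in detail from what the slide displayed.

```latex
\Pr(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{c \in C(G)} \phi_c(x_c, y_c),
\qquad
Z(x) \;=\; \sum_{y'} \prod_{c \in C(G)} \phi_c(x_c, y'_c),
\qquad
\phi_c(x_c, y_c) \;=\; \exp\!\Big( \sum_k w_k \, f_k(x_c, y_c) \Big)
```

Here Z(x) sums over all joint label assignments y', which is why it is expensive in general graphs.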

Markov network: results
§ Improves beyond text-only
§ Need more comparisons with iterative labeling
§ Can add many redundant features, e.g. are two links in the same HTML “section”?
  • Further improves accuracy

Graph labeling: Important special cases
§ Implementation issues
  • Look for large cliques
  • Still need sampling to estimate Z
§ Trees and chains have cliques of size 2
§ Estimation involves
  • Iterative numerical optimization (as before)
  • No sampling; direct estimation of Z possible
§ Many applications
  • Part-of-speech and named-entity tagging
  • Disambiguation, alias analysis
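For chains, "direct estimation of Z" can be illustrated with the standard forward recursion, which sums over all label sequences in time linear in the chain length. The sketch below works in log space and includes a brute-force enumerator for comparison; the data layout (per-position node scores, one shared edge-score table) and the toy numbers are assumptions for illustration.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_log_z(node_scores, edge_scores):
    """log Z for a chain via the forward recursion.

    node_scores: list over positions of {label: log potential}
    edge_scores: {(prev_label, cur_label): log potential}
    """
    labels = list(node_scores[0])
    alpha = dict(node_scores[0])
    for scores in node_scores[1:]:
        alpha = {
            cur: scores[cur]
            + logsumexp([alpha[prev] + edge_scores[(prev, cur)] for prev in labels])
            for cur in labels
        }
    return logsumexp(list(alpha.values()))


def brute_force_log_z(node_scores, edge_scores):
    """Same quantity by enumerating every label sequence (exponential)."""
    labels = list(node_scores[0])
    totals = []
    for y in product(labels, repeat=len(node_scores)):
        s = sum(node_scores[i][y[i]] for i in range(len(y)))
        s += sum(edge_scores[(y[i - 1], y[i])] for i in range(1, len(y)))
        totals.append(s)
    return logsumexp(totals)


# Toy three-token chain with two labels (values are made up).
toy_nodes = [{"N": 0.5, "V": -0.2}, {"N": 0.1, "V": 0.7}, {"N": 0.3, "V": 0.0}]
toy_edges = {("N", "N"): 0.2, ("N", "V"): 0.9, ("V", "N"): 0.4, ("V", "V"): -0.5}
```

The two functions agree on small inputs, but only the forward recursion scales to long sequences such as sentences in part-of-speech tagging.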

Summary
§ Extracting more structure from the query
  • This talk: using linguistic features
  • In general: query/click sessions, profile info, …
§ Compiling fragments of structure from a large corpus
  • Supervised sequence models
  • Semi-supervised bootstrapping techniques
§ Labeling and searching graphs
  • Coarse grained, e.g. Web graph, node = page
  • Fine grained, e.g. node = person, email, …
§ Pieces sometimes need each other

[Figure: map of the search landscape, with regions numbered 1, 2 and 3.
One axis, structure in the query: bag-of-words, Bool, Prox, XQuery, SQL.
Other axis, structure in the corpus: “free-format text”, HTML, XML, entity+relations.
Annotations on the map:
§ Domain of IR; time to analyze queries much more deeply
§ So far, largely for power-users; can be generated by programs
§ No schema; either map to schema or support query on uninterpreted graphs
§ Broad, deep and ad-hoc schema and data integration: very difficult!
§ Defined many real problems away; “solved” apart from performance issues
§ Defining, labeling, extracting and ranking answers are the major issues; no universal models; need many more applications]

Concluding messages
§ Work much harder on questions
  • Break down into what’s known, what’s not
  • Find fragments of structure when possible
  • Exploit user profiles and sessions
§ Perform limited pre-structuring of corpus
  • Difficult to anticipate all needs and applications
  • Extract graph structure where possible (e.g. is-a)
  • Do not insist on specific schema
§ Exploit statistical tools on graph models
  • Models of influence along links getting clearer
  • Estimation not yet ready for very large-scale use
  • Much room for creative feature and algo design