- Number of slides: 83
XML Full-Text Search: Challenges and Opportunities
Sihem Amer-Yahia, AT&T Labs – Research
Jayavel Shanmugasundaram, Cornell University
2 September 2005, VLDB Tutorial on XML Full-Text Search
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Motivation
• XML is able to represent a mix of structured and text information.
• XML applications: digital libraries, content management.
• XML repositories: IEEE INEX collection, LexisNexis, the Library of Congress collection.
THOMAS: Library of Congress
Abstract: This paper examines the problem of predicting the timing behavior of knowledge-based systems for real…
INEX Data
Challenges in XML FT Search
• Searching over Semi-Structured Data
– Users may specify a search context and return context.
• Expressive Power and Extensibility
– Users should be able to express complex full-text searches and combine them with structural searches.
• Scores and Ranking
– Users may specify a scoring condition, possibly over both full-text and structured predicates, and obtain top-k results based on query relevance scores.
– The language should allow for an efficient implementation.
XML FT Search Definition
• Context expression: XML elements searched:
– pre-defined XML nodes.
– XPath/XQuery queries.
• Return expression: XML fragments returned:
– pre-defined meaningful XML fragments.
– XPath/XQuery to build answers.
• Search expression: FT search conditions:
– Boolean keyword search.
– proximity distance, scoping, thesaurus, stop words, stemming.
• Score expression:
– system-defined scoring function.
– user-defined scoring function.
– query-dependent keyword weights.
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Four Classes of Languages
• Keyword search (INEX Content-Only Queries)
– “book xml”
• Tag + Keyword search
– book: xml
• Path Expression + Keyword search
– /book[./title about “xml db”]
• XQuery + Complex full-text search
– for $b in /book let score $s := $b ftcontains “xml” && “db” distance 5
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XRank [Guo et al., SIGMOD 2003]
XIRQL [Fuhr & Großjohann, SIGIR 2001]
Similar Notion of Results
• Nearest Concept Queries
– [Schmidt et al., ICDE 2002]
• XKSearch
– [Xu & Papakonstantinou, SIGMOD 2005]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XSearch [Cohen et al., VLDB 2003]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
XPath [W3C 2005]
• fn:contains($e, string) returns true iff $e contains string
//section[fn:contains(./title, “XML Indexing”)]
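The substring test above can be sketched in Python. This is only an illustration: the sample document and the helper name `sections_with_title_containing` are hypothetical, and the standard-library `xml.etree.ElementTree` module is used in place of a real XPath 2.0 engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names follow the query on the slide.
DOC = """
<doc>
  <section><title>XML Indexing Techniques</title></section>
  <section><title>Query Optimization</title></section>
  <section><title>Advanced XML Indexing</title>
    <section><title>Misc</title></section>
  </section>
</doc>
"""

def sections_with_title_containing(root, needle):
    """Emulate //section[fn:contains(./title, needle)] with a plain substring test."""
    hits = []
    for section in root.iter("section"):       # '//section': every section element
        title = section.find("title")          # './title': first child title element
        if title is not None and needle in "".join(title.itertext()):
            hits.append(section)
    return hits

root = ET.fromstring(DOC)
matches = sections_with_title_containing(root, "XML Indexing")
titles = [m.find("title").text for m in matches]
print(titles)  # ['XML Indexing Techniques', 'Advanced XML Indexing']
```

The test is an exact, case-sensitive substring match: a query for "xml indexing" or "indexing XML" would return nothing, which is one reason the later slides move to dedicated full-text operators.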
XIRQL [Fuhr & Großjohann, SIGIR 2001]
• Weighted extension to XQL (precursor to XPath)
//section[0.6 · .//* $cw$ “XQL” + 0.4 · .//section $cw$ “syntax”]
XXL [Theobald & Weikum, EDBT 2002]
• Introduces similarity operator ~
Select Z
From http://www.myzoos.edu/zoos.html
Where zoos.#.zoo As Z
and Z.animals.(animal)?.specimen as A
and A.species ~ “lion”
and A.birthplace.#.country as B
and A.region ~ B.content
NEXI [Trotman & Sigurbjörnsson, INEX 2004]
• Narrowed Extended XPath I
• INEX Content-and-Structure (CAS) Queries
//article[about(.//title, apple) and about(.//sec, computer)]
Outline
• Motivation
• Full-Text Search Languages
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Scoring
• Query Processing
• Open Issues
Schema-Free XQuery [Li, Yu, Jagadish, VLDB 2003]
• Meaningful least common ancestor (mlcas)
for $a in doc(“bib.xml”)//author,
    $b in doc(“bib.xml”)//title,
    $c in doc(“bib.xml”)//year
where $a/text() = “Mary” and exists mlcas($a, $b, $c)
return
XQuery Full-Text [W3C 2005]
• Two new XQuery constructs
1) FTContainsExpr
• Expresses “Boolean” full-text search predicates
• Seamlessly composes with other XQuery expressions
2) FTScoreClause
• Extension to FLWOR expression
• Can score FTContainsExpr and other expressions
FTContainsExpr
//book ftcontains “Usability” && “testing” distance 5
//book[./content ftcontains “Usability” with stems]/title
//book ftcontains /article[author=“Dawkins”]/title
FTScore Clause
FOR $v [SCORE $s]? IN [FUZZY] Expr
LET …
(FOR and LET clauses in any order)
WHERE …
ORDER BY …
RETURN
Example
FOR $b SCORE $s in /pub/book[. ftcontains “Usability” && “testing”]
ORDER BY $s
RETURN
Example
FOR $b SCORE $s in /pub/book[. ftcontains “Usability” && “testing” and ./price < 10.00]
ORDER BY $s
RETURN $b
Example
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains “Usability” && “testing”]
ORDER BY $s
RETURN $b
XQuery Full-Text Evolution
2002: Quark Full-Text Language (Cornell)
IBM, Microsoft, Oracle proposals
2003: TeXQuery (Cornell, AT&T Labs)
2004: XQuery Full-Text
2005: XQuery Full-Text (Second Draft)
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Full-Text Scoring
• Score value should reflect relevance of answer to user query. Higher scores imply a higher degree of relevance.
• Queries return document fragments. Granularity of returned results affects scoring.
• For queries containing conditions on structure, structural conditions may affect scoring.
• Existing proposals extend common scoring methods: probabilistic or vector-based similarity.
Granularity of Results
• Keyword queries
– compute possibly different scores for LCAs.
• Tag + Keyword queries
– compute scores based on tags and keywords.
• Path Expression + Keyword queries
– compute scores based on paths and keywords.
• XQuery + Complex full-text queries
– compute scores for (newly constructed) XML fragments satisfying XQuery (structural, full-text and scalar conditions).
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
Granularity of Results
• Document as hierarchical structure of elements as opposed to flat document.
– XXL [Theobald & Weikum, EDBT 2002]
– XIRQL [Fuhr & Großjohann, SIGIR 2001]
– XRANK [Guo et al., SIGMOD 2003]
• Propagate keyword weights along document structure.
XML Data Model
[Figure; legend: containment edge]
XXL [Theobald & Weikum, EDBT 2002]
• Compute similar terms with relevance score r1 using an ontology.
• Compute tf*idf of each term for a given element content with relevance score r2.
• Relevance of an element content for a term is r1*r2.
• r1 and r2 are computed as a weighted distance in an ontology graph.
• Probabilities of conjunctions multiplied (independence assumption) along elements of same path to compute path score.
Probabilistic Scoring
XIRQL [Fuhr & Großjohann, SIGIR 2001]
• Extension of XPath.
• Weighting and ranking:
– weighting of query terms:
• P(wsum((0.6, a), (0.4, b))) = 0.6 · P(a) + 0.4 · P(b)
– probabilistic interpretation of Boolean connectors:
• P(a && b) = P(a) · P(b)
XIRQL Example
• Query:
– “Search for an artist named Ulbrich, living in Frankfurt, Germany about 100 years ago”
• Data:
– “Ernst Olbrich, Darmstadt, 1899”
• Weights and ranking:
– P(Olbrich p Ulbrich) = 0.8 (phonetic similarity)
– P(1899 n 1903) = 0.9 (numeric similarity)
– P(Darmstadt g Frankfurt) = 0.7 (geographic distance)
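The two XIRQL combination rules can be tried out on the example probabilities. This is a minimal sketch, not XIRQL itself: the function names `wsum` and `p_and` are hypothetical, and treating the three vague conditions as one conjunction is an assumption made here for illustration.

```python
from math import prod

def wsum(weighted_probs):
    """Weighted sum: P(wsum((w1, a), (w2, b))) = w1 * P(a) + w2 * P(b)."""
    return sum(w * p for w, p in weighted_probs)

def p_and(*probs):
    """Conjunction under the independence assumption: P(a && b) = P(a) * P(b)."""
    return prod(probs)

# Probabilities of the three vague predicates in the example.
p_name, p_year, p_place = 0.8, 0.9, 0.7

# Assumption: the query conditions are combined as a conjunction.
combined = p_and(p_name, p_year, p_place)
print(round(combined, 3))  # 0.504

# Weighted sum with weights 0.6 and 0.4 and hypothetical P(a) = 0.5, P(b) = 1.0.
weighted = wsum([(0.6, 0.5), (0.4, 1.0)])
print(round(weighted, 3))  # 0.7
```

Under these assumptions the Olbrich record would still be returned, with a score of about 0.5, even though none of its three values matches the query exactly.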
PageRank [Brin & Page 1998]
[Figure: hyperlink edges out of node w, each labeled d/3]
d: Probability of following hyperlink
1-d: Probability of random jump
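The random-surfer model with parameters d and 1-d can be sketched as a power iteration. The graph, the value d = 0.85, and the handling of nodes without out-links are assumptions chosen for illustration.

```python
def pagerank(out_links, d=0.85, iterations=50):
    """Power iteration for the random-surfer model.

    out_links maps each node to the list of nodes it links to.
    d is the probability of following a hyperlink; 1 - d is the
    probability of a random jump to a uniformly chosen node.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - d) / n for v in nodes}
        for u in nodes:
            targets = out_links[u]
            if targets:
                # Each of the k out-links is followed with probability d / k.
                share = d * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Assumption: a node without out-links jumps uniformly at random.
                for v in nodes:
                    new_rank[v] += d * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical graph: w has three out-links, so each carries d/3.
graph = {"w": ["a", "b", "c"], "a": ["w"], "b": ["w"], "c": []}
ranks = pagerank(graph)
best = max(ranks, key=ranks.get)
print(best)  # w
```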
ElemRank [Guo et al., SIGMOD 2003]
[Figure: edges out of node w labeled d1/3 (hyperlink edges), d2/2 (containment edges), and d3]
d1: Probability of following hyperlink
d2: Probability of visiting a subelement
d3: Probability of visiting parent
1-d1-d2-d3: Probability of random jump
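The four navigation probabilities can be turned into a power iteration in the same way as PageRank. The following is a simplified sketch and not the exact ElemRank formula from the paper: the function name `elem_rank`, the parameter values, the uniform random jump, and the rule that an unavailable move (no hyperlinks, no subelements, or no parent) falls back to a random jump are all assumptions.

```python
def elem_rank(nodes, hyperlinks, children, d1=0.35, d2=0.25, d3=0.25, iterations=100):
    """Simplified ElemRank-style power iteration.

    hyperlinks[u]: elements that u links to (hyperlink edges)
    children[u]:   subelements of u (containment edges)
    """
    parent = {c: u for u in nodes for c in children.get(u, [])}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = dict.fromkeys(nodes, 0.0)
        for u in nodes:
            jump = 1.0 - d1 - d2 - d3          # probability of a random jump
            links = hyperlinks.get(u, [])
            kids = children.get(u, [])
            if links:                           # d1 split evenly over hyperlinks
                for v in links:
                    new_rank[v] += d1 * rank[u] / len(links)
            else:
                jump += d1
            if kids:                            # d2 split evenly over subelements
                for v in kids:
                    new_rank[v] += d2 * rank[u] / len(kids)
            else:
                jump += d2
            if u in parent:                     # d3 goes to the single parent
                new_rank[parent[u]] += d3 * rank[u]
            else:
                jump += d3
            for v in nodes:                     # uniform random jump
                new_rank[v] += jump * rank[u] / n
        rank = new_rank
    return rank

# Hypothetical element w with parent p, two subelements, and three hyperlinks.
nodes = ["w", "p", "c1", "c2", "h1", "h2", "h3"]
children = {"p": ["w"], "w": ["c1", "c2"]}
hyperlinks = {"w": ["h1", "h2", "h3"]}
scores = elem_rank(nodes, hyperlinks, children)
print(round(sum(scores.values()), 6))  # 1.0
```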
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
XSearch [Cohen et al., VLDB 2003]
• tf*ilf to compute weight of keyword for a leaf element.
• A vector is associated with each non-leaf element.
• sim(Q, N): sum of the cosine distances between the vectors associated with nodes in N and vectors associated with terms matched in Q.
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
Vector-based Scoring
JuruXML [Mass et al., INEX 2002]
• Transform query into (term, path) conditions:
article/bm/bibl/bb[about(., hypercube mesh torus nonnumerical database)]
• (term, path)-pairs:
hypercube, article/bm/bibl/bb
mesh, article/bm/bibl/bb
torus, article/bm/bibl/bb
nonnumerical, article/bm/bibl/bb
database, article/bm/bibl/bb
• Modified cosine similarity as retrieval function for vague matching of path conditions.
JuruXML Vague Path Matching
• Modified vector-based cosine similarity
Example of length normalization: cr(article/bibl, article/bm/bibl/bb) = 3/6 = 0.5
Query Relaxation on Structure
• Schlieder, EDBT 2002
• Delobel & Rousset, 2002
• Amer-Yahia et al., VLDB 2005
XML Query Relaxation [Amer-Yahia et al., EDBT 2002]
FlexPath [Amer-Yahia et al., SIGMOD 2004]
• Tree pattern relaxations:
– Leaf node deletion
– Edge generalization
– Subtree promotion
[Figure: a query tree pattern over book, info, author (Dickens) and edition (paperback), with edition optional in a relaxed pattern, and data trees with authors “Charles Dickens”, “C. Dickens” and “Dickens”]
Adaptation of tf.idf to XML
Whirlpool [Marian et al., ICDE 2005]
• Document Collection (Information Retrieval) corresponds to XML Document.
• XML Node: the result is a subtree rooted at a returned node with a given tag and satisfying structural predicates in the query.
• Keyword(s) correspond to Tree Pattern.
• idf
– Information Retrieval: idf (inverse document frequency) is a function of the fraction of documents that contain the keyword(s).
– XML: idf is a function of the fraction of returned nodes that match the query tree pattern.
• tf
– Information Retrieval: tf (term frequency) is a function of the number of occurrences of the keyword in the document.
– XML: tf is a function of the number of ways the query tree pattern matches the returned node.
A Family of XML Scoring Methods [Amer-Yahia et al., VLDB 2005]
• Twig scoring
– High quality
– Expensive computation
• Path scoring
• Binary scoring
– Low quality
– Fast computation
[Figure: the query tree book, info, edition (paperback), author (Dickens), and its decompositions into smaller patterns combined with +]
Outline
• Motivation
• Full-Text Search Languages
• Scoring
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Query Processing
• Open Issues
XIRQL + Relaxation
• XIRQL proposes vague predicates, but it is not clear how to combine them with all of XQuery.
• How to relax all of XQuery, including structured and scalar predicates, is an open issue.
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
Main Issue
• Given: Query keywords
• Compute: Least Common Ancestors (LCAs) that contain query keywords, in ranked order
Naïve Method
Dewey Encoding of IDs [1850s]
XRank: Dewey Inverted List (DIL)
Keyword | Dewey Id | Score | Position List
XQL | 5.0.3.0.0 | 85 | 32
XQL | 8.0.3.8.3 | 38 | 89 91
… | … | … | …
Ricardo | 5.0.3.0.1 | 82 | 38
Ricardo | 8.2.1.4.2 | 99 | 52
… | … | … | …
Sorted by Dewey Id
Store IDs of elements that directly contain keyword
- Avoids space overhead
DIL: Query Processing
• Merge query keyword inverted lists in Dewey ID Order
– Entries with common prefixes are processed together
• Compute Longest Common Prefix of Dewey IDs during the merge
– Longest common prefix ensures most specific results
– Also suppresses spurious results
• Keep top-m results seen so far in output heap
– Calculate rank using two-dimensional proximity metric
– Output contents of output heap after scanning inverted lists
• Algorithm works in a single scan over inverted lists
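The role of the longest common prefix can be shown with a small Python sketch. It is deliberately not the single-scan merge described above: it enumerates all combinations of postings, ignores scores and ranking, and keeps only the deepest common ancestors. The function names are hypothetical, and the Dewey IDs are taken from the DIL example.

```python
from itertools import product

def parse(dewey):
    """'5.0.3.0.0' -> (5, 0, 3, 0, 0)"""
    return tuple(int(part) for part in dewey.split("."))

def longest_common_prefix(a, b):
    """Dewey ID of the deepest common ancestor of a and b (empty if none)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def most_specific_results(inverted_lists):
    """Deepest elements that contain at least one posting of every keyword."""
    postings = [[parse(d) for d in ids] for ids in inverted_lists.values()]
    candidates = set()
    for combo in product(*postings):
        prefix = combo[0]
        for dewey in combo[1:]:
            prefix = longest_common_prefix(prefix, dewey)
        if prefix:                      # empty prefix: postings in different documents
            candidates.add(prefix)
    # Drop a candidate when another candidate lies below it in the tree.
    specific = [c for c in candidates
                if not any(o != c and o[:len(c)] == c for o in candidates)]
    return [".".join(map(str, c)) for c in sorted(specific)]

lists = {
    "XQL": ["5.0.3.0.0", "8.0.3.8.3"],
    "Ricardo": ["5.0.3.0.1", "8.2.1.4.2"],
}
results = most_specific_results(lists)
print(results)  # ['5.0.3.0', '8']
```

Element 5.0.3.0 directly encloses both keywords, while in document 8 the two postings only meet at the root, which a ranking function would be expected to score lower.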
XRank: Ranked Dewey Inverted List (RDIL)
[Figure: the XQL inverted list, sorted by score, with a B+-tree on Dewey Id; the same layout is used for other keywords]
RDIL: Algorithm
• An element may be ranked highly in one list and low in another list
– B+-tree helps search for low ranked element
• When to stop scanning inverted lists?
– Based on Threshold Algorithm [Fagin et al., 2002], which periodically calculates a threshold
– Can stop if we have sufficient results above threshold
– Extension to most specific results
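The stopping rule can be illustrated with a generic Threshold Algorithm sketch. This is not RDIL itself: it omits the extension to most specific results, uses dictionaries in place of the B+-trees for random access, sums per-keyword scores, and treats a missing keyword as score 0. All names and sample scores are hypothetical.

```python
import heapq

def threshold_top_k(lists, k):
    """lists: one list per keyword of (element, score), sorted by descending score."""
    lookup = [dict(lst) for lst in lists]   # random access, standing in for the B+-tree
    seen = {}                               # element -> total score
    max_depth = max(len(lst) for lst in lists)
    for depth in range(max_depth):
        last_scores = []
        for lst in lists:
            if depth < len(lst):
                element, score = lst[depth]            # sorted access
                last_scores.append(score)
                if element not in seen:                # random access to other lists
                    seen[element] = sum(t.get(element, 0.0) for t in lookup)
            else:
                last_scores.append(0.0)
        # No unseen element can score higher than the sum of the last scores read.
        threshold = sum(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top                                 # early stop
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

xql = [("9.0.4", 0.9), ("8.2.1", 0.6), ("10.8", 0.2)]
ricardo = [("8.2.1", 0.8), ("9.0.4", 0.7), ("10.8", 0.1)]
top1 = threshold_top_k([xql, ricardo], k=1)
print(top1[0][0])  # 9.0.4
```

In this run the scan stops after two entries per list: the best total seen so far already exceeds the threshold, so the third entries are never read.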
RDIL: Query Processing
[Figure: inverted lists for XQL and Ricardo, each with a B+-tree on Dewey Id, a temp heap and an output heap; entries P: 9.0.4.2.0 and R with Rank(9.0.4); Dewey IDs shown: 9.0.4.1.2, 8.2.1.4.2, 9.0.5.6, 10.8.3; threshold = Score(P)+Max-Score and threshold = Score(P)+Score(R)]
ID Order vs. Rank Order
• Approaches that combine benefits
• Long ID inverted list, short score inverted list
– HDIL (Guo et al., SIGMOD 2003)
• Chunk inverted list based on score, organize by ID within chunk
– FlexPath (Amer-Yahia et al., SIGMOD 2004)
– SVR (Guo et al., ICDE 2005)
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
XSearch Technique
• Given: An interconnection relationship R between nodes (semantic relationship)
– R is reflexive and symmetric
• Node interconnection index
– Given two nodes n and n’ in a document d, find if (n, n’) are in R*
• Use dynamic programming to compute closure
– Online vs. offline
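One standard dynamic program for a closure is Warshall's algorithm, sketched below. Whether XSearch uses exactly this formulation is an assumption; the sketch also assumes that R* denotes the reflexive-transitive closure of R, and the node names are hypothetical.

```python
def closure(nodes, related):
    """Reflexive-transitive closure R* of a symmetric relation (Warshall's algorithm)."""
    # reach[a][b] is True when (a, b) is known to be in R*; R is reflexive.
    reach = {a: {b: a == b for b in nodes} for a in nodes}
    for a, b in related:                 # R is symmetric
        reach[a][b] = True
        reach[b][a] = True
    for k in nodes:                      # allow k as an intermediate node
        for i in nodes:
            if reach[i][k]:
                for j in nodes:
                    if reach[k][j]:
                        reach[i][j] = True
    return reach

nodes = ["title", "author", "book", "price"]
related = [("title", "book"), ("author", "book")]
index = closure(nodes, related)
print(index["title"]["author"])  # True
print(index["title"]["price"])   # False
```

Computing the full table offline costs cubic time and quadratic space in the number of nodes, which is the trade-off behind the online vs. offline question.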
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
XXL Indexing
• Element Path Index (EPI)
– Evaluates simple path expressions
• Element Content Index (ECI)
– Traditional inverted list (but replicates nested elements)
• Ontology Index (OI)
– Lookup similar concepts (for evaluating ~e)
– Returned in ranked order
Myaeng et al. [SIGIR 1994]
Keyword | Document ID | Element ID | Element Tag | Probability | Element Tag | Probability | Element Tag | Probability
XQL | 5 | 85 | act | 0.3 | play | 0.2 | plays | 0.1
… | …
Integrating Structure and IL [Kaushik et al., SIGMOD 2004]
[Figure: structure index with nodes book, info, author, edition, title numbered 1 to 5; B+ tree]
Keyword | Document ID | Start ID | End ID | Depth | Index ID | Score
XQL | 5 | 85 | 99 | 3 | 5 | 0.9
… | …
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
– Simple Keyword Search
– Tags + Keyword Search
– Path Expressions + Keyword Search
– XQuery + Complex Full-Text Search
• Open Issues
Scoring Functions Critical for Top-k Query Processing
• Top-k answer quality depends on scoring function.
• Efficient top-k query processing requires scoring function to be:
– Monotone.
– Fast to compute.
Structural Join Relaxation
//book[./info[./author ftcontains “Dickens”] [./edition ftcontains “paperback”]]
[Figure: two join plans over book, info, author and edition. Exact predicates: pc(book, info), pc(info, author), pc(info, edition), contains(author, ”Dickens”), contains(edition, ”paperback”). Relaxed predicates: pc(book, info) or ad(book, info); pc(info, edition) or ad(book, edition)]
Quark/GalaTex Architecture
Outline
• Motivation
• Full-Text Search Languages
• Scoring
• Query Processing
• Open Issues
System Architecture
[Figure: Integration Layer, XQuery Engine, IR Engine]
System Architecture
[Figure: XQuery + IR Engine]
Quark/GalaTex use this architecture
Structural Relaxation
FOR $b SCORE $s in FUZZY /pub/book[. ftcontains “Usability” with stems]
ORDER BY $s
RETURN $b
Search Over Views
[Figure: Data Source 1]
Other Open Issues
• Extensive experimental evaluation of scoring functions and ranking algorithms for XML (INEX).
• Joint scoring on full-text and scalar predicates.
• Score-aware algebra for XML for the joint optimization of queries on both structure and text.
Backup Slides
Why not use SQL/MM (or variant)?
• Key difference: No strict demarcation between structured and text data in XML
– Can issue structured and text queries over same data
• Find books with year > 1995
• Find books containing keyword “1998”
– Can embed structured queries in text queries
• Find books that contain the keywords that occur in the title of Richard Dawkins’ books
• Other important differences
– XML/XQuery data model
– Composability of full-text primitives
Scoring Function (monotonicity)
• Required properties:
– Exact matches should be scored higher than relaxed matches (idf)
– Returned elements with several matches should be ranked higher than those with fewer matches (tf)
• How to combine tf and idf?
– tf.idf, as used by IR, violates above properties
– Ranking based on idf, then breaking ties using tf satisfies the properties
[Figure: two book patterns (a) and (b) over info, edition (paperback), author (Dickens) and title (Great Expectations); labels: score(a) >= score(b) <=]
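A small numeric example makes the difference between the two combinations concrete. The idf and tf values below are hypothetical and chosen only so that the product ordering and the lexicographic ordering disagree.

```python
# Hypothetical matches: exact matches carry a higher idf than the relaxed match.
matches = [
    {"id": "exact, 1 match", "idf": 2.0, "tf": 1},
    {"id": "exact, 2 matches", "idf": 2.0, "tf": 2},
    {"id": "relaxed, 3 matches", "idf": 1.0, "tf": 3},
]

# tf.idf as a product: a relaxed match with a high tf can overtake an exact one.
by_product = sorted(matches, key=lambda m: m["idf"] * m["tf"], reverse=True)

# idf first, tf only to break ties: exact matches always come first.
by_idf_then_tf = sorted(matches, key=lambda m: (m["idf"], m["tf"]), reverse=True)

product_order = [m["id"] for m in by_product]
lexicographic_order = [m["id"] for m in by_idf_then_tf]
print(product_order)
# ['exact, 2 matches', 'relaxed, 3 matches', 'exact, 1 match']
print(lexicographic_order)
# ['exact, 2 matches', 'exact, 1 match', 'relaxed, 3 matches']
```

With the product, the relaxed match (score 3.0) is ranked above an exact match (score 2.0), which breaks the first required property; the lexicographic order keeps both properties.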


