993ca62b376bb122a1c9d96b2a9d8b14.ppt
- Number of slides: 15
Automated Ranking of Database Query Results
Sanjay Agrawal, Surajit Chaudhuri, Gautam Das, Aristides Gionis
Presented by Archana Vijayalakshmanan, 4/11/2006
Contents
- Introduction
- Different ranking functions
- Breaking ties
- Implementation
- Conclusion
Introduction
- Automated ranking of query results is a popular aspect of IR.
- Database systems support only a Boolean query model, which leads to two problems:
  - Empty answers
  - Many answers
- Automated ranking of query results takes a user query and maps it to a Top-K query with a ranking function.
Automated ranking functions for the ‘Empty Answers Problem’
- IDF Similarity
- QFIDF Similarity
IDF Similarity
- IR technique
  - Q = set of keywords
  - IDF(w) = log(N / F(w))
  - TF(w, d) = frequency of occurrence of w in d
  - Cosine similarity between query and document is the normalized dot product of the two corresponding vectors.
- Database (only categorical attributes)
  - An <attribute, value> pair plays the role of a word w; a tuple plays the role of a document d.
  - T = <t1, ..., tm>, Q = <q1, ..., qm>
  - The query condition has the form "WHERE A1 = q1 AND ..."
  - IDFk(t) = log(n / Fk(t)), where n is the number of tuples in the database and Fk(t) is the frequency of tuples in the database where Ak = t.
  - Similarity between T and Q is the sum of the corresponding similarity coefficients over all attributes: SIM(T, Q) = sum over k of Sk(tk, qk).
- Compared with the similarity function known as cosine similarity with TF-IDF weightings:
  - The dot product is un-normalized.
  - TF is irrelevant.
  - The resulting similarity function is known as IDF similarity.
- E.g. query = {CONVERTIBLE, NISSAN}
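The IDF similarity for categorical attributes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table `cars`, the attribute names `MFR` and `TYPE`, and the function names are all hypothetical, and the per-attribute coefficient is assumed to be IDFk(qk) when tk = qk and 0 otherwise.

```python
import math
from collections import Counter

def build_idf(rows, attributes):
    """Return {attribute: {value: IDF}} using IDFk(t) = log(n / Fk(t))."""
    n = len(rows)
    idf = {}
    for a in attributes:
        freq = Counter(row[a] for row in rows)  # Fk(t) for every value t of attribute a
        idf[a] = {t: math.log(n / f) for t, f in freq.items()}
    return idf

def idf_similarity(row, query, idf):
    """SIM(T, Q): sum of IDFk(qk) over the query attributes where tk = qk."""
    return sum(idf[a].get(q, 0.0) for a, q in query.items() if row[a] == q)

# Hypothetical toy table, not data from the paper.
cars = [
    {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    {"MFR": "NISSAN", "TYPE": "SEDAN"},
    {"MFR": "TOYOTA", "TYPE": "SEDAN"},
    {"MFR": "HONDA", "TYPE": "SEDAN"},
]
idf = build_idf(cars, ["MFR", "TYPE"])
query = {"TYPE": "CONVERTIBLE", "MFR": "NISSAN"}
scores = [idf_similarity(row, query, idf) for row in cars]
```

In this toy table CONVERTIBLE is rarer than NISSAN, so matching on TYPE contributes more to the score than matching on MFR.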
Generalizations of IDF similarity
- For numeric data
  - It is inappropriate to use the previous similarity coefficients, because the frequency of a numeric value depends on nearby values.
  - Discretizing a numeric attribute into a categorical one is problematic.
  - Solution: let {t1, t2, ..., tn} be the values of attribute A. For every value t, the frequency is replaced by the sum of "contributions" to t from every other point ti, with the contributions modeled as Gaussian distributions.
  - The similarity function includes a bandwidth parameter.
- For ranges/sets of values
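The slide text does not preserve the numeric formulas, so the following is only a sketch under a stated assumption: a Gaussian kernel with bandwidth `h` replaces the exact-match frequency, giving IDF(t) = log(n / sum_i exp(-0.5 * ((ti - t) / h)^2)) and S(t, q) = exp(-0.5 * ((t - q) / h)^2) * IDF(q). The function names and the `prices` data are hypothetical.

```python
import math

def numeric_idf(t, values, h):
    """Assumed form: log(n / sum of Gaussian contributions of every ti to t).

    Assumes t lies close enough to the data that the sum does not underflow to 0.
    """
    n = len(values)
    density = sum(math.exp(-0.5 * ((ti - t) / h) ** 2) for ti in values)
    return math.log(n / density)

def numeric_similarity(t, q, values, h):
    """Assumed form: Gaussian closeness of t to q, weighted by IDF(q)."""
    return math.exp(-0.5 * ((t - q) / h) ** 2) * numeric_idf(q, values, h)

# Hypothetical attribute values and bandwidth.
prices = [10.0, 10.5, 11.0, 30.0]
h = 1.0

idf_crowded = numeric_idf(10.5, prices, h)   # many nearby values
idf_isolated = numeric_idf(30.0, prices, h)  # no nearby values
```

Under this assumption a value in a crowded region gets a lower IDF than an isolated one, which is the smooth analogue of "frequent values are less informative".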
QF Similarity
- The importance of attribute values is determined by the frequency of their occurrence in the workload.
- For categorical data:
  - Query frequency QF(q) = RQF(q) / RQFMax, where RQF(q) is the raw frequency of occurrence of value q of attribute A in the query strings of the workload, and RQFMax is the raw frequency of the most frequently occurring value in the workload.
  - S(t, q) = QF(q) if q = t, and 0 otherwise.
- Similarity between pairs of different categorical attribute values can also be derived from the workload, e.g. to find S(TOYOTA, HONDA):
  - Analyze the IN clauses of queries: if a certain pair of values often occurs together in the workload, the values are similar, e.g. queries with C as "MFR IN {TOYOTA, HONDA, NISSAN}".
  - Several recent queries in the workload by a specific user repeatedly requesting TOYOTA and HONDA.
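A minimal sketch of the categorical QF computation, assuming the workload has already been reduced to the list of values requested for one attribute. The list `workload_mfr` and the function names are hypothetical.

```python
from collections import Counter

def build_qf(workload_values):
    """QF(q) = RQF(q) / RQFMax for a single attribute."""
    rqf = Counter(workload_values)   # raw query frequency of each value
    rqf_max = max(rqf.values())      # raw frequency of the most requested value
    return {q: f / rqf_max for q, f in rqf.items()}

def qf_similarity(t, q, qf):
    """S(t, q) = QF(q) if t = q, else 0."""
    return qf.get(q, 0.0) if t == q else 0.0

# Hypothetical workload: values of MFR requested in past queries.
workload_mfr = ["TOYOTA", "TOYOTA", "TOYOTA", "TOYOTA", "HONDA", "HONDA", "NISSAN"]
qf = build_qf(workload_mfr)
```

A value that never appears in the workload gets a similarity of 0 here, which is the weakness the QFIDF variant addresses.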
QFIDF Similarity
- QF is purely workload-based, which is a big disadvantage when the workload is insufficient or unreliable.
- For QFIDF similarity:
  - S(t, q) = QF(q) * IDF(q) when t = q, where QF(q) = (RQF(q) + 1) / (RQFMax + 1).
  - Thus we get a small non-zero value even if a value is never referenced in the workload.
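The smoothed QF and its combination with IDF can be sketched like this. The column values, the workload list, and the function names are hypothetical; the formulas follow the slide, with IDF(q) = log(n / F(q)) for categorical data.

```python
import math
from collections import Counter

def qfidf_table(db_values, workload_values):
    """Return {value: QF(value) * IDF(value)} for one categorical attribute."""
    n = len(db_values)
    db_freq = Counter(db_values)
    rqf = Counter(workload_values)
    rqf_max = max(rqf.values(), default=0)
    table = {}
    for t, f in db_freq.items():
        qf = (rqf[t] + 1) / (rqf_max + 1)  # Counter yields 0 for unseen values
        table[t] = qf * math.log(n / f)
    return table

def qfidf_similarity(t, q, table):
    """S(t, q) = QF(q) * IDF(q) when t = q, else 0."""
    return table.get(q, 0.0) if t == q else 0.0

# Hypothetical MFR column and workload.
db_mfr = ["NISSAN", "NISSAN", "TOYOTA", "HONDA"]
workload_mfr = ["TOYOTA", "TOYOTA", "TOYOTA", "HONDA"]
table = qfidf_table(db_mfr, workload_mfr)
```

NISSAN never occurs in this workload, yet its score stays positive because of the +1 smoothing.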
Breaking ties
- Problem: many tuples may tie for the same similarity score and get ordered arbitrarily. This arises in both the empty-answers and the many-answers problem.
- Solution: determine weights of the missing attribute values that reflect their "global importance" for ranking purposes by using workload information.
  - Extend QF similarity and use the resulting quantity to break ties.
  - Extending IDF similarity by using IDF values presents challenges.
Implementation
- Pre-processing component
- Query-processing component
Pre-processing component
- Compute and store a representation of the similarity function in auxiliary database tables.
  - For categorical data, compute IDF(t) (resp. QF(t)) from the frequency of occurrences of values in the database (resp. workload) and store the results in auxiliary database tables.
  - For numeric data, an approximate representation of the smooth function IDF() (resp. QF()) is stored, so that the function value can be retrieved at runtime.
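The categorical pre-processing step can be sketched with an in-memory SQLite database. The table `cars`, its columns, and the layout of the auxiliary table `idf_aux` are hypothetical choices for illustration. The logarithm is computed in Python because SQLite's math functions are an optional compile-time feature.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (tid INTEGER PRIMARY KEY, mfr TEXT, body_type TEXT)")
conn.executemany(
    "INSERT INTO cars (mfr, body_type) VALUES (?, ?)",
    [("NISSAN", "CONVERTIBLE"), ("NISSAN", "SEDAN"), ("TOYOTA", "SEDAN"), ("HONDA", "SEDAN")],
)

# Hypothetical auxiliary table: one row per (attribute, value) pair.
conn.execute(
    "CREATE TABLE idf_aux (attr TEXT, value TEXT, idf REAL, PRIMARY KEY (attr, value))"
)

n = conn.execute("SELECT COUNT(*) FROM cars").fetchone()[0]
for attr in ("mfr", "body_type"):
    # Column names come from the fixed tuple above, so interpolating them is safe here.
    rows = conn.execute(f"SELECT {attr}, COUNT(*) FROM cars GROUP BY {attr}").fetchall()
    conn.executemany(
        "INSERT INTO idf_aux VALUES (?, ?, ?)",
        [(attr, value, math.log(n / freq)) for value, freq in rows],
    )

def lookup_idf(attr, value):
    """Runtime lookup of a precomputed IDF; values absent from the table score 0."""
    row = conn.execute(
        "SELECT idf FROM idf_aux WHERE attr = ? AND value = ?", (attr, value)
    ).fetchone()
    return row[0] if row else 0.0
```

A QF table could be built the same way from a table of logged queries instead of the data table.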
Query processing component
- Main task: given a query Q and an integer K, retrieve the Top-K tuples from the database using one of the ranking functions.
  - The ranking function is extracted in the pre-processing phase.
  - SQL-DBMS functionality is used for solving the top-K problem.
- Handling a simpler query processing problem
  - Input: table R with M categorical columns, key column TID, condition C that is a conjunction of the form Ak = qk ..., and integer K.
  - Output: the top-K tuples of R most similar to Q.
  - Similarity function: overlap similarity.
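The simpler problem stated above can be solved by a full scan, which makes a useful baseline. This sketch assumes that overlap similarity means each attribute contributes a non-negative weight when tk = qk and 0 otherwise; the weights, the table contents, and the function names are hypothetical.

```python
import heapq

def overlap_similarity(row, query, weights):
    """Sum the weight of qk over the query attributes where tk = qk."""
    return sum(weights[a].get(q, 0.0) for a, q in query.items() if row.get(a) == q)

def top_k_scan(table, query, weights, k):
    """Score every tuple and return the k best as (score, TID) pairs, best first."""
    scored = ((overlap_similarity(row, query, weights), tid) for tid, row in table.items())
    return heapq.nlargest(k, scored)

# Hypothetical table R keyed by TID, and hypothetical precomputed weights.
R = {
    1: {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    2: {"MFR": "NISSAN", "TYPE": "SEDAN"},
    3: {"MFR": "TOYOTA", "TYPE": "SEDAN"},
}
weights = {"MFR": {"NISSAN": 0.5}, "TYPE": {"CONVERTIBLE": 1.5}}
query = {"TYPE": "CONVERTIBLE", "MFR": "NISSAN"}
result = top_k_scan(R, query, weights, k=2)
```

The scan touches every tuple, so its cost grows with the table size regardless of K.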
Implementation of Top-K operator
- Traditional approach
- Index-based approach
  - The overlap similarity function satisfies the following monotonic property: if T and U are two tuples such that for all k, Sk(tk, qk) < Sk(uk, qk), then SIM(T, Q) < SIM(U, Q).
  - This allows the Threshold Algorithm (TA) to be adapted. To adapt TA, sorted and random access methods were implemented.
  - The algorithm performs sorted access for each attribute, retrieves the complete tuples with the corresponding TIDs by random access, and maintains a buffer of the Top-K tuples seen so far.
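A generic Threshold Algorithm loop over per-attribute sorted lists can be sketched as follows. This is not the authors' index-based implementation: the sorted lists are built in memory here, whereas a real system would read them from indexes, and the table, weights, and function names are hypothetical.

```python
import heapq

def threshold_top_k(table, query, weights, k):
    """Return the k best (score, TID) pairs, stopping once no unseen tuple can win."""
    attrs = list(query)

    def coeff(a, row):
        # Per-attribute coefficient: weight of qk if tk = qk, else 0.
        return weights[a].get(query[a], 0.0) if row[a] == query[a] else 0.0

    # Sorted access lists: per attribute, (coefficient, TID) in descending order.
    lists = {
        a: sorted(((coeff(a, row), tid) for tid, row in table.items()), reverse=True)
        for a in attrs
    }
    seen = set()
    buffer = []  # min-heap holding at most k (score, TID) pairs
    for depth in range(len(table)):
        threshold = 0.0
        for a in attrs:
            s, tid = lists[a][depth]  # sorted access
            threshold += s
            if tid not in seen:
                seen.add(tid)
                row = table[tid]      # random access by TID
                entry = (sum(coeff(b, row) for b in attrs), tid)
                if len(buffer) < k:
                    heapq.heappush(buffer, entry)
                elif entry > buffer[0]:
                    heapq.heapreplace(buffer, entry)
        # Any unseen tuple scores at most `threshold`, so stop when the buffer beats it.
        if len(buffer) == k and buffer[0][0] >= threshold:
            break
    return sorted(buffer, reverse=True)

# Hypothetical table keyed by TID, and hypothetical weights.
R = {
    1: {"MFR": "NISSAN", "TYPE": "CONVERTIBLE"},
    2: {"MFR": "NISSAN", "TYPE": "SEDAN"},
    3: {"MFR": "TOYOTA", "TYPE": "SEDAN"},
    4: {"MFR": "HONDA", "TYPE": "SEDAN"},
}
weights = {"MFR": {"NISSAN": 0.5}, "TYPE": {"CONVERTIBLE": 1.5}}
query = {"TYPE": "CONVERTIBLE", "MFR": "NISSAN"}
result = threshold_top_k(R, query, weights, k=2)
```

The early exit relies on monotonicity: the sum of the coefficients at the current depth bounds the score of every tuple not yet seen.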
Index-based TA (ITA)
- Sorted access
- Random access
Conclusion
- TF-IDF based techniques were extended to numerical and mixed data.
- Workload tracking was used as a weak form of collaborative filtering.