
A NEW TOPIC: QUERIES WITH GEO-INFORMATION
WEB & MOBILE GROUP
Zheng Huo

SIX RELATED TOPICS
• Spatial pattern mining
  – Mining Interesting Locations and Travel Sequences from GPS Trajectories [WWW 09]
  – WhereNext: a Location Predictor on Trajectory Pattern Mining [SIGKDD 09]
  – Migration Motif: A Spatial-Temporal Pattern Mining Approach for Financial Markets [SIGKDD 09]
• Social network
• Opinion
  Ruxia Ma, Jing Zhao
  – Rated Aspect Summarization of Short Comments [WWW 09]
  – How Opinions are Received by Online Communities: A Case Study on Amazon.com Helpfulness Votes [WWW 09]
  – OpinionMiner: A Machine Learning System for Web Opinion Mining and Extraction [SIGKDD 09]
• Geo + query intention
  Xiangmei Hu, Zheng Huo
  – Discovering Users' Specific Geo Intention in Web Search [WWW 09]
  – A Probabilistic Topic-Based Ranking Framework for Location-Sensitive Domain Information Retrieval [SIGIR 09]
  – Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects [VLDB 09]
  – Keyword Search in Spatial Databases: Towards Searching by Document [ICDE 09]
• Geographic + image
• kNN applications

OUTLINE
• Background
• Overview
• Methods
  – SDIR
  – GIU method
  – Top-k
  – Others
• Conclusions & future work

BACKGROUND
• Many web queries contain geo-information
  – About 30% of queries may have geo intent; about half of them contain explicit geo-information.
• Example queries: "Italian restaurant", "Car dealer", "L.A. hotel"
• About 13% of queries contain a place name
  – 84% of them have explicit city info.
  – 2.6% have state info.
  – 13.4% have country info.
• Can be used in many fields, such as
  – Recommendation systems
  – Improving users' search experience
  – Advertisement matching

BACKGROUND (cont'd)
• Why can't traditional methods solve this problem well?
• Traditional pipeline: Q(location, terms) → scores of "textual relevance" and scores of "spatial relevance" → hybrid score → ranking
  – Spatial relevance is computed from the Euclidean distance, which is not suitable in all cases.
  – The hybrid score combines the two with a linear function, which is not the best method.
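A minimal sketch of the traditional pipeline criticized on this slide, assuming a text score in [0, 1] and planar coordinates. The function names, the weight `alpha`, the cutoff `max_dist` and the distance-to-score mapping are illustrative assumptions, not taken from the slides.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def hybrid_score(text_score, query_loc, doc_loc, alpha=0.5, max_dist=10.0):
    """Linear combination of textual and spatial relevance.

    Spatial relevance falls linearly from 1 (same point) to 0 (max_dist or
    farther). Both the mapping and the default weights are assumptions.
    """
    spatial = max(0.0, 1.0 - euclidean(query_loc, doc_loc) / max_dist)
    return alpha * text_score + (1.0 - alpha) * spatial


# A nearby document with a weak text match can outrank a distant strong match.
near_weak = hybrid_score(0.2, (0.0, 0.0), (0.0, 1.0))
far_strong = hybrid_score(0.8, (0.0, 0.0), (0.0, 9.0))
```

With these made-up inputs the nearer document wins purely on distance, which is the behavior the slide calls unsuitable for cases where preference does not follow distance.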


OVERVIEW
[Diagram: taxonomy of queries with geo-information and the methods covered.]
• Explicit geo-information: queries like "Beijing Hotels", "Paris toggery"
• Implicit geo-information: local geo-info, neighborhood geo-info, specific region
  – Example queries shown: "Italian Restaurant", "Dentist"; "Car dealer", "Real estate"; "State Maps", "Hotels"
• Methods shown: SDIR method, GIU methods, top-k query, spatial query, others


A TOPIC-BASED METHOD: SDIR
• An example
  – Queries sent to a search engine or IR system:
    q1: "Los Angeles basketball game"
    q2: "Houston basketball game"
    q3: "Boston basketball game"
  – A piece of news among the web pages & documents: an NBA match review of the game between the L.A. Lakers and the Rockets (from Houston), in which some other teams such as the Boston Celtics are mentioned briefly.
A Probabilistic Topic-based Ranking Framework for Location-sensitive Domain Information Retrieval, in SIGIR'09

SDIR (cont'd)
• Problem definition
  – DEFINITION 1. A spatial query is expressed as $q = (q_S, q_T)$, in which $q_S$ represents the geographical condition implied by $q$ and $q_T$ represents the search terms, excluding location names.
  – DEFINITION 2. When evaluated against spatial queries, a document can be viewed as $d = (d_S, d_T)$, in which $d_S$ is the list of location names found in $d$ and $d_T$ represents the document text.
  – The ranking function is defined as $F(q, d) = F(q_T, q_S, d_T, d_S)$. Assuming that spatial relevance and textual relevance are independent, it can be written as $F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$.
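Definition 1 can be illustrated with a small sketch that splits a query into its location part and its term part. The `split_query` helper and the list of place names are hypothetical; the slides do not say how location names are detected.

```python
def split_query(query, place_names):
    """Split a query into (q_S, q_T): matched place names and remaining terms.

    Matching is done on whole lower-cased tokens, trying longer place names
    first so that "los angeles" is matched as one location.
    """
    tokens = query.lower().split()
    places = sorted(
        (tuple(p.lower().split()) for p in place_names if p.strip()),
        key=len,
        reverse=True,
    )
    q_s, q_t = [], []
    i = 0
    while i < len(tokens):
        for place in places:
            if tuple(tokens[i:i + len(place)]) == place:
                q_s.append(" ".join(place))
                i += len(place)
                break
        else:
            # No place name starts at this token, so it is a search term.
            q_t.append(tokens[i])
            i += 1
    return q_s, q_t


q_s, q_t = split_query("Los Angeles basketball game", ["Los Angeles", "Houston", "Boston"])
```

A query without any known place name yields an empty `q_s`, which corresponds to the implicit geo-information case discussed in the overview.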

SDIR (cont'd)
• Framework of SDIR (Spatial-related Domain Information Retrieval)
  – Topic: a generalized abstraction of document contents. Each NBA team is a topic.
  – Topic layer: sits between the query layer and the document layer and consists of topics.
  – Topic center: the location that a topic is about. For the team Rockets, Houston is the topic center.
  – Q-T relevance: $\phi(q, t)$ evaluates the relevance between a query and a topic.
  – D-T relevance: $\psi(d, t)$ evaluates the relevance between a document and a topic.

SDIR (cont'd)
• Some formulas
  – $F(q, d) = F_T(q_T, d_T) \oplus F_S(q_S, d_S)$
  – $F(q, d) = \sum_j \phi(q, t_j)\, \psi(d, t_j)\, \omega_{t_j}(q, d)$
  – Notes on $\omega_{t_j}(q, d)$:
    1. It works directly between the query and the document.
    2. Popular IR metrics can be used here, such as tf-idf and the cosine function.
    3. Here, the authors used an extended version of the tf-idf method.
  – With $\phi(q, t) = p(t \mid q)$ and $\psi(d, t) = p(d \mid t)$: $F(q, d) = \sum_j p(d \mid t_j)\, p(t_j \mid q)\, \omega_{t_j}(q, d)$
  – By Bayesian theory: $F(q, d) \propto \sum_j p(t_j \mid q_S)\, p(t_j \mid q_T)\, p(t_j \mid d_S)\, p(t_j \mid d_T)\, \omega_{t_j}(q, d) / p(t_j)$
  – Slide callouts on the last formula: "Obtained from topic model"; "Can be directly obtained from the training set".
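The topic-sum form of the ranking function can be sketched in a few lines. This is only an illustration of the sum over topics of phi, psi and omega: the `rank_score` name, the dictionary layout and all probability values are made up, and omega defaults to 1 when no direct query-document weight is given.

```python
def rank_score(phi, psi, omega=None):
    """Sum phi(q, t) * psi(d, t) * omega_t(q, d) over the shared topics.

    phi and psi map topic name -> relevance; omega maps topic name -> weight
    and is treated as 1.0 for topics it does not list (an assumption).
    """
    omega = omega or {}
    return sum(phi[t] * psi[t] * omega.get(t, 1.0) for t in phi.keys() & psi.keys())


# Hypothetical relevances for the match review between the Lakers and the Rockets.
psi_review = {"Rockets": 0.45, "Lakers": 0.45, "Celtics": 0.10}
phi_houston = {"Rockets": 0.90, "Lakers": 0.05, "Celtics": 0.05}
phi_boston = {"Rockets": 0.05, "Lakers": 0.05, "Celtics": 0.90}

score_houston = rank_score(phi_houston, psi_review)
score_boston = rank_score(phi_boston, psi_review)
```

Under these assumed numbers the review scores higher for the Houston query than for the Boston query, even though Boston is mentioned in the text, which matches the motivation of the example with q1 to q3.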

SDIR (cont'd)
• This method is domain-based: the authors trained a model for the domain "NBA basketball games". The domain is location-related because most fans are interested in local teams.
• How to learn the topic model?
  – Determine which domain to focus on.
  – Data collection
    Topic documents: crawl data from well-supported web sites, including the NBA official site, ESPN, and Yahoo! Sports.
    Fans: at least 10,000 geo-records for each team.
  – Find a suitable distribution model: use a GP (Gaussian process) classifier to model it.
    1. It returns probabilistic results for class labels, which matches the ranking purpose well.
    2. GP is non-parametric and does not place prior assumptions.
    3. GP is a kernel machine, which is highly flexible and configurable.

SDIR (cont'd)
• Overall procedure
  – Query (q): $q_S$ is looked up in LTS and $q_T$ in LTT to obtain $\phi(q, t_j)$.
  – Document (d): the inverted index gives $\psi(d, t_j)$.
  – $\phi(q, t_j)$, $\psi(d, t_j)$ and $\omega_{t_j}(q, d)$ are combined into $F(q, d)$.
• Geographical Influence Lookup Table (LTS): divides the entire geo-area into small grids of the same size.
    grid 1    p(t1|g1)    p(t2|g1)    ……
    grid 2    p(t1|g2)    p(t2|g2)    ……
    ……
• Term-Topic Lookup Table (LTT): for example, given m topics
    w1        p(t1|w1)    p(t2|w1)    ……
    w2        p(t1|w2)    p(t2|w2)    ……
    ……
• Inverted index
    d1        P(t1|d1)    P(t2|d1)    ……
    d2        P(t1|d2)    P(t2|d2)    ……
    ……

SDIR (cont'd)
• Implementation
  – Data set: the NBA topic is taken as an example.
  – Training set: documents crawled from ESPN/NBA team pages are labeled with the corresponding teams. At least 10,000 records for each team.
  – Geo-grid: the US main territory is cut into small square grids of 0.2° × 0.2° each.
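The 0.2° × 0.2° geo-grid can be sketched as a mapping from a coordinate to a cell index. The `grid_cell` helper and the choice of indexing cells from latitude 0 and longitude 0 are assumptions made for illustration; the slides only give the cell size.

```python
import math

CELL_DEG = 0.2  # cell size in degrees, as stated on the slide


def grid_cell(lat, lon, cell_deg=CELL_DEG):
    """Return the (row, column) index of the square cell containing a point.

    Indices count cells from latitude 0 and longitude 0, so western
    longitudes give negative column indices.
    """
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


# Two nearby points (roughly in the Houston area) fall into the same cell.
cell_a = grid_cell(29.76, -95.37)
cell_b = grid_cell(29.70, -95.39)
```

A cell index of this kind could serve as the key of a table like LTS, whose rows hold one topic distribution per grid.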

SDIR (cont'd)
[Figures: 2-team distributions, Celtics (+1) vs. Bulls (-1); 5-team distributions, Celtics, Bulls, Rockets, Lakers, Suns.]

SDIR (cont'd)
• Location: simulate a user at 4 locations.
• Query: "MVP" (implicit geo-info).
• Euclidean distance is not suitable here: people from Pittsburgh prefer Boston to Cleveland, although Cleveland is much nearer.

SDIR (cont'd)
• Pros and cons
  – High ranking quality on queries with geo-information.
  – Suitable for both explicit and implicit geo queries.
  – BUT it is domain-based: each topic model must be trained separately.
  – Each topic must have exactly one center; the method can't deal with multiple centers in one topic.


GIU METHOD
• Overview of the system
Discovering Users' Specific Geo Intention in Web Search, in WWW'09

GIU METHOD (cont'd)
• Classifier I: detect implicit geo intent
  – [Diagram: Qc, "Use WOE tool", Qnc; for the city San Francisco, a frequency table of the queries "Cheap hotel", "49ers", "Zoo", "Pizza" with the values 200, 150, 125, 100.]
  – For each city $C_k$, build a bigram language model.
  – $Q = w_1 \cdots w_n$, where the $w_i$ are the strings that compose the query. The probability of each word is conditioned on the identity of the previous word.
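A per-city bigram language model of the kind described here can be sketched as follows. The `BigramLM` class, the start marker `<s>`, the add-one smoothing and the training queries are all assumptions made for illustration; the slides do not state how the authors smooth or train their models.

```python
import math
from collections import Counter


class BigramLM:
    """Bigram model: each word is conditioned on the previous word."""

    def __init__(self):
        self.bigrams = Counter()   # (previous word, word) -> count
        self.context = Counter()   # previous word -> count
        self.vocab = set()

    def train(self, queries):
        for query in queries:
            words = ["<s>"] + query.lower().split()
            self.vocab.update(words)
            for prev, cur in zip(words, words[1:]):
                self.bigrams[(prev, cur)] += 1
                self.context[prev] += 1

    def log_prob(self, query):
        """Log P(Q | city) with add-one smoothing (an assumed choice)."""
        words = ["<s>"] + query.lower().split()
        v = len(self.vocab) + 1  # one extra slot for unseen words
        return sum(
            math.log((self.bigrams[(prev, cur)] + 1) / (self.context[prev] + v))
            for prev, cur in zip(words, words[1:])
        )


# One model per city, trained on made-up queries.
sf = BigramLM()
sf.train(["49ers tickets", "golden gate bridge"])
ny = BigramLM()
ny.train(["yankees tickets", "broadway shows"])
```

Comparing `sf.log_prob("49ers tickets")` with `ny.log_prob("49ers tickets")` then indicates which city's model explains the query better.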

GIU METHOD (cont'd)
• City language model
  – Calculate the posterior probability. [Formula not preserved; slide callouts: "Uniform distribution" and "Obtained from last formula".]
  – Attention! Once the city language model is built, "a city" refers to a city in the city language model, not the geographic one. A high probability means that the query is related to this city, not that the query was issued from that city.
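The posterior formula itself is not preserved in this text. A plausible reading of the callouts "Uniform distribution" and "Obtained from last formula" is Bayes' rule with a uniform prior over cities, which is what the sketch below assumes; the `city_posterior` helper and the likelihood values are hypothetical.

```python
import math


def city_posterior(log_likelihoods):
    """Turn per-city log P(Q | city) into P(city | Q) under a uniform prior.

    With a uniform prior the prior cancels, so the posterior is the
    likelihood normalized over all cities. Subtracting the maximum before
    exponentiating avoids underflow for long queries.
    """
    top = max(log_likelihoods.values())
    weights = {city: math.exp(ll - top) for city, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {city: w / total for city, w in weights.items()}


posterior = city_posterior({"San Francisco": math.log(0.03), "New York": math.log(0.01)})
```

Here the first city receives three quarters of the probability mass because its assumed likelihood is three times larger.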

GIU METHOD (cont'd)
• Overall data description
  – Three learning tasks
    • Classifier I: detecting implicit geo queries
    • Classifier II: discriminating the different localization capabilities of geo queries: local geo intent, neighbor region geo intent, etc.
    • City language models: predicting geo entities related to a query

GIU METHOD (cont'd)
• Implementation
  – Use real-world web search logs from Yahoo!
  – Training subset I
    • Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries.
    • All the explicit geo queries in the training set are used to generate the city language model (CLM).

GIU METHOD (cont'd)
• Generating labels
  – Step 1: get the clicked URL (domain name) for each query.
  – Step 2: identify queries in DN+.
  – Step 3: identify queries in DN-.
  – Step 4: take the non-location parts of the positive samples as the final implicit geo intent queries.
  – There are 67 DNs in DN+ and 64 DNs in DN-.
  – Randomly sample 20,000 implicit geo queries and 20,000 non-geo queries to train the classifiers.

GIU METHOD (cont'd)
• Evaluate the classifiers

GIU METHOD (cont'd)
• Evaluating Classifier II
  – Input: implicit geo queries.
  – Classifier II discriminates LG, NRG (labeled NG in the diagram) and RG.
  – The result of the classification forms training subset II.

GIU METHOD (cont'd)
• Evaluation of the trained models
  – The training data is training subset II.
  – [Result panels: low-dimensional features; all features.]
  – The classifiers classify the queries generated at the city level. The result of this step forms training subset III / testing subset III.

GIU METHOD (cont'd)
• Location-specific query discovery
  – A threshold ta is used; ta is tuned with training subset III.

GIU METHOD (cont'd)
• Conclusions of GIU method
  – WOE tool
  – Detect the implicit geo intent, using the probability of the co-occurrence of a city and a query. The CLM is generated here.
  – Discriminate LG, NG and RG geo intention, and predict the location of the entity in Q.

GIU METHOD (cont'd)
• Pros and cons
  – Can be used for both explicit and implicit geo queries.
  – Compared to the topic-based method, the GIU method is more flexible and useful.
  – BUT a query-log-based method is constrained.
  – The classifiers are not improved, and the performance is not very good.


TOP-K
• Introduction
  [Figure: a query Q among spatial objects O1 to O6; local geo info.]
• Questions:
  – How to represent location proximity and text relevancy?
  – What kind of index can combine both location proximity and text relevancy?
Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects, in VLDB'09

TOP-K
• A simple example

TOP-K
• Hybrid index
  [Figures: objects and bounding rectangles; an IR-tree.]

TOP-K
• IR-tree algorithm
  [Figure: trace of the search queue, starting with the root R7 at the front. Entries shown with their scores include R5 (0.05119), R6 (0.269), R2 (0.1048), R1 (0.238), O1 (0.238), O3 (0.481), O2 (0.512), O4 (0.517) and O8 (0.686).]
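The queue trace on this slide suggests a best-first traversal in which tree nodes and objects share one priority queue ordered by score. The sketch below illustrates that general technique on a toy tree; it assumes that lower scores are better and that a node's score is a lower bound on the scores of everything beneath it. The tree layout, the names and the numbers are hypothetical and are not the index from the paper.

```python
import heapq
from itertools import count


def top_k(root, k):
    """Best-first search: pop the lowest-scored entry, expand nodes, report objects.

    Each entry is a dict with "name" and "score"; inner nodes also carry
    "children". The result is correct only if a node's score never exceeds
    the scores of its descendants (the lower-bound assumption).
    """
    tie = count()  # tie-breaker so dicts are never compared by the heap
    heap = [(root["score"], next(tie), root)]
    results = []
    while heap and len(results) < k:
        score, _, entry = heapq.heappop(heap)
        children = entry.get("children")
        if children is None:
            results.append((entry["name"], score))
        else:
            for child in children:
                heapq.heappush(heap, (child["score"], next(tie), child))
    return results


# Toy tree with made-up scores.
tree = {"name": "root", "score": 0.0, "children": [
    {"name": "A", "score": 0.1, "children": [
        {"name": "o1", "score": 0.3}, {"name": "o2", "score": 0.5}]},
    {"name": "B", "score": 0.2, "children": [
        {"name": "o3", "score": 0.25}, {"name": "o4", "score": 0.9}]},
]}
best_two = top_k(tree, 2)
```

Because objects are reported only when they reach the front of the queue, the search can stop after k objects without visiting subtrees whose bound is worse than the k-th result.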

TOP-K
• DIR-tree
  – Bounding rectangles focused only on location proximity

TOP-K
• DIR-tree (cont'd)
  [Figure: DIR-tree; top-2 result.]

TOP-K
• Conclusions
  – Proposed a new indexing framework for location-aware top-k text retrieval.
  – The framework integrates the inverted file for text retrieval and the R-tree for spatial proximity querying in a novel manner.
  – BUT it is only used for users to search local geo-information.


CONCLUSIONS & FUTURE WORK
• Research on discovering users' implicit geo intention has been a hot topic in recent years.
  – Some existing methods are based on models trained on large data sets, which are hard to adjust and to apply to other domains.
  – If it is local geo-information, it comes down to the question of kNN.
• Apart from training-based methods, is there another way to model users' implicit geo intention?

Thanks
Q&A?