- Number of slides: 23
A Ranking Algorithm for Semantic search engine – spam and fake detection case study By: Soheila Dehghanzadeh. Web technology lab weekly seminars.
Agenda : • Web spam definition • A brief overview of Search engines • Search engine phases: – Crawling – Index lookup – Ranking lookup results • My proposed ranking algorithm
Web spam and fake: • In the web of data, anyone can say anything about anything. • Low-quality data should not appear among the top search results.
A Search Engine:
Web of data vs. web of documents. • Web of documents (WODoc): no link types and no trustworthiness (just popularity). • Web of data (WOData): should consider link type and link context (for provenance and proof of trust).
Crawling & Indexing phase… • Using LDSpider to crawl linked data. • Using a hexastore to fully index the crawled data. Special thanks to Panagiotis Karras for providing a hexastore implementation in Python.
• Index lookup results for extension… [Figure: search engine pipeline; labels: Web of data, Crawler, Raw RDF, indexing, lookup, ranking, result] • Some results may not include the keyword, yet they have high quality and relevance. • Result expansion hides the locality effect. • Some sites are referred to many times, but in a specific context the lookup results of other, professional sites are more interesting.
Hexastore • The index structure that we use in our search engine. • Each RDF element type deserves to have a special index structure built around it. • Every possible ordering of the importance or precedence of the three elements in the indexing scheme is materialized. • Each index structure in a hexastore centers around one RDF element and defines a prioritization between the other two elements.
Sample spo indexing in a hexastore [Figure: subject S(i) points to its predicate list P(i,1), P(i,2), …, P(i,Ni); each predicate P(i,j) points to its object list O(i,j,1), O(i,j,2), …, O(i,j,k(i,j))] Space complexity: spo + sp + pso + so + pos + po
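To make the "every ordering is materialized" idea concrete, here is a minimal in-memory sketch. It is not the implementation credited on the crawling slide; the class name `Hexastore` and the methods `add` and `lookup` are hypothetical, and this toy version does not attempt to share the terminal lists between indexes.

```python
from collections import defaultdict
from itertools import permutations


class Hexastore:
    """Toy hexastore: one nested index per ordering of (s, p, o)."""

    # The six orderings: spo, sop, pso, pos, osp, ops
    ORDERS = ["".join(p) for p in permutations("spo")]

    def __init__(self):
        # index[order][first][second] -> set of third elements
        self.index = {
            order: defaultdict(lambda: defaultdict(set))
            for order in self.ORDERS
        }

    def add(self, s, p, o):
        """Insert one triple into all six indexes."""
        triple = {"s": s, "p": p, "o": o}
        for order in self.ORDERS:
            first, second, third = (triple[k] for k in order)
            self.index[order][first][second].add(third)

    def lookup(self, order, first, second=None):
        """Look up by the first one or two elements of the given ordering."""
        level = self.index[order].get(first, {})
        if second is None:
            # Return a plain-dict copy of the second level
            return {k: set(v) for k, v in level.items()}
        return set(level.get(second, set()))


store = Hexastore()
store.add("s1", "p1", "o1")
store.add("s1", "p1", "o2")
store.add("s2", "p1", "o1")
subjects_with_p1_o1 = store.lookup("pos", "p1", "o1")
```

Any triple pattern with one or two bound elements can then be answered from the index whose ordering starts with the bound elements.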
My idea! • Import the base result set into Jena and extend it. • Extend the base set with ontology reasoning rules, so that extra resources and relations are added through reasoning. • Added resources: resources are added only through the sameAs predicate. • Added relations: an inferred relation has no context, so its trustworthiness is an aggregation function over the (x, y, rule) relations. • Resources are ranked by relevance to the query terms (using ontobroker, PageRank, ObjectRank, TripleRank, HITS, …). • Query types: – Keyword query – Structured query – Ontology-based query (using an interface to get the query), ontobroker • Relations (properties) are ranked according to their contexts (provenance), using relation-ranking methods such as SemRank, or by looking at the context's PageRank. • Note: resources are ranked first and relations second; whether the user is looking for relations or resources depends on the query.
• Lookup on quads for keyword (Soheila) • Q1: http://um.s11, givenname, "Soheila", UM • Q2: http://NIOC/p25, fullname, "Soheila Dehghan", NIOC • Q3: http://nigc-khrz/e66, firstname, "Soheila", NIGC • Q4: http://fake/f4, name, "Soheila", fake
[Figure: graph connecting Q1, Q2, Q3 and Q4 to http://linkedIn/u12, http://facebook/u122 and http://isport/us122 through SA(context) edges, with contexts LI, UM, FK, NIGC and FB; other nodes include B. Gates, Scott and Cheese, and one edge is labelled Buy(Spam). The remaining edge labels are not recoverable.]
Result set expansion methods: • Step 1: use the sameAs predicate on the quads found and extend the ResultSet to Q1, …, Qr. • Index lookup: – (Q1(S), sameAs, ?) … (Qr(S), sameAs, ?) – (?, sameAs, Q1(S), ?) … (?, sameAs, Qr(S), ?) • In our case the result is (Q1, …, Q4, FBURI, LinkedInURI, isportURI). • Apply PR to the extended graph, in which each sameAs link is replaced with the PR weight of its context (to learn the trustworthiness of each context).
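Step 1 can be sketched as a small graph traversal over quads. This is only an illustration under assumptions: quads are plain `(s, p, o, context)` tuples, the predicate is spelled `owl:sameAs`, the function name `expand_same_as` is hypothetical, and links are followed transitively, whereas the slide only lists the one-hop lookup patterns.

```python
SAME_AS = "owl:sameAs"  # assumed spelling of the predicate


def expand_same_as(quads, seeds):
    """Return the seeds plus everything reachable through sameAs links.

    Links are followed in both directions, mirroring the two lookup
    patterns (node, sameAs, ?) and (?, sameAs, node).
    """
    result = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        for s, p, o, _context in quads:
            if p != SAME_AS:
                continue
            if s == node and o not in result:
                result.add(o)
                frontier.append(o)
            elif o == node and s not in result:
                result.add(s)
                frontier.append(s)
    return result


# Hypothetical data: the "ex:" names are placeholders, not from the slides.
example_quads = [
    ("ex:a", SAME_AS, "ex:b", "ctx1"),
    ("ex:c", SAME_AS, "ex:b", "ctx2"),
    ("ex:a", "ex:name", "Soheila", "ctx1"),
    ("ex:x", SAME_AS, "ex:y", "ctx3"),
]
expanded = expand_same_as(example_quads, ["ex:a"])
```

The scan over all quads stands in for the index lookups; with a real quad index each hop would be two pattern lookups per node.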
Result set expansion methods: • Step 2: look up all properties of Q1(s), …, Qr(s): – (Q1(s), ?, ?) and (?, Q1(s), ?) – … – (Qr(s), ?, ?) and (?, Qr(s), ?) • Step 3: add inferred relations using the domain ontology (the context is composed of the ontology plus the inference process). • Step 4: rank Q1, …, Qr according to their TpageRank (computed online from the graph of step 1); rank relations according to their context PageRank (which is computed by Google offline). • Note: contexts whose PR is lower than a threshold won't be shown; they may be spam or fake sites.
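The ranking and threshold steps could look roughly like the sketch below. It is a generic weighted PageRank by power iteration, not necessarily the TpageRank computation meant on the slide. The function names, the damping factor, the iteration count, and the way context weights are attached to edges are all assumptions, and the edge list is assumed to be non-empty.

```python
def weighted_pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target, weight) edges.

    Each weight is assumed to be the PR of the context that asserted the link.
    """
    nodes = sorted({n for s, t, _w in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _t, w in edges:
        out_weight[s] += w
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0.0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {n: base for n in nodes}
        for s, t, w in edges:
            new_rank[t] += damping * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank


def filter_and_rank(quads, context_pr, resource_rank, threshold):
    """Drop quads from low-PR contexts, then order by subject rank."""
    kept = [q for q in quads if context_pr.get(q[3], 0.0) >= threshold]
    return sorted(kept, key=lambda q: resource_rank.get(q[0], 0.0), reverse=True)
```

A context missing from `context_pr` is treated as having PR 0 and is therefore filtered out, which is a design choice of this sketch rather than something the slides specify.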
Structured query on quad indexes • Single pivot: – (s, ?, ?, ?), (?, p, ?, ?), (?, ?, o, ?), (?, ?, ?, c) • Double pivot: – (s, p, ?, ?), (s, ?, o, ?), (s, ?, ?, c), (?, p, o, ?), (?, p, ?, c), (?, ?, o, c) • Triple pivot: – (s, p, o, ?), (s, p, ?, c), (s, ?, o, c), (?, p, o, c)
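The pivot patterns above can be illustrated with a tiny matcher in which `None` plays the role of "?". The names `match` and `pivot_count` are hypothetical, and the linear scan only shows the semantics; a real engine would answer each pattern from the matching index.

```python
def match(quads, pattern):
    """Return the quads matching an (s, p, o, c) pattern; None means unbound."""
    return [
        quad
        for quad in quads
        if all(want is None or want == got for want, got in zip(pattern, quad))
    ]


def pivot_count(pattern):
    """Number of bound positions: 1 = single, 2 = double, 3 = triple pivot."""
    return sum(term is not None for term in pattern)


# Hypothetical data for illustration only.
sample_quads = [
    ("ex:s1", "ex:name", "Soheila", "ctxA"),
    ("ex:s2", "ex:name", "Soheila", "ctxB"),
    ("ex:s1", "ex:worksAt", "ex:org", "ctxA"),
]
double_pivot_hits = match(sample_quads, (None, "ex:name", None, "ctxA"))
```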
• Step 1: if the specified parts are URIs, the search engine performs a direct lookup. Otherwise, if the user has specified keywords for the parts, a keyword search is done first and then a lookup is performed for each resulting URI.
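A rough sketch of this dispatch, restricted to the subject position for brevity. The URI test, the case-insensitive substring matching used as "keyword search", and the names `is_uri` and `lookup_subject` are assumptions made for illustration.

```python
def is_uri(term):
    """Crude check: treat anything starting with an HTTP scheme as a URI."""
    return term.startswith(("http://", "https://"))


def lookup_subject(quads, term):
    """Direct lookup for a URI; keyword search followed by lookup otherwise."""
    if is_uri(term):
        subjects = {term}
    else:
        # Keyword search: subjects having an object value containing the keyword.
        needle = term.lower()
        subjects = {s for s, _p, o, _c in quads if needle in str(o).lower()}
    # Second phase: one lookup per resulting subject URI.
    return [quad for quad in quads if quad[0] in subjects]
```

With the quads from the keyword lookup slide, a call such as `lookup_subject(quads, "Soheila")` would first collect the four subject URIs and then return every quad stored for them.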
Lookup on quads for ontological queries
[Figure: example graph with context-labelled nodes and edges: DehghanZadeh (GAS), worked at (GAS), GAS; owl:sameAs (FUM); Soheila (FUM), studied in (FUM), FUM; supervisor (FUM), Kahani (FUM); owl:sameAs (NIOC); Sally (NIOC), played in (NIOC), NIOC Team]
Related work on ranking the web of data… • ObjectRank • Ding • Sindice • tf-idf • EntityRank • SemRank • ReConRank • ontobroker…
Proof of trust • Jena's inference explanation will be used as the proof of trust.
Evaluation … • Compare Spam ranks • Compare query time • Compare index size
The best things in life are free. Thanks for your attention.