Trust based web spam detection in semantic search

Trust-based web spam detection in a semantic search engine
By: Soheila Dehghanzadeh

What is web spam?
One mark of a successful information system is how heavily spammers attack it. Spam pages on the web use a variety of techniques to reach high ranks in search engine results and to mislead the engines. Search engines must weigh result quality and relevance together so that the large volume of information on the web remains usable. In search engine optimization and adversarial information retrieval, the goal is to discover the search engine's scoring function and artificially raise a page's rank in the retrieved results, in order to profit commercially from appearing near the top. Because detecting spam pages by hand is infeasible, the process has to be automated; and because spammers keep changing their techniques to mislead search engines, countering them automatically is very hard.

Spamming techniques in the Web of Documents (WODoc)
• Roughly every 2-3 days a new technique for misleading search engines appears.
• The key point is that spammers' techniques depend entirely on the ranking algorithms of the targeted search engine.
• Spammers' techniques:
• Term spamming: using words to create spam (gratuitous use of important search keywords).
• Link spamming: using links to create spam (misleading PageRank).
• Cloaking: serving two versions at one address, one for users and one for search engines.
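Term spamming can be illustrated with a simple keyword-density measure. This is only a minimal sketch: the function name `keyword_density` and the idea of flagging pages by density are assumptions made for illustration, not a method described in the slides.

```python
import re
from collections import Counter


def keyword_density(text, keyword):
    """Fraction of the words in `text` that are exactly `keyword` (case-insensitive)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return 0.0
    return Counter(words)[keyword.lower()] / len(words)


# Hypothetical page text stuffed with one search keyword.
STUFFED_PAGE = "cheap cheese cheese cheese buy cheese"
density = keyword_density(STUFFED_PAGE, "cheese")
```

A real detector would need far more than one ratio, since legitimate pages about a topic also repeat its keywords.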

Spamming techniques in the Web of Data (WOData)
• False Labelling
• Misdirection
• Schema Pollution
• Identity Assumption
• Bait and Switch
• Misattribution
• Data URI Embedding

False Labelling
• The spammer simply asserts labelling triples that promote their message. Linked Data systems often display the objects of these triples when labelling resources. If the spammer targets popular subject URIs, there is a higher chance of their message appearing for users of the Linked Data system. For example:
• dbpedia:London rdfs:label "Buy more Wensleydale" .
• <http://danbri.org/foaf.rdf#danbri> foaf:name "Wensleydale fan" .
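One possible counter to false labelling is to accept labelling triples for a namespace only from whitelisted sources. The sketch below assumes each triple arrives together with the URL it was fetched from; the whitelist contents and the function name `is_trusted_label` are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical policy: which hosts may label resources in which namespace.
TRUSTED_LABEL_SOURCES = {
    "http://dbpedia.org/resource/": {"dbpedia.org"},
}

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
LABEL_PROPERTIES = {RDFS_LABEL, FOAF_NAME}


def is_trusted_label(subject, predicate, source_url):
    """Reject labelling triples about a protected namespace from non-whitelisted hosts."""
    if predicate not in LABEL_PROPERTIES:
        return True  # not a labelling triple, so this check does not apply
    host = urlparse(source_url).hostname
    for namespace, hosts in TRUSTED_LABEL_SOURCES.items():
        if subject.startswith(namespace):
            return host in hosts
    return True  # subject is outside every protected namespace
```

With this policy, a label for `http://dbpedia.org/resource/London` fetched from `example.com` is rejected, while the same label fetched from `dbpedia.org` is accepted.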

Misdirection
• The attacker asserts triples using properties that are commonly used to provide links to human-readable content. In the attack, the triple objects are resources that contain the attacker's message. Systems that use these properties may inadvertently display links to the spammer's site and content:
• dbpedia:London rdfs:seeAlso <http://example.com/buycheese> .
• dbpedia:Tim_Berners-Lee foaf:isPrimaryTopicOf <http://example.com/buycheese> .
• <http://sws.geonames.org/3333196/> mo:wikipedia <http://example.com/buycheese> .

Schema Pollution
• In this attack all of the instance data is innocuous, but some of the properties used in the data are labelled with the spammer's message. When rendering data for human use, many Linked Data systems look for schema information to label unknown predicates. This attack causes those systems to display the spammer's message:
• ex:thing dc:title "New study finds that mice can learn to sing." ; a foaf:Document ; dc:subject "mouse behaviour" ; ex:prop "Journal of mouse psychology" .
• ex:prop a rdf:Property ; rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .
• This attack can be combined with False Labelling, attempting to inject a message into a commonly used schema:
• dc:title rdfs:label "Lowest Wensleydale prices at bargaincheeseshop.com" .

Identity Assumption
• This attack exploits the common practice of minting URIs in one URI space and using owl:sameAs to connect the resource to identical resources in other URI spaces. The attacker simply describes a resource that conveys their message and then uses owl:sameAs to make it identical to popular resources. Most Linked Data systems recognise owl:sameAs and aggregate all triples about any subjects declared to be identical.
• ex:thing dc:title "Wensleydale: the mature, smooth cheese you will love." ; owl:sameAs dbpedia:The_Beatles ; owl:sameAs dbpedia:Lady_Gaga ; owl:sameAs dbpedia:True_Blood ; owl:sameAs dbpedia:Harry_Potter .
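To see why aggregation over owl:sameAs is dangerous, the following sketch merges the descriptions of all resources declared identical, using a small union-find. It is an illustration under simplifying assumptions: triples are plain tuples of prefixed names, and `aggregate_by_same_as` is a hypothetical helper, not the behaviour of any particular Linked Data system.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def aggregate_by_same_as(triples):
    """Merge the descriptions of all resources linked by owl:sameAs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # First pass: union every pair declared identical.
    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    # Second pass: collect all other statements under the group representative.
    merged = defaultdict(set)
    for s, p, o in triples:
        if p != SAME_AS:
            merged[find(s)].add((p, o))

    # Expose the merged description under every member of each group.
    return {node: merged[find(node)] for node in list(parent)}


SPAM_TITLE = "Wensleydale: the mature, smooth cheese you will love."
EXAMPLE_TRIPLES = [
    ("ex:thing", "dc:title", SPAM_TITLE),
    ("ex:thing", SAME_AS, "dbpedia:The_Beatles"),
    ("ex:thing", SAME_AS, "dbpedia:Lady_Gaga"),
    ("dbpedia:The_Beatles", "rdfs:label", "The Beatles"),  # illustrative legitimate triple
]
view = aggregate_by_same_as(EXAMPLE_TRIPLES)
```

After aggregation, `view["dbpedia:Lady_Gaga"]` contains the spammer's dc:title even though no triple ever stated it about that resource directly.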

Bait and Switch
• In this vector, the spammer uses content negotiation to provide enticing Linked Data to machines and spam messages to humans. When a Linked Data system fetches a URI, it indicates that it requires machine-readable data by sending an appropriate HTTP header (the Accept header). Web browsers under the control of a human send a different value for the header, so servers can distinguish machines from humans and send different information. The spammer can configure their server to send innocuous Linked Data to machines while showing the spam message to humans who visit the same URI. (See my earlier post "Is the semantic web destined to be a shadow?" for some of the consequences of this separation of machine/human content.)
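The mechanism can be sketched as a simplified content-negotiation function. This is an assumption-laden illustration: it ignores q-values and wildcards, and the names `negotiate` and `EXAMPLE_REPRESENTATIONS` are hypothetical.

```python
def negotiate(accept_header, representations, default="text/html"):
    """Return (media_type, body) for the first listed media type the server offers.

    Simplified: q-values and wildcards such as */* are not interpreted.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in representations:
            return media_type, representations[media_type]
    return default, representations[default]


# One URI, two very different representations (illustrative content).
EXAMPLE_REPRESENTATIONS = {
    "application/rdf+xml": "<rdf:RDF><!-- innocuous machine-readable data --></rdf:RDF>",
    "text/html": "<p>Lowest Wensleydale prices at bargaincheeseshop.com</p>",
}

machine_view = negotiate("application/rdf+xml", EXAMPLE_REPRESENTATIONS)
human_view = negotiate("text/html,application/xhtml+xml;q=0.9", EXAMPLE_REPRESENTATIONS)
```

A crawler that only ever sends a machine-oriented Accept header never sees the body served to browsers, which is what makes this vector hard to detect from the data alone.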

Misattribution
• Under this attack, the spammer attributes their message to someone they hope the recipient will trust. Linked Data systems may ingest this data and display the quotation with the source, inadvertently misleading their users:
• ex:1 a bibo:Quote ; bibo:content "I always buy Wensleydale from bargaincheeseshop.com and so should you" ; dc:creator "Sergey Brin" .

Data URI Embedding
• In this attack vector the data itself is innocuous, but the URIs used by the attacker use the data: scheme to embed the spam message. If these URIs are displayed to the user of a Linked Data system, the user may click on them and trigger the message display:
• dbpedia:London rdfs:seeAlso <data:text/html;charset=utf-8;base64,PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4=> .
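Decoding the payload shows what such a URI hides. The sketch below uses the data: URI from the example with the extraction spacing removed; reading the charset parameter as `utf-8` is an assumption, and `decode_data_uri` is a hypothetical helper that handles only base64 payloads.

```python
import base64


def decode_data_uri(uri):
    """Return (media_type, decoded_text) for a base64-encoded data: URI."""
    if not uri.startswith("data:"):
        raise ValueError("not a data: URI")
    header, _, payload = uri[len("data:"):].partition(",")
    params = header.split(";")
    if "base64" not in params:
        raise ValueError("only base64 payloads are handled in this sketch")
    charset = "utf-8"  # assumed default when no charset parameter is present
    for param in params[1:]:
        if param.startswith("charset="):
            charset = param[len("charset="):]
    return params[0], base64.b64decode(payload).decode(charset)


SPAM_URI = (
    "data:text/html;charset=utf-8;base64,"
    "PGEgaHJlZj0iaHR0cDovL2V4YW1wbGUuY29tL2J1eWNoZWVzZSI+"
    "bG93ZXN0IFdlbnNsZXlkYWxlIHByaWNlczwvYT4="
)
media_type, html = decode_data_uri(SPAM_URI)
```

Here `html` is an anchor element linking to `http://example.com/buycheese` with the text "lowest Wensleydale prices", so a system could inspect decoded data: URIs before rendering them as clickable links.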

Spam conclusion
• Most of these attack vectors can be countered through a whitelist provenance system, but such systems are not easy to scale.
• One property of RDF, that duplicate triples can be ignored, makes it easy to bury spam inside billions of legitimate triples: simply take a copy of DBpedia and add a few spam triples.
• A casual inspection of the dataset will more than likely see only the DBpedia triples, but a Linked Data system that already holds those triples will ignore them and add just the spam triples.
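Because an RDF graph is a set of triples, the effect of loading such a dataset is a set difference. A minimal sketch, with triples modelled as Python tuples and all data invented for illustration:

```python
def new_triples(existing, incoming):
    """Triples that loading `incoming` would actually add to a store holding `existing`."""
    return set(incoming) - set(existing)


# Hypothetical store contents and a "copy of DBpedia" with one spam triple added.
STORE = {
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
}
POISONED_DUMP = [
    ("dbpedia:London", "rdfs:label", "London"),
    ("dbpedia:Paris", "rdfs:label", "Paris"),
    ("dbpedia:London", "rdfs:label", "Buy more Wensleydale"),
]
added = new_triples(STORE, POISONED_DUMP)
```

Only the spam triple ends up in `added`. The same difference could be used defensively: inspecting what a new source contributes beyond already-known data is far cheaper than inspecting the whole dump.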

Search engine techniques to deal with web spam
• Search engines are the most important gateways to the web.
• A self-evident principle for spam detection: "It is very unlikely that good, high-quality pages link to spam pages." This principle is the basis of the TrustRank algorithm.
• The TrustRank algorithm: select the seed set and invoke the oracle, using inverse PageRank and high PageRank; then propagate trust from the seeds to other websites and identify spam, based on PageRank ...
• Spammers do sometimes place a link to their own page in the comments section of a good page, which causes trouble for this algorithm.
• Trust propagation should be attenuated as the distance from the seed set grows.
• TrustRank does not distinguish between different kinds of links. To apply it to Linked Data, the algorithm must be adapted to the different link types.
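Trust propagation can be sketched as a PageRank-style iteration whose teleport distribution is concentrated on the seed set. This is a simplified illustration, not the exact procedure from the slides: the seeds are taken as given (seed selection and the oracle are skipped), all links are treated alike, and the damping factor `alpha` and iteration count are assumed values.

```python
def trustrank(out_links, seeds, alpha=0.85, iterations=20):
    """Propagate trust from `seeds` along out-links, attenuated by `alpha` per hop."""
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    # Static trust distribution: uniform over the seed pages, zero elsewhere.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(d)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * d[p] for p in pages}
        for p, targets in out_links.items():
            if targets:
                share = alpha * trust[p] / len(targets)  # split trust among out-links
                for q in targets:
                    nxt[q] += share
        trust = nxt  # trust held by pages without out-links is simply dropped
    return trust


# Hypothetical web graph: good pages link among themselves, spam pages likewise.
EXAMPLE_GRAPH = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": [],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trustrank(EXAMPLE_GRAPH, seeds={"good1"})
```

In this graph the spam pages receive no trust because no trusted page links to them, and `good3` scores lower than `good1` because it is further from the seed. A single link from a good page to `spam1`, as in the comment-spam case, would leak trust into the spam cluster.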

Search engine architecture

Ranking

A two-layer model for the Web of Data

Evaluation
• Spam detection is a classification problem. Since no dataset has been designated for this task and no tests have yet been carried out in this area, we take all the triples indexed by the Sindice search engine, divide them into spam and non-spam classes, and compare the result with the output of well-known classification algorithms. Evaluating precision and recall, and comparing them, will show how effective the algorithm is.
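Precision and recall for the spam class can be computed from the set of triples the algorithm flags and the set that is actually spam. A minimal sketch; the triple identifiers are invented for illustration and do not come from any real evaluation.

```python
def precision_recall(predicted_spam, actual_spam):
    """Precision and recall of the spam class, given predicted and true spam items."""
    predicted, actual = set(predicted_spam), set(actual_spam)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical labels: the detector flags t1-t3, while t2-t5 are really spam.
precision, recall = precision_recall({"t1", "t2", "t3"}, {"t2", "t3", "t4", "t5"})
```

Here two of the three flagged triples are truly spam (precision 2/3) and two of the four spam triples are found (recall 1/2).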