Web Characteristics I
Adapted from Lectures by Prabhakar Raghavan (Yahoo, Stanford) and Christopher Manning (Stanford)
Prasad, L14: Web Char. I

Search use …
(iProspect Survey, 4/04, http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf)

Without search engines the web wouldn’t scale
1. No incentive in creating content unless it can be easily found – other finding methods haven’t kept pace (taxonomies, bookmarks, etc)
2. The web is both a technology artifact and a social environment
   - “The Web has become the ‘new normal’ in the American way of life; those who don’t go online constitute an ever-shrinking minority.” – [Pew Foundation report, January 2005]

(Cont’d)
1. Search engines make aggregation of interest possible:
   - Create incentives for very specialized niche players
     - Economical – specialized stores, providers, etc
     - Social – narrow interests, specialized communities, etc
2. The acceptance of search interaction makes “unlimited selection” stores possible:
   - Amazon, Netflix, etc
3. Search turned out to be the best mechanism for advertising on the web, a $15+ B industry.
   - Growing very fast, but the entire US advertising industry is $250 B – huge room to grow
   - Sponsored search marketing is about $10 B

Classical IR vs. Web IR

Basic assumptions of Classical Information Retrieval
- Corpus: Fixed document collection
- Goal: Retrieve documents with information content that is relevant to user’s information need

Classic IR Goal
- Classic relevance
  - For each query Q and stored document D in a given corpus assume there exists relevance Score(Q, D)
    - Score is average over users U and contexts C
  - Optimize Score(Q, D) as opposed to Score(Q, D, U, C)
  - That is, usually:
    - Context ignored
    - Individuals ignored
    - Corpus predetermined
  - These are bad assumptions in the web context
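The averaging assumption can be made concrete with a small sketch. The judgment table, user names, and context labels below are hypothetical; the slides define no data format. The point is only that collapsing Score(Q, D, U, C) into Score(Q, D) hides disagreement between users and contexts.

```python
from statistics import mean

# Hypothetical relevance judgments for one (Q, D) pair on a 0..1 scale.
# Keys are (user, context); none of these values come from the slides.
judgments = {
    ("alice", "at work"): 1.0,
    ("alice", "at home"): 0.0,
    ("bob", "at work"): 1.0,
    ("bob", "at home"): 0.0,
}

def classic_score(judgments):
    """Score(Q, D): a single number averaged over all users U and contexts C."""
    return mean(judgments.values())

def contextual_score(judgments, user, context):
    """Score(Q, D, U, C): the judgment for one specific user and context."""
    return judgments[(user, context)]

print(classic_score(judgments))                         # 0.5
print(contextual_score(judgments, "alice", "at home"))  # 0.0
```

The averaged score of 0.5 describes no actual user here: the document is fully relevant in one context and irrelevant in the other.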

Web IR: The coarse-level dynamics
[Figure: content creators, content aggregators, and content consumers, connected by flows labeled Subscription, Feeds, Crawls, Transaction, Advertisement, and Editorial]

Brief (non-technical) history
- Early keyword-based engines
  - Altavista, Excite, Infoseek, Inktomi, ca. 1995-1997
- Paid placement ranking: Goto.com (morphed into Overture.com, then Yahoo!)
  - Search ranking depended on how much you paid
  - Auction for keywords: casino was expensive!

Brief (non-technical) history
- 1998+: Link-based ranking pioneered by Google
  - Blew away all early engines save Inktomi
  - Great user experience in search of a business model
  - Meanwhile Goto/Overture’s annual revenues were nearing $1 billion
- Result: Google added paid-placement “ads” to the side, independent of search results
  - Yahoo follows suit, acquiring Overture (for paid placement) and Inktomi (for search)

[Figure: search results page with the Ads and the Algorithmic results labeled]

Ads vs. search results
- Google has maintained that ads (based on vendors bidding for keywords) do not affect vendors’ rankings in search results
[Figure: results page for the search “miele”]

Ads vs. search results
- Other vendors (Yahoo, MSN) have made similar statements from time to time
  - Any of them can change anytime
- We will focus primarily on search results independent of paid placement ads
  - Although the latter is a fascinating technical subject in itself

Web search basics
[Figure: components labeled User, Web spider, Search, Indexer, The Web, Indexes, and Ad indexes]

User Needs
- Need [Brod02, RL04]
  - Informational – want to learn about something (~40% / 65%)
    - Low hemoglobin, HbA1C
  - Navigational – want to go to that page (~25% / 15%)
    - United Airlines, APOD
  - Transactional – want to do something (web-mediated) (~35% / 20%)
    - Access a service: Seattle weather, WINGS
    - Downloads: Mars surface images
    - Shop: Canon Powershot SD 100
  - Gray areas
    - Find a good hub: Car rentals, IR Bibliography
    - Exploratory search “see what’s there”

Web search users and queries
- Make ill defined queries
  - Short (AV 2001: 2.54 terms avg, 80% < 3 words)
  - Imprecise terms, and sub-optimal syntax (most queries without operator)
- Wide variance in
  - Needs
  - Expectations
  - Knowledge
  - Bandwidth
- Specific behavior
  - 85% look over one result screen only
  - 78% of queries are not modified (one query/session)
  - Follow links – “the scent of information”...

Query Distribution
Power law: few popular broad queries, many rare specific queries

How far do people look for results?
(Source: iprospect.com WhitePaper_2006_SearchEngineUserBehavior.pdf)

True example*
[Figure: from task to results, with the error introduced at each step]
- TASK: Noisy building fan in courtyard
  - Mis-conception
- Info Need: Info about EPA regulations
  - Mis-translation
- Verbal form: What are the EPA rules about noise pollution
  - Mis-formulation
- Query: EPA sound pollution
  - Polysemy, Synonymy
- SEARCH ENGINE (over the Corpus) returns Results, which feed Query Refinement
* To Google or to GOTO, Business Week Online, September 28, 2001

Users’ empirical evaluation of results
- Quality of pages varies widely
  - Relevance is not enough
  - Other desirable qualities (non IR!!)
    - Content: Trustworthy, new info, non-duplicates, well maintained
    - Web readability: display correctly & fast
    - No annoyances: pop-ups, etc
- Precision vs. recall
  - On the web, recall seldom matters
    - Except when the number of matches is very small
  - What matters
    - Precision at 1? Precision above the fold?
    - Comprehensiveness – must be able to deal with obscure queries
- User perceptions may be unscientific, but are significant over a large aggregate
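A minimal sketch of the two measures being contrasted. The document IDs and the relevant set are hypothetical. "Precision at 1" is the k = 1 case; "precision above the fold" would use whatever k fits on the first screen, which is an assumption about the display rather than something the slides specify.

```python
def precision_at_k(ranked_results, relevant, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc in ranked_results[:k] if doc in relevant)
    return hits / k

def recall_at_k(ranked_results, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    hits = sum(1 for doc in ranked_results[:k] if doc in relevant)
    return hits / len(relevant)

# Hypothetical ranking and relevance judgments for one query.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}

print(precision_at_k(ranked, relevant, 1))  # 1.0
print(precision_at_k(ranked, relevant, 5))  # 0.4
print(recall_at_k(ranked, relevant, 5))     # 2 of 3 relevant documents found
```

On the web the relevant set is usually too large to enumerate, which is one practical reason recall is rarely measured directly.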

Users’ empirical evaluation of engines
- Relevance and validity of results
- UI – Simple, no clutter, error tolerant
- Trust – Results are objective
- Coverage of topics for polysemic queries
- Pre/Post process tools provided
  - Mitigate user errors (auto spell check, syntax errors, …)
  - Explicit: Search within results, more like this, refine...
  - Anticipative: related searches
- Deal with idiosyncrasies
  - Web specific vocabulary
    - Impact on stemming, spell-check, etc
  - Web addresses typed in the search box
  - …

Loyalty to a given search engine
(iProspect Survey, 4/04)

The Web corpus
- The Web
  - No design/co-ordination
  - Distributed content creation and linking, democratization of publishing
  - Content includes truth, lies, obsolete information, contradictions …
  - Unstructured (text, html, …), semi-structured (XML, annotated photos), structured (Databases) …
  - Scale much larger than previous text corpora … but corporate records are catching up.
  - Growth – slowed down from initial “volume doubling every few months” but still expanding
  - Content can be dynamically generated

The Web: Dynamic content
- A page without a static HTML version
  - E.g., current status of flight AA129
  - Current availability of rooms at a hotel
- Usually, assembled at the time of a request from a browser
  - Typically, URL has a ‘?’ character in it
[Figure: a Browser sends “AA129” to an Application server, which consults Back-end databases]
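The ‘?’ rule of thumb can be sketched with the standard library. The URLs below are hypothetical, and this is only a heuristic: a crawler that relied on it would misclassify dynamic pages served from clean paths.

```python
from urllib.parse import urlsplit

def looks_dynamic(url):
    """Heuristic: treat a URL with a non-empty query string as dynamic.

    A bare trailing '?' with nothing after it counts as static here.
    """
    return urlsplit(url).query != ""

# Hypothetical URLs for illustration only.
print(looks_dynamic("http://airline.example/status?flight=AA129"))  # True
print(looks_dynamic("http://airline.example/about.html"))           # False
```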

Dynamic content
- Most dynamic content is ignored by web spiders
  - Many reasons including malicious spider traps
- Some content (news stories from subscriptions) is delivered as dynamic content
  - Application-specific spidering
- Spiders commonly view web pages just as Lynx (a text browser) would
- Note: even “static” pages are typically assembled on the fly (e.g., headers are common)

The web: size
- What is being measured?
  - Number of hosts
  - Number of (static) html pages
    - Volume of data
- Number of hosts – netcraft survey
  - http://news.netcraft.com/archives/web_server_survey.html
  - Monthly report on how many web hosts & servers are out there
- Number of pages – numerous estimates (will discuss later)

Netcraft Web Server Survey
http://news.netcraft.com/archives/web_server_survey.html

The web: evolution
- All of these numbers keep changing
- Relatively few scientific studies of the evolution of the web [Fetterly & al, 2003]
  - http://research.microsoft.com/research/sv/svpubs/p97-fetterly.pdf
- Sometimes possible to extrapolate from small samples (fractal models) [Dill & al, 2001]
  - http://www.vldb.org/conf/2001/P069.pdf

Rate of change
- [Cho00] 720K pages from 270 popular sites sampled daily from Feb 17 – Jun 14, 1999
  - Any changes: 40% weekly, 23% daily
- [Fett02] Massive study: 151M pages checked over a few months
  - Significant changes – 7% weekly
  - Small changes – 25% weekly
- [Ntul04] 154 large sites re-crawled from scratch weekly
  - 8% new pages/week
  - 8% die
  - 5% new content
  - 25% new links/week

Static pages: rate of change
- Fetterly et al. study (2002): several views of data, 150 million pages over 11 weekly crawls
  - Bucketed into 85 groups by extent of change

Other characteristics
- Significant duplication
  - Syntactic – 30%-40% (near) duplicates [Brod97, Shiv99b, etc.]
  - Semantic – ???
- High linkage
  - More than 8 links/page in the average
- Complex graph topology
  - Not a small world; bow-tie structure [Brod00]
- Spam
  - Billions of pages
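One common way to quantify syntactic near-duplication is to compare the sets of word k-grams ("shingles") of two pages using Jaccard similarity. The sketch below is an illustration under that assumption, not necessarily the exact procedure of the cited studies; the shingle size of 3 and the sample texts are arbitrary choices.

```python
def shingles(text, k=3):
    """Set of all runs of k consecutive words in the text (lowercased)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Two hypothetical pages that differ in a single word.
page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox leaps over the lazy dog"

resemblance = jaccard(shingles(page_a), shingles(page_b))
print(resemblance)  # 4 shared shingles out of 10 distinct ones
```

A single changed word disturbs every shingle that overlaps it, so a practical near-duplicate threshold has to be chosen with the shingle size and typical page length in mind.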

Spam: Search Engine Optimization

The trouble with paid placement…
- It costs money. What’s the alternative?
- Search Engine Optimization:
  - “Tuning” your web page to rank highly in the search results for select keywords
  - Alternative to paying for placement
  - Thus, intrinsically a marketing function
- Performed by companies, webmasters and consultants (“Search engine optimizers”) for their clients
- Some perfectly legitimate, some very shady

Simplest forms
- First generation engines relied heavily on tf/idf
  - The top-ranked pages for the query maui resort were the ones containing the most maui’s and resort’s
- SEOs responded with dense repetitions of chosen terms
  - e.g., maui resort
  - Often, the repetitions would be in the same color as the background of the web page
    - Repeated terms got indexed by crawlers
    - But not visible to humans on browsers
Pure word density cannot be trusted as an IR signal
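A toy scorer shows why raw word density is easy to game. This is a sketch that uses term frequency only; the idf factor is left out for brevity because it is the same for every page and so does nothing to penalize repetition. Both page texts are invented.

```python
from collections import Counter

def tf_score(query, page_text):
    """Sum of the raw counts of each query term in the page."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())

# Hypothetical pages: one written for readers, one stuffed with keywords.
honest = "our maui resort offers ocean views and a quiet beach"
stuffed = "maui resort " * 20 + "cheap pills"

print(tf_score("maui resort", honest))   # 2
print(tf_score("maui resort", stuffed))  # 40
```

The stuffed page outranks the honest one by a factor of 20 without saying anything more about Maui resorts.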

Variants of keyword stuffing
- Misleading meta-tags, excessive repetition
- Hidden text with colors, style sheet tricks, etc.
Meta-Tags = “… London hotels, hotel, holiday inn, hilton, discount, booking, reservation, sex, mp3, britney spears, viagra, …”

Search engine optimization (Spam)
- Motives
  - Commercial, political, religious, lobbies
  - Promotion funded by advertising budget
- Operators
  - Contractors (Search Engine Optimizers) for lobbies, companies
  - Web masters
  - Hosting services
- Forums
  - E.g., Web master world (www.webmasterworld.com)
    - Search engine specific tricks

Cloaking
- Serve fake content to search engine spider
- DNS cloaking: Switch IP address. Impersonate
[Figure: decision “Is this a Search Engine spider?”; Y leads to SPAM, N leads to Real Doc]

The spam industry

More spam techniques
- Doorway pages
  - Pages optimized for a single keyword that re-direct to the real target page
- Link spamming
  - Mutual admiration societies, hidden links, awards
  - Domain flooding: numerous domains that point or redirect to a target page
- Robots
  - Fake query stream – rank checking programs
    - “Curve-fit” ranking programs of search engines
  - Millions of submissions via Add-Url

The war against spam
- Quality signals - Prefer authoritative pages based on:
  - Votes from authors (linkage signals)
  - Votes from users (usage signals)
- Policing of URL submissions
  - Anti robot test
- Limits on meta-keywords
- Robust link analysis
  - Ignore statistically implausible linkage (or text)
  - Use link analysis to detect spammers (guilt by association)
- Spam recognition by machine learning
  - Training set based on known spam
- Family friendly filters
  - Linguistic analysis, general classification techniques, etc.
  - For images: flesh tone detectors, source text analysis, etc.
- Editorial intervention
  - Blacklists
  - Top queries audited
  - Complaints addressed
  - Suspect pattern detection

More on spam
- Web search engines have policies on SEO practices they tolerate/block
  - http://help.yahoo.com/help/us/ysearch/index.html
  - http://www.google.com/intl/en/webmasters/
- Adversarial IR: the unending (technical) battle between SEOs and web search engines
- Research: http://airweb.cse.lehigh.edu/

Answering “the need behind the query”
- Semantic analysis
  - Query language determination
    - Auto filtering
    - Different ranking (if query in Japanese do not return English)
  - Hard & soft (partial) matches
    - Personalities (triggered on names)
    - Cities (travel info, maps)
    - Medical info (triggered on names and/or results)
    - Stock quotes, news (triggered on stock symbol)
    - Company info
    - Etc.
- Natural Language reformulation
- Integration of Search and Text Analysis

The spatial context -- geo-search
Two aspects
- Geo-coding -- encode geographic coordinates to make search effective
- Geo-parsing -- the process of identifying geographic context.
Geo-coding
- Geometrical hierarchy (squares)
- Natural hierarchy (country, state, county, city, zip-codes, etc)
Geo-parsing
- Pages (infer from phone nos, zip, etc). About 10% can be parsed.
- Queries (use dictionary of place names)
- Users
  - Explicit (tell me your location -- used by NL, registration, from ISP)
  - From IP data
- Mobile phones
  - In its infancy, many issues (display size, privacy, etc)
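Geo-parsing a query with a dictionary of place names can be sketched as a longest-match lookup. The gazetteer below is a hypothetical three-entry stand-in with rough coordinates; a real one would be far larger and would have to deal with ambiguous names.

```python
# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "bronx": (40.84, -73.86),
    "las vegas": (36.17, -115.14),
    "new york": (40.71, -74.01),
}

def geo_parse(query, gazetteer=GAZETTEER, max_words=3):
    """Return (place, remaining_terms), preferring the longest place name.

    Returns (None, all_terms) when no known place name occurs in the query.
    """
    words = query.lower().split()
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in gazetteer:
                return candidate, words[:start] + words[start + size:]
    return None, words

print(geo_parse("dentists bronx"))          # ('bronx', ['dentists'])
print(geo_parse("andrei broder new york"))  # ('new york', ['andrei', 'broder'])
print(geo_parse("miele"))                   # (None, ['miele'])
```

Once a place is recognized, its coordinates (the geo-coding step) can be used to restrict or re-rank results for the remaining terms.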

Yahoo!: britney spears

Ask Jeeves: las vegas

Yahoo!: salvador hotels

Yahoo shortcuts
- Various types of queries that are “understood”

Google: andrei broder new york

Answering “the need behind the query”: Context
- Context determination
  - spatial (user location/target location)
  - query stream (previous queries)
  - personal (user profile)
  - explicit (user choice of a vertical search)
  - implicit (use Google from France, use google.fr)
- Context use
  - Result restriction
    - Kill inappropriate results
  - Ranking modulation
    - Use a “rough” generic ranking, but personalize later
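The two uses of context can be sketched as a filter followed by a re-sort. The result records, field names, language rule, and boost value are all hypothetical; the slides do not prescribe how restriction or modulation is computed.

```python
def apply_context(results, allowed_languages, user_interests, boost=0.5):
    """Restrict results by language, then re-rank with a per-topic boost.

    Each result is a dict with 'url', 'score', 'language', and 'topics'.
    """
    # Result restriction: drop results that do not fit the context.
    kept = [r for r in results if r["language"] in allowed_languages]

    # Ranking modulation: start from the generic score and personalize it.
    def personalized(r):
        return r["score"] + boost * len(user_interests & set(r["topics"]))

    return sorted(kept, key=personalized, reverse=True)

# Hypothetical generic ranking for one query.
results = [
    {"url": "http://a.example", "score": 2.0, "language": "en", "topics": ["news"]},
    {"url": "http://b.example", "score": 1.8, "language": "en", "topics": ["dentistry"]},
    {"url": "http://c.example", "score": 3.0, "language": "fr", "topics": ["dentistry"]},
]

ranked = apply_context(results, allowed_languages={"en"}, user_interests={"dentistry"})
print([r["url"] for r in ranked])  # ['http://b.example', 'http://a.example']
```

The top generic result is removed by the restriction step, and the boost then lifts the dentistry page above the higher-scoring news page.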

Google: dentists bronx

Yahoo!: dentists (bronx)

Query expansion

Context transfer

No transfer

Context transfer

Transfer from search results