

  • Number of slides: 77

YAHOOGLE! GREAT NEWS FOR SEARCHERS!
Michael Hunter, Reference Librarian, Hobart and William Smith Colleges
For Rochester Regional Library Council Member Libraries’ Staff
Sponsored by the Rochester Regional Library Council
Supported by Regional Bibliographic Databases and Resources Sharing (RBDB) funds granted by the New York State Library
2006

For Today …
  • The Landscape of Search Today
  • A Look Under the Hoods: Database, Ranking, Personalization, Search Features, E-text initiatives, Foreign language content
  • Yahoogle and World Politics

Web Search @ 2006
Who’s crawling the Web?
  • Yahoo (owns AlltheWeb, Altavista, Inktomi, Overture)
  • Google
  • MSN
  • Ask Jeeves (owns Teoma)
  • Gigablast
NOTE: Ownership is different from database affiliation

Most popular services
As of July 2005:
  • Google 48%
  • Yahoo 29% (up 20% from 2004)
  • MSN 8% (up 30% from 2004)
  • All others 15% (AOL, AJ, Net, Gig)
Study by Harris Interactive (must purchase): www.harrisinteractive.com
Reported in SearchDay, 7/12/05: http://searchenginewatch.com/searchday/article.php/3519361

The Landscape of Search: Search Engine Overlap
Results compared from 12,500 random queries from the largest engines:
  • 85% were unique to one engine
  • 11% were shared by any two
  • 3% were shared by any three
  • 1% were shared by all
Study by Dogpile, U Pittsburgh and Penn State: CompareSearchEngines.dogpile.com/OverlapAnalysis
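A rough sketch of how overlap figures of this kind could be tallied: for each distinct URL, count how many engines returned it, then report the share of URLs seen by exactly one, two, three engines and so on. The function name, engine names and URLs below are hypothetical; this is not the study's actual method or data.

```python
from collections import Counter


def overlap_distribution(results_by_engine):
    """Return the share of distinct URLs returned by exactly k engines."""
    # Count, for every distinct URL, how many engines returned it.
    engines_per_url = Counter()
    for urls in results_by_engine.values():
        for url in set(urls):  # ignore duplicates within one engine
            engines_per_url[url] += 1

    total = len(engines_per_url)
    if total == 0:
        return {}

    # Group URLs by the number of engines that returned them.
    by_count = Counter(engines_per_url.values())
    return {k: by_count[k] / total for k in sorted(by_count)}


# Hypothetical first-page results for a single query (illustrative only).
sample = {
    "engine_a": ["u1", "u2", "u3", "u4"],
    "engine_b": ["u1", "u5", "u6"],
    "engine_c": ["u1", "u2", "u7"],
}
distribution = overlap_distribution(sample)
for engines, share in distribution.items():
    print(f"returned by {engines} engine(s): {share:.0%}")
```

A real comparison would also need to normalize URLs (trailing slashes, tracking parameters) before counting, which this sketch leaves out.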

Yahoo! Inc.
  • 1994: Jerry Yang, David Filo (Stanford); “Yet Another Hierarchical Officious Oracle”; human-edited subject directory
  • 1996: IPO
  • 2001: Major financial crisis and massive layoffs
  • 2002: Crawler results from Google added as a separate service (subject directory continues)
  • 2004: Replaced Google with its own crawler

Yahoo! Inc.
  • Revenue – Commercial sites pay for inclusion, advertising and pay-for-placement
  • Paradigm – Portal, “your home on the Web”
  • Ethos – Human input from the beginning

Google Inc.
  • 1998: Sergey Brin, Larry Page (Stanford); the name is a play on “googol” (a coined word for 1 followed by 100 zeros)
  • 2001-present: Acquired/developed user-responsive ranking algorithms, Application Programming Interfaces (API), Usenet archive, News, Froogle, Blogger, Video Search and Store, Local, Mobile, Print, Base
  • 2004: IPO

Google Inc.
  • Revenue – Advertising, pay for placement (Sponsored Links, AdWords); content (Video Store, Google Base)
  • Paradigm – “Clean” look (hidden portal)
  • Ethos – Technology-based from the beginning; prides itself on a minimum of direct human input in results ranking

A Few Yahoogle Metrics
  • Keynote Systems study, Dec. ’05: % reporting task success for Local search, Y 82%, G 83%
  • Keynote Systems study, Dec. ’05: % reporting task success for Image search, Y 66%, G 71%
  • Study of News site popularity by Greg Jarboe, Oct. ’05: #1 Yahoo, #2 CNN, #3 MSNBC, #14 Google

A Look Under the Hoods: Database, Ranking, Personalization, Search Features, E-text, Foreign Language Content

Database Size
  • August 2005 – End of “Size Wars”? Y: “19 billion web documents”. G: page count over 8 billion; page count removed soon afterwards
  • March 15, 2006 – Searching for a filtered word with “Strict Filtering” activated: Y 44.3 billion (web index), G 25.3 billion (web index)

Database Freshness
  • G aims to completely refresh its entire database every 3 weeks
  • G: News sites, blog sites and other rapidly changing sites are crawled every hour or sooner
  • G: Sites that change little are re-crawled less frequently
  • Study at U. of Dusseldorf (eprints.rclis.org/archive/00004619/01/jis_preprint.pdf): G “best overall”; Y updating “more chaotic”

Depth of Indexing
  • Y includes first 500K of each page crawled
  • G includes first 101K of each page crawled
  • G includes “partially indexed” or “unindexed” pages

What types of pages in Google are unindexed?
  • Dead or inaccurate links
  • Duplicate pages
  • Database-generated URLs
  • Pages with robots.txt or noindex meta tags
  • Pages on an intranet
  • Pages “waiting” to be indexed fully

Non-html filetypes
  • No reliable current 3rd party data available
  • Search term: bush (3/12/06)
      .pdf  Y 5.9m  G 20m
      .doc  Y 719k  G 1m
      .ppt  Y 1k    G 1k

Results Ranking
  • Relevancy and currency increasingly important
  • Yahoo: purchased AV and AlltheWeb; blends hits from its Directory with crawler results; discloses little about its relevancy processing

Results Ranking at Google
PageRank, Hilltop and more!
PageRank’s link-based processing:
  • Layer I: Do others think your site is of value, as demonstrated by linking to you? IF SO …
  • Layer II: Are these “others” in turn linked to by sites recognized through linkage within “web communities”?

PageRank’s Multi-layered processing
A Favorable Ranking Scenario:
  • A .com site selling prosthetics, linked TO by
  • a local orthopedic association, in turn linked TO by
  • a national orthopedic group, in turn linked TO by
  • the National Institutes of Health
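The chain in this scenario can be illustrated with the simplified, textbook form of the PageRank iteration: every page repeatedly passes a share of its score to the pages it links to. The sketch below is an assumption-laden illustration, not Google's production ranking; the page names, the 0.85 damping factor and the iteration count are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Score held by pages with no outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        # Each page splits its current score evenly among the pages it links to.
        for page, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical link graph mirroring the scenario, plus a shop nobody links to.
links = {
    "nih": ["national_group"],
    "national_group": ["local_association"],
    "local_association": ["prosthetics_shop"],
    "prosthetics_shop": [],
    "unlinked_shop": [],
}
scores = pagerank(links)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page:20s} {score:.3f}")
```

In this toy graph the shop at the end of the chain scores higher than an otherwise identical shop with no inbound links, which is the point of the scenario: authority flows down the chain of links.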

The trouble with PR …
  • PR allocates a value of authority to a page based on the number and quality of sites that link to it
  • A site with a high PR score MAY contain a page matching a query but not be “authoritative” for the topic of that query
  • Hilltop determines the authority of a page relative to the query or search term(s). A single page will rank differently depending on the query.

How does Hilltop do this?
  • Identifies “expert documents”: widely recognized, high quality directories of links or subject metasites (Open Directory, UK’s RDN, WWW Virtual Library et al.)
  • Runs terms from a given query against these expert documents
  • Filters out duplicates and affiliated sites
  • Creates a subset relevant to the query

How does Hilltop do this?
  • Runs the query in the main Google database
  • Assigns a LocalScore to these results based on the linkage to the subset created from “expert documents”
  • Final ranking based on this and PR, on-the-page factors and more

The trouble with Hilltop …
  • Dependent on “expert documents”
  • Most effective with broad subject queries
  • Must find a minimum of 2 “expert documents” linked to a page or results returned are zero
  • PR and other ranking processes then take over
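The Hilltop steps described on these slides can be sketched very roughly as follows: keep the expert documents that match the query, count affiliated experts only once, and score each target page by how many remaining experts link to it, dropping pages backed by fewer than two. The data layout, the "owner" field used to detect affiliation, and the use of a plain vote count as the LocalScore are all simplifying assumptions for illustration; the real scoring is more elaborate.

```python
def hilltop_local_scores(query_terms, experts, min_experts=2):
    """Toy LocalScore: number of independent, query-matching experts linking to a page."""
    terms = {term.lower() for term in query_terms}

    # Step 1: keep expert documents whose text mentions every query term,
    # counting affiliated experts (same owner) only once.
    relevant = {}
    for expert in experts:
        words = set(expert["text"].lower().split())
        if terms <= words:
            relevant.setdefault(expert["owner"], expert)

    # Step 2: count how many relevant experts link to each target page.
    votes = {}
    for expert in relevant.values():
        for target in set(expert["links"]):
            votes[target] = votes.get(target, 0) + 1

    # Step 3: pages backed by too few experts drop out entirely.
    return {page: n for page, n in votes.items() if n >= min_experts}


# Hypothetical expert documents (owners, text and links are made up).
experts = [
    {"owner": "directory_a", "text": "orthopedic prosthetics suppliers directory",
     "links": ["shop-a.example", "shop-b.example"]},
    {"owner": "directory_b", "text": "prosthetics and orthotics resources",
     "links": ["shop-a.example"]},
    {"owner": "directory_a", "text": "prosthetics mirror listing",
     "links": ["shop-b.example"]},
    {"owner": "directory_c", "text": "gardening links",
     "links": ["shop-b.example"]},
]
print(hilltop_local_scores(["prosthetics"], experts))
```

Here shop-b.example is dropped because its two matching experts share an owner and count as one, and a query with no matching experts returns nothing at all, at which point other ranking processes would have to take over.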

Personalization
  • Re-orders search results based on user’s past searches and click tracks
  • Ranking will change, depending on user profiles
  • Requires setting up a (free) account
  • Personalized home page is offered
  • Complex profiles are problematic, e.g. “Movies, computer hardware, the Internet, general news, astronomy”. SEARCH: cars. Which categories take precedence over others?

Yahoo! Personalization
  • No statement concerning a user’s search records if My Yahoo! is terminated by the user
  • Search log data for all My Yahoo! searches kept (via cookies)
  • Yahoo 360 creates an online identity: photos, restaurant reviews, personal blog and more
  • Yahoo!’s privacy policy: privacy.yahoo.com

Google Personalization
  • Search records personally associated with a user are deleted if service is dropped
  • Search log data for all Google searches kept (via cookies)
  • Google’s privacy policy: www.google.com/privacy.html
  • Bookmark entire web pages

Search Features: Yahoo!

Yahoo! Three ways in
  • www.yahoo.com – Portal home page (all services)
  • search.yahoo.com – Crawler only
  • dir.yahoo.com – Subject directory only

Yahoo!
  • Tabs for Images, Audio, Video, Local, News, Shopping
  • Advanced Search Features
  • Vertical Search Engines: Music, health, finance, shopping and over 20 more
  • Don’t forget the Subject Directory (now further down on the search page)
  • Alerts for news, weather, sports and more

Yahoo!’s Contextual Searching - Y!Q
  • Selected web pages or highlighted sections analyzed for word frequency and “concept extraction” and used as basis for a search
  • Results give basis for query in “context selection box”
  • Refinements include removing unwanted terms/phrases and “more like this” link
  • Requires download of free toolbar: toolbar.yahoo.com

Yahoo/OCLC toolbar
  • Searchers may restrict their results to the Open WorldCat database, currently at 57 million records
  • Displays library holdings in the searcher’s vicinity
  • Download (free) at www.oclc.org/toolbar

A lot to Yahoo! about
  • RSS feeds: offered as part of My Yahoo; user-friendly Reader/Aggregator provided, limited to 250,000 Yahoo-selected feeds; Yahoo content as RSS: News, Ask Yahoo, Buzz Index (popular searches), News Groups
  • Video search (beta), //video.search.yahoo.com: advanced search features include KW, format, file size, length, content filter
  • Creative Commons, search.yahoo.com/cc: content that is free to share or modify

Search Features: Google

Google Alerts (www.googlealert.com)
  • Automated running of user-created saved searches once a day or once a week
  • Examines the top 10 news results and the top 20 web search results and e-mails you any that you haven’t seen before
  • Requires a profile for each alert
  • Available in RSS format
  • Alerts also available for Google News ONLY

Search by Number
  • Enter number in main search box for: UPS tracking #, FAA airplane registration #
  • Enter number preceded by prefix for: FedEx tracking (“fedex xxxxxxx”), patent (“patent xxxxxxx”), FCC equipment IDs (“fcc xxxxx”)
  • Current weather at US airports, from FAA’s Air Traffic Control System: 3-letter code followed by “airport” (“roc airport”)
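A small sketch of how the prefixed query forms on this slide could be recognized before falling back to an ordinary web search. The patterns, labels and function name are assumptions for illustration (the exact number formats Google accepts are not given here), and the unprefixed UPS and FAA lookups are deliberately left out because they depend on recognizing the number format itself.

```python
import re

# Hypothetical patterns for the prefixed forms listed on the slide.
PREFIX_PATTERNS = [
    ("fedex_tracking", re.compile(r"^fedex\s+(\d+)$", re.IGNORECASE)),
    ("patent", re.compile(r"^patent\s+(\d+)$", re.IGNORECASE)),
    ("fcc_equipment_id", re.compile(r"^fcc\s+(\S+)$", re.IGNORECASE)),
    ("airport_weather", re.compile(r"^([a-z]{3})\s+airport$", re.IGNORECASE)),
]


def classify_query(query):
    """Return (kind, value) for a recognized special query, else a plain web search."""
    text = query.strip()
    for kind, pattern in PREFIX_PATTERNS:
        match = pattern.match(text)
        if match:
            return kind, match.group(1)
    return "web_search", text


for example in ["fedex 1234567", "patent 5123456", "roc airport", "yahoo directory"]:
    print(example, "->", classify_query(example))
```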

Google Answers
  • Fee-based answer service
  • User sets fee ($2.50 and up) and time frame for question (guidelines offered)
  • Searchable archive available
  • Comments can be added (by anyone) to unanswered questions
  • Users rate answers

Google Answers: Who are the “researchers”?
  • Must be 18 years old
  • Write an essay on why you want to be a researcher
  • Answer 5 sample questions
  • Training manual available at http://answers.google.com/answers/researchertraining.html

Google’s API (Application Program Interface)
  • Free programs for developers and researchers interested in incorporating Google in their applications: iterative searches on a topic (SDI), search via non-html interfaces, games that play with Web information
  • Daily limit of 1,000 queries
  • Uses SOAP (Simple Object Access Protocol), which is XML-based
  • More at //google.com/apis/index.html

Froogle
  • Locates information about products for sale online
  • Gives URLs of sites offering the item
  • Provides links to the exact page in the site where you can make the purchase

Froogle
  • Ranking follows normal Google ranking processes
  • Paid placements always clearly marked
  • “Sort by price”
  • Access at http://froogle.google.com or via Google home page

Google Earth (earth.google.com)
  • Geographic search application
  • Originally Keyhole 3D, now a free Google download
  • Images taken by satellites and aircraft “sometime in the last 3 years”
  • “Fly to” accepts an address or co-ordinates, returns a view from 3,000 ft. above, with zoom capabilities

Google Local for Mobile (google.com/gmm)
  • Free download
  • Unique ID associated with your phone
  • Simplified version of the web-based Local Search
  • Emphasis on maps and directions
  • Point-to-point directions limited to a certain area
  • Business listings offer address and phone number only
  • Does not support all mobile phones

Google Video Search/Store (video.google.com)
  • Index of closed captioning and text descriptions from selected TV and other video content after Dec. 2004
  • Results can include snippet, description, source, date, duration and hyperlink
  • Search results can be sorted by Free or For Sale
  • Purchasing information prominent

Google Base (base.google.com)
  • Allows users to send G information they would like it to have
  • A place to list and describe anything you would like, including: Events & Activities, Jobs, Products, Services, Personal Profiles, Reviews, Vehicles, Want Ads, News
  • Selling possible

Google Desktop 3.0’s “Search Across Computers”
  • Allows users to search across all their computers
  • Requires user to install and configure the feature
  • G uploads files from your computers, indexes them, transfers them to your other computers and deletes them from its servers
  • All computers involved must be online at the time

Google Desktop 3.0’s “Search Across Computers”
  • If one computer is not online, no data transfer can occur and the files remain on G’s servers for up to 30 days, when they are deleted
  • If service is deactivated, “some personal account information” may stay on G servers for up to 60 days
  • Deletions delayed due to G’s backup processes

Yahoogle and Blogs
  • Search restricted to www.blogger.com: Alito Y 210, G 23; Nancy Sinatra Y 201, G 116
  • G’s blogsearch.google.com (all blogs): Nancy Sinatra 13,482
  • Y has no separate blog search function

Electronic Text Initiatives

Yahoo!’s Open Content Alliance (10/3/05)
  • Large scale E-text initiative
  • Members include Yahoo, Internet Archive, National Archives (UK), RLG, LC, 8 US and 6 Canadian universities
  • Over 25,000 digitized copies of public domain AND copyrighted works
  • Works under copyright only available if permission granted by owner
  • Yahoo plans to include the content in its database or subject directory

Yahoo! and content for $$$

Google Print’s 2 divisions: Publisher Program and Library Project
Publisher Program
  • Publishers authorize G to scan and make searchable the full text of their books
  • Users see only the full page containing their search terms
  • Link to purchase copy

Google Print’s 2 divisions: Publisher Program and Library Project
Library Project
  • Scan and make searchable 15 million books, in and out of copyright, from Harvard, Stanford, Oxford, U. Michigan and NYPL
  • For works in copyright, users see only a few sentences around search terms
  • Users may browse full text of public domain works
NOTE: Not possible to print ANY material from either Google Print project

Library Project in 2005
  • June – Assoc. of American Publishers questions legality of Library Project
  • August 15 – G “temporarily halts” scanning in-copyright works; continues scanning public domain works
  • September 20 – Authors Guild files a formal complaint against G in NY Federal District Court alleging “massive copyright infringement”

Library Project in 2006
  • August 11 – University of California signs with Google to scan “several million” of the UC system’s 34 million titles
  • Google – For works in copyright, only the equivalent of an electronic “library catalog record” has been created. No infringement has occurred.

Google and content for $$$: Video Store and Google Base

Foreign Language Content

Language Features
  • Advanced Search by language: G 35, Y 37 (same plus Persian, Thai)
  • Advanced Search by location (country): G 83, Y 27

Language Features
  • G “Translate this page” for Spanish, German, French, Italian, Portuguese and, in beta, Japanese, Korean and Chinese (Simplified); service in Language Tools
  • Y “Translate this page” for same languages PLUS Greek, Dutch and Russian; service at http://66.218.71.231/language/

Content Comparisons
Searching by language:
  • Imre Kertesz (Hungarian): Y 751, G 575
  • Jose Saramago (Portuguese): Y 172,000, G 292,000
  • Moussaoui (Arabic): Y 96, G 216

Google by language or google.??

Google by language or google.??
  • Links to country-specific services at bottom of Language Tools
  • Imre Kertesz: by Hungarian language 575; on google.co.hu 449,000
  • Jose Saramago: by Portuguese language 292,000; on google.pt 1,470,000

Search and Politics: National and International

Child Online Protection Act of 1998
  • Justice Dept: Parental controls and filters insufficient to protect children against online pornography. Stricter governmental controls needed
  • Aug. 2005 – G, Y, Microsoft and AOL were issued subpoenas for all data relating to search terms and the sites users visited between June 1 and July 31, 2005

Child Online Protection Act of 1998
  • Y, MSN and AOL “have provided some of the information requested and taken steps to guard users’ privacy”; G refused
  • To date no request for IP address or other data linking search behavior to individual users

Implications
  • For Users – Invasion of privacy/search behavior, online identity, 1st Amendment
  • For Search Engine Industry – R&D focused on offering search results customized to an individual; requires tracking individual’s search behavior; can privacy be guaranteed?
  • Hearing in US Dist. Court, March 13

Search, Censorship and China
  • Chinese Government blocks access to politically sensitive and/or offensive sites
  • Jan. 2006 – Access to G cut off or degraded by Government
  • Search terms blocked included Taiwan’s independence, Tiananmen Square, democracy, human rights in China

Search, Censorship and China
  • When a service maintains an office or other facility in a country, it is bound to the laws of that country
  • G removed content from Google.cn to comply with demands
  • Y has complied with government demands for several years
  • MSN removed a blog critical of the Chinese government

Search, Censorship and China
  • House Subcommittee on Africa, Global Human Rights and International Operations
  • G, Y, MSN, AOL at hearing on Feb. 16
  • The point?? Leaders of the search industry should voluntarily set best practices for dealing with repressive regimes
  • If not, Congress may do it

Yahoogle!
  • Yahoo! strengths: Media, Vertical (specialized) engines, Popular Culture, Online communities, Local Search, Portal format
  • Google strengths: Overall ranking, Blog search, International news, Clean interface, Somewhat larger database (???)

Thank You and Good Luck!
Michael Hunter, Reference Librarian
Hobart and William Smith Colleges, Geneva, NY 14456
(315) 781-3552
hunter@hws.edu