7357abf48ae51993bc6514b0b7321128.ppt
- Number of slides: 88
Introduction to Digital Libraries
Week 8: Crawling, Indexing, Searching
Old Dominion University, Department of Computer Science
CS 751/851 Spring 2011
Michael L. Nelson <mln@cs.odu.edu>
03/01/11
ODU CS 751/851 Spring 2011 Michael L. Nelson mln@cs.odu.edu
Outline
• Web crawling
• Indexing what you've crawled
• Searching indexes
• Metasearching
• Focused crawling
• Search engine APIs
Browsers Are For Wimps!
Real CS types open up a telnet connection on port 80 and speak raw HTTP (read RFC 2616).
    % telnet www.google.com 80
    Trying 66.249.80.104...
    Connected to www.l.google.com.
    Escape character is '^]'.
    HEAD / HTTP/1.1
    Host: www.google.com
    User-Agent: I'm A Real CS Type -- Woohoo!
    Connection: close

    HTTP/1.1 200 OK
    Date: Sun, 28 Feb 2010 03:53:22 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Set-Cookie: PREF=ID=e1543e29439e6335:TM=1267329202:LM=1267329202:S=XSCl0teVmM_olC7P; expires=Tue, 28-Feb-2012 03:53:22 GMT; path=/; domain=.google.com
    Set-Cookie: NID=32=MNPYwxIjGIm8xUScf3Ti70v2tlZbVt2qpoday[much deletia]; expires=Mon, 30-Aug-2010 03:53:22 GMT; path=/; domain=.google.com; HttpOnly
    Server: gws
    X-XSS-Protection: 0
    Connection: close

    Connection to www.l.google.com closed by foreign host.
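The same raw exchange can be scripted. The following is a minimal Python sketch, not part of the original slides: the function names `build_head_request` and `head` and the default User-Agent string are hypothetical, and it assumes the server accepts plain HTTP on port 80.

```python
import socket


def build_head_request(host, path="/", user_agent="raw-socket-demo"):
    """Build the bytes of a HEAD request like the one typed into telnet."""
    lines = [
        f"HEAD {path} HTTP/1.1",
        f"Host: {host}",
        f"User-Agent: {user_agent}",
        "Connection: close",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def head(host, port=80, path="/"):
    """Send the request over a TCP socket and return the raw response text."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_head_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("iso-8859-1")


if __name__ == "__main__":
    print(head("www.google.com"))
```

Because the request asks for `Connection: close`, reading until the server closes the socket is enough to collect the whole response.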
Sometimes They Use cURL…
    % curl --head -A "Slightly less hard core" http://google.com/
    HTTP/1.1 301 Moved Permanently
    Location: http://www.google.com/
    Content-Type: text/html; charset=UTF-8
    Date: Sun, 28 Feb 2010 04:10:16 GMT
    Expires: Tue, 30 Mar 2010 04:10:16 GMT
    Cache-Control: public, max-age=2592000
    Server: gws
    Content-Length: 219
    X-XSS-Protection: 0

    % curl --head http://www.google.com/
    HTTP/1.1 200 OK
    Date: Sun, 28 Feb 2010 04:03:17 GMT
    Expires: -1
    Cache-Control: private, max-age=0
    Content-Type: text/html; charset=ISO-8859-1
    Set-Cookie: [similar to before]
    Server: gws
    X-XSS-Protection: 0
    Transfer-Encoding: chunked
More cURL
    % curl -o hello-from-google.html http://www.google.com/
    % Total % Received % Xferd Average Speed Time Current
    Dload Upload Total Spent Left Speed
    100 7123 0 0 94159 0 --:--:-- 119k
    % head -1 hello-from-google.html
    <!doctype html><head><meta http-equiv="content-type" content="text/html; charset=ISO-8859-1"><title>Google</title><script>window.google={kEI:"L-yJS8PuPOGKlQeYurTQDw",kEXPI:"23867,23934,23955",kCSI:{e:"23867,23934,23955",ei:"L-yJS8PuPOGKlQeYurTQDw",expi:"23867,23934,23955"},ml:function(){},kHL:"en",time:function(){return(new Date).getTime()},log:function(b,d,c){var a=new Image,e=google,g=e.lc,f=e.li;a.onerror=(a.onload=(a.onabort=function(){delete g[f]}));g[f]=a;c=c||"/gen_204?atyp=i&ct="+b+"&cad="+d+"&zx="+google.time();a.src=c;e.li=f+1},lc:[],li:0,Toolbelt:{}};
More info: http://curl.haxx.se/
    % man curl
Or wget…
    % wget http://www.cs.odu.edu/~mln/
    --2010-02-27 23:20:28-- http://www.cs.odu.edu/~mln/
    Resolving www.cs.odu.edu... 128.82.4.2
    Connecting to www.cs.odu.edu|128.82.4.2|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4480 (4.4K) [text/html]
    Saving to: `index.html'

    100%[======================>] 4,480 --.-K/s in 0s

    2010-02-27 23:20:28 (49.1 MB/s) - `index.html' saved [4480/4480]

    % head index.html
    <html>
    <head>
    <title> Home:: Michael L. Nelson, Old Dominion University </title>
    <!-- CSS stuff largely stolen from Carl Lagoze's Page -->
    <link rel="stylesheet" type="text/css" href="mln.css"/>
    </head>
    <body>
More info: http://www.gnu.org/software/wget/
    % man wget
wget the Whole Thing…
    % wget -r -l 1 http://www.cs.odu.edu/~mln/
    --2010-02-27 23:21:41-- http://www.cs.odu.edu/~mln/
    Resolving www.cs.odu.edu... 128.82.4.2
    Connecting to www.cs.odu.edu|128.82.4.2|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 4480 (4.4K) [text/html]
    Saving to: `www.cs.odu.edu/~mln/index.html'

    100%[======================>] 4,480 --.-K/s in 0s

    2010-02-27 23:21:41 (35.6 MB/s) - `www.cs.odu.edu/~mln/index.html' saved [4480/4480]

    Loading robots.txt; please ignore errors.
    --2010-02-27 23:21:41-- http://www.cs.odu.edu/robots.txt
    Reusing existing connection to www.cs.odu.edu:80.
    HTTP request sent, awaiting response... 200 OK
    Length: 54 [text/plain]
    Saving to: `www.cs.odu.edu/robots.txt'

    100%[======================>] 54 --.-K/s in 0s

    2010-02-27 23:21:41 (1.66 MB/s) - `www.cs.odu.edu/robots.txt' saved [54/54]

    --2010-02-27 23:21:41-- http://www.cs.odu.edu/~mln/mln.css
    Reusing existing connection to www.cs.odu.edu:80.
    [much deletia]
Checking the Results…
    % ls
    hello-from-google.html index.html www.cs.odu.edu/
    % ls -lR
    www.cs.odu.edu/:
    total 80
    drwxr-xr-x 8 mln faculty   1024 2010-02-27 23:21 ~mln/
    -rw-r--r-- 1 mln faculty  60868 2010-02-27 23:21 index.html
    -rw-r--r-- 1 mln faculty     54 2009-07-24 09:01 robots.txt

    www.cs.odu.edu/~mln:
    total 496
    -rw-r--r-- 1 mln faculty 277152 2009-12-21 16:55 cv.pdf
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 images/
    -rw-r--r-- 1 mln faculty   4480 2010-01-05 14:33 index.html
    -rw-r--r-- 1 mln faculty   1642 2006-08-11 18:53 lineage.html
    -rw-r--r-- 1 mln faculty  92868 2007-03-19 13:00 mln-ad.pdf
    -rw-r--r-- 1 mln faculty   1635 2009-06-29 14:39 mln.css
    -rw-r--r-- 1 mln faculty  80339 2009-12-21 16:56 nsf-cv-2009.pdf
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 personal/
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 pubs/
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 research/
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 service/
    drwxr-xr-x 2 mln faculty     80 2010-02-27 23:21 teaching/
    -rw-r--r-- 1 mln faculty  13960 2010-01-10 14:58 travel.html
    [deletia]
General Crawler Design
• cURL grabs a single URL; wget can recursively grab an entire site
• all crawlers follow the same algorithm:
  1. start with seed URLs
  2. add seeds to frontier (i.e., URLs to be crawled)
  3. download URL from frontier
  4. extract URLs from representation and add to frontier
  5. repeat from #3 until some condition is met:
     – frontier empty (!)
     – depth level met
     – # of files, storage limit met
     – etc.
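The five steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not how wget or Heritrix is implemented: the names `crawl`, `LinkExtractor`, `max_depth`, and `max_pages` are hypothetical, and `fetch` is an assumed caller-supplied function that returns the HTML for a URL (or `None` on failure), so the loop can be tried without touching the network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags in one HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_depth=1, max_pages=100):
    """Breadth-first crawl; fetch(url) is a hypothetical downloader."""
    frontier = deque((url, 0) for url in seeds)    # steps 1-2: seeds into frontier
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:   # step 5: stop conditions
        url, depth = frontier.popleft()
        html = fetch(url)                          # step 3: download
        if html is None:
            continue
        visited.append(url)
        if depth >= max_depth:                     # depth limit: do not expand further
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:                  # step 4: extract and enqueue
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append((absolute, depth + 1))
    return visited
```

Passing a dictionary's `get` method as `fetch` gives a tiny in-memory "web" for experiments; a real crawler would also need politeness delays and robots.txt checks.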
robots.txt
• A very simple specification of suggestions for polite robots
  – can't enforce robots.txt!
  – http://www.robotstxt.org/
• Examples:
  – http://www.odu.edu/robots.txt
  – http://www.cs.odu.edu/robots.txt
  – http://www.google.com/robots.txt
  – http://www.cnn.com/robots.txt
  – http://www.ebay.com/robots.txt
  – http://www.yahoo.com/robots.txt
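A polite crawler checks these suggestions before each download. The sketch below uses Python's standard `urllib.robotparser`; the helper name `build_parser` and the user-agent string `my-crawler` are hypothetical, and the single rule is the one that appears in the www.cs.odu.edu robots.txt captured in the ARC file later in this deck.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_lines):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


rules = build_parser([
    "User-agent: *",
    "Disallow: /~extract/DocumentCollections",
])

# Ask before fetching; nothing stops a rude robot from skipping this step.
for url in (
    "http://www.cs.odu.edu/~mln/",
    "http://www.cs.odu.edu/~extract/DocumentCollections/x.pdf",
):
    print(url, rules.can_fetch("my-crawler", url))
```

The check is purely advisory on the client side, which is the slide's point: the server cannot enforce it.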
Heritrix
• Large-scale, open source web crawler from the Internet Archive
  – http://crawler.archive.org/
  – http://sourceforge.net/projects/archivecrawler/files/
  – http://webteam.archive.org/confluence/display/Heritrix/2.0+Tutorial
Heritrix
    % ls
    HOWTO-Launch-Heritrix.txt  bin/     jobs/
    LICENSE.txt                conf/    lib/
    README.txt                 extras/
    % cd bin
    % ls
    arcreader*                    foreground_heritrix*      htmlextractor.cmd*
    arcreader.cmd*                foreground_heritrix.cmd*  jmxclient*
    cmdline-jmxclient-0.10.5.jar* heritrix*                 make_reports.pl*
    dependencies.xsl*             heritrix.cmd*             manifest_bundle.pl*
    extractor*                    hoppath.pl*               xdocToTxt.xsl*
    extractor.cmd*                htmlextractor*
    % ./heritrix -a admin
    WARNING: $HERITRIX_HOME/conf/jmxremote.password not found.
    WARNING: Disabling remote JMX.
    Sat Feb 27 23:39:44 EST 2010 Starting heritrix
    .....
    No JNDI context.
    Engine registered at org.archive.crawler:instance=14054523,jmxport=-1,name=Engine,type=org.archive.crawler.framework.Engine,host=michael-nelsons-computer-2.local
    Web UI listening on localhost:8080.
    %
Heritrix Opening Page
Crawl Engine Page
Seed Page
Sheets Page
Sheet Info Page
Copy Page
Ready Jobs Page
Launch Page
Launch Page (Refreshed)
Crawl Log
Crawl Report
Frontier Report
Command Line Access
    % ls
    arcs/                    hosts-report.txt         seeds-report.txt
    associations-report.txt  logs/                    seeds.txt
    config.txt               mimetype-report.txt      sheets/
    crawl-manifest.txt       processors-report.txt    state/
    crawl-report.txt         responsecode-report.txt
    frontier-report.txt      scratch/
    % cat seeds.txt
    http://www.cs.odu.edu/~mln/
    % head mimetype-report.txt
    [#urls] [#bytes] [mime-types]
    220 11674626 image/jpeg
    135 1362095 text/html
    45 32647 image/png
    39 17143886 application/pdf
    18 41414362 application/vnd.ms-powerpoint
    8 506 text/dns
    7 3625 text/plain
    2 51069 image/gif
    1 136855 application/rss+xml
Command Line Access (2)
    % head hosts-report.txt
    [#urls] [#bytes] [host] [#robots] [#remaining]
    453 71765472 www.cs.odu.edu 0 1265
    8 506 dns: 0 0
    5 64183 www.cs.unc.edu 0 0
    3 4643 feed.mikle.com 0 0
    2 137396 ws-dl.blogspot.com 0 0
    2 48107 www.ariadne.ac.uk 0 0
    2 10720 www.dlib.org 0 0
    2 147462 www.openarchives.org 0 0
    2 10304 www.wctatel.net 0 0
    % ls logs/
    alerts.log           progress-statistics.log  uri-errors.log
    crawl.log            recover.gz
    nonfatal-errors.log  runtime-errors.log
    % ls arcs/
    IAH-20100228051342-00000-michael-nelsons-computer-2.local.arc.gz
(W)ARC Files
figure from: http://www.iwaw.net/05/kunze.pdf
Arc Files
    % cd arcs
    % gunzip -c * | less
    filedesc://IAH-20100228051342-00000-michael-nelsons-computer-2.local.arc 0.0 20100228051342 text/plain 1162
    1 1 InternetArchive
    URL IP-address Archive-date Content-type Archive-length
    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <arcmetadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:arc="http://archive.org/arc/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://archive.org/arc/1.0/" xsi:schemaLocation="http://archive.org/arc/1.0/ http://www.archive.org/arc/1.0/arc.xsd">
    <arc:software>Heritrix 2.0.1 http://crawler.archive.org</arc:software>
    <arc:hostname>michael-nelsons-computer-2.local</arc:hostname>
    <arc:ip>10.0.1.2</arc:ip>
    <dcterms:isPartOf>My first crawl</dcterms:isPartOf>
    <dc:description>Basic seeds sites crawl.</dc:description>
    <arc:operator>Michael Nelson</arc:operator>
    <arc:http-header-user-agent>Mozilla/5.0 (compatible; heritrix/2.0.1 +http://www.cs.odu.edu/~mln/)</arc:http-header-user-agent>
    <arc:http-header-from>mln@cs.odu.edu</arc:http-header-from>
    <arc:robots>CLASSIC</arc:robots>
    <dc:format>ARC file version 1.1</dc:format>
    <dcterms:conformsTo xsi:type="dcterms:URI">http://www.archive.org/web/researcher/ArcFileFormat.php</dcterms:conformsTo>
    </arcmetadata>
Arc File (2)
    dns:www.cs.odu.edu 10.0.1.1 20100228051339 text/dns 55
    20100228051339
    xenon.cs.odu.edu. 86400 IN A 128.82.4.2

    http://www.cs.odu.edu/robots.txt 128.82.4.2 20100228051358 text/plain 321
    HTTP/1.1 200 OK
    Date: Sun, 28 Feb 2010 05:14:15 GMT
    Server: Apache/2.2.14 (Unix) DAV/2 PHP/5.2.11
    Last-Modified: Fri, 24 Jul 2009 13:01:27 GMT
    ETag: "53cb-36-46f7333c037c0"
    Accept-Ranges: bytes
    Content-Length: 54
    Connection: close
    Content-Type: text/plain

    User-agent: *
    Disallow: /~extract/DocumentCollections

    http://www.cs.odu.edu/~mln/ 128.82.4.2 20100228051402 text/html 4752
    HTTP/1.1 200 OK
    Date: Sun, 28 Feb 2010 05:14:18 GMT
    Server: Apache/2.2.14 (Unix) DAV/2 PHP/5.2.11
    Last-Modified: Tue, 05 Jan 2010 19:33:14 GMT
    ETag: "13ed9f-1180-47c6fe8b172bb"
    Accept-Ranges: bytes
    Content-Length: 4480
    Connection: close
    Content-Type: text/html

    <html>
    <head>
    <title>
    [much, much deletia]
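Each ARC record starts with a one-line header whose fields follow the layout declared at the top of the file (URL, IP-address, Archive-date, Content-type, Archive-length). A minimal Python sketch of reading that line is shown here; the names `ArcHeader` and `parse_arc_header` are hypothetical, and the sketch assumes the five fields are separated by single spaces and the date is a 14-digit timestamp, as in the records above.

```python
from collections import namedtuple
from datetime import datetime

# Field names mirror the layout line: URL IP-address Archive-date Content-type Archive-length
ArcHeader = namedtuple("ArcHeader", "url ip archive_date content_type length")


def parse_arc_header(line):
    """Split one ARC record header line into typed fields."""
    url, ip, date, content_type, length = line.strip().split(" ")
    return ArcHeader(
        url=url,
        ip=ip,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        length=int(length),  # bytes that follow the header line
    )


header = parse_arc_header(
    "http://www.cs.odu.edu/robots.txt 128.82.4.2 20100228051358 text/plain 321"
)
print(header.url, header.archive_date.isoformat(), header.length)
```

Note that the length (321 here) covers the stored HTTP response headers as well as the 54-byte robots.txt body, which is why it is larger than the `Content-Length` value.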
WARC Files
• WARC = Web ARC files -- an expanded/updated version of ARC files
  – http://bibnum.bnf.fr/WARC/
  – http://www.digitalpreservation.gov/formats/fdd000236.shtml
  – http://www.iwaw.net/05/kunze.pdf
• Main improvement: generalize the types of records to not just be HTTP responses (resource representations)
WARC Record Types
• Warcinfo: typically 1 per .warc file, crawl-level metadata
• Response: representation returned from server
• Resource: a representation not returned from server (e.g., a file from a file system)
• Request: the HTTP request headers that produced the response
• Metadata: metadata about another WARC record
• Revisit: a duplicate of a prior record
• Conversion: indicates format migration
• Continuation: record spans more than 1 WARC file
the following 8 examples from: http://archive-access.sourceforge.net/warc_file_format-0.9.html
Warcinfo
    warc/0.9 1012 warcinfo filedesc:test-20050708010101-00001-crawl017.archive.org.warc.gz 20050708010101 text/xml uuid:cbad35b7-e591-4b43-8a67-9d1d8f9ef4cd

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <warcmetadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcterms="http://purl.org/dc/terms/" xmlns:warc="http://archive.org/warc/0.9/">
    <warc:software>Heritrix 1.4.0 http://crawler.archive.org</warc:software>
    <warc:hostname>crawling017.archive.org</warc:hostname>
    <warc:ip>207.241.227.234</warc:ip>
    <dcterms:isPartOf>testcrawl-20050708</dcterms:isPartOf>
    <dc:description>testcrawl with WARC output</dc:description>
    <warc:operator>IA_Admin</warc:operator>
    <warc:http-header-user-agent>Mozilla/5.0 (compatible; heritrix/1.4.0 +http://crawler.archive.org)</warc:http-header-user-agent>
    <dc:format>WARC file version 0.9</dc:format>
    <dcterms:conformsTo xsi:type="dcterms:URI">http://www.archive.org/documents/WarcFileFormat.php</dcterms:conformsTo>
    </warcmetadata>
Request
    warc/0.9 298 request http://www.archive.org/images/logo.jpg 20050708010101 message/http uuid:f569983a-ef8c-4e62-b347-295b227c3e51
    IP-Address: 207.241.224.241

    GET /images/logo.jpg HTTP/1.0
    Host: www.archive.org
    User-Agent: Mozilla/5.0 (compatible; crawler/1.4 +http://example.com)
Response
    warc/0.9 7583 response http://www.archive.org/images/logo.jpg 20050708010101 message/http uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
    IP-Address: 207.241.224.241
    Related-Record-ID: uuid:f569983a-ef8c-4e62-b347-295b227c3e51
    Checksum: sha1:2ZWC6JAT6KNXKD37F7MOEKXQMRY75YY4

    HTTP/1.x 200 OK
    Date: Fri, 08 Jul 2005 01:01 GMT
    Server: Apache/1.3.33 (Debian GNU/Linux) PHP/5.0.4-0.3
    Last-Modified: Sun, 12 Jun 2005 00:31:01 GMT
    Etag: "914480-1b2e-42ab8245"
    Accept-Ranges: bytes
    Content-Length: 6958
    Keep-Alive: timeout=15, max=100
    Connection: Keep-Alive
    Content-Type: image/jpeg

    [6958 bytes of binary data here]
Resource
    warc/0.9 7141 resource file://webserver/htdoc/images/logo.jpg 20050710010101 image/jpeg uuid:a6c3132b-49b8-4fd5-8072-45ce66d48a4b
    Checksum: sha1:37F7MOEKXQMRY75YY42ZWC6JAT6KNXKD

    [6958 bytes of binary data here]
Metadata
    warc/0.9 395 metadata http://www.archive.org/images/logo.jpg 20050708010101 text/xml uuid:a4acff63-c213-4f35-9652-41a0e2dfc492
    Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

    <?xml version="1.0"?>
    <harvestmetadata xmlns="http://archive.org/harvest/0.9/">
    <discovered-via>http://www.archive.org</discovered-via>
    <download-time-ms>565</download-time-ms>
    </harvestmetadata>
Revisit
    warc/0.9 395 revisit http://www.archive.org/images/logo.jpg 20050808010101 text/xml uuid:ad522b3b-d68c-464a-b5e2-38149cfb511d
    Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

    <?xml version="1.0"?>
    <revisit xmlns="http://archive.org/revisit/0.9/">
    <server-response-excerpt>
    HTTP/1.x 304 Not Modified
    Date: Mon, 08 Aug 2005 01:01 GMT
    Etag: "914480-1b2e-42ab8245"
    </server-response-excerpt>
    </revisit>
Conversion
    warc/0.9 4111 conversion http://www.archive.org/images/logo.jpg 20150708010101 image/neoimg uuid:c631da8a-e8db-44a8-84c5-9cc848dff35a
    Related-Record-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5
    Checksum: sha1:XQMRY75YY42ZWC6JAT6KNXKD37F7MOEK

    [3098 bytes of binary data here]
Continuation
    warc/0.9 39514322 continuation http://www.archive.org/images/logo.jpg 20150708010101 message/http uuid:c0d36ada-af8c-4608-8409-e60818b1d9e9
    Segment-Number: 2
    Segment-Origin-ID: uuid:a4b26b6b-f918-4136-af04-f859d75aebe5

    [39514114 bytes of binary data here]
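All eight examples share the same first line: format version, record length, record type, URI, timestamp, content type, and record ID. As a rough sketch (the function name `parse_warc09_header` and the dictionary keys are hypothetical labels, and the seven-field layout is inferred from the examples rather than taken from the specification text), that line can be split and checked against the eight record types in Python:

```python
WARC_RECORD_TYPES = {
    "warcinfo", "response", "resource", "request",
    "metadata", "revisit", "conversion", "continuation",
}


def parse_warc09_header(line):
    """Split the first line of a WARC 0.9 record as laid out in the examples."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}")
    version, length, record_type, uri, date, content_type, record_id = parts
    if record_type not in WARC_RECORD_TYPES:
        raise ValueError(f"unknown record type: {record_type}")
    return {
        "version": version,
        "length": int(length),
        "type": record_type,
        "uri": uri,
        "date": date,
        "content_type": content_type,
        "record_id": record_id,
    }


request = parse_warc09_header(
    "warc/0.9 298 request http://www.archive.org/images/logo.jpg "
    "20050708010101 message/http uuid:f569983a-ef8c-4e62-b347-295b227c3e51"
)
print(request["type"], request["length"], request["record_id"])
```

The `record_id` is what later records point back to through headers such as `Related-Record-ID` and `Segment-Origin-ID`.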
Indexing Your Pages
• Several open source options, two main branches:
  – MySQL FULLTEXT index
    • http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
  – Apache Lucene
    • http://lucene.apache.org/java/docs/
    • http://en.wikipedia.org/wiki/Lucene
• Others:
  – Swish-e
    • http://swish-e.org/
  – see: "A comparison of open source search engines" http://wrg.upf.edu/WRG/dctos/Middleton-Baeza.pdf
Lucene Related Projects
• Nutch -- a crawler & indexer
  – http://lucene.apache.org/nutch/
• NutchWAX -- extensions to Nutch to work with ARC files
  – http://archive-access.sourceforge.net/projects/nutch/
• Solr -- "enterprise" extensions to Lucene
  – http://lucene.apache.org/solr/
Wanted to Use NutchWAX…
    % ls
    CHANGES.txt  conf/               nutch-1.0-dev.jar
    LICENSE.txt  contrib/            nutch-1.0-dev.job
    NOTICE.txt   default.properties  nutch-1.0-dev.war
    README.txt   docs/               plugins/
    bin/         lib/                src/
    build.xml    logs/               webapps/
    % cd bin
    % ./start-all.sh
    starting namenode, logging to /Users/mln/Downloads/nutchwax-0.12.9/bin/../logs/hadoop-mln-namenode-michael-nelsons-computer-2.loc
    Password:
    localhost: starting datanode, logging to /Users/mln/Downloads/nutchwax-0.12.9/bin/../logs/hadoop-mln-datanode-michael-nelsons-computer-2.loc
    Password:
    localhost: starting secondarynamenode, logging to /Users/mln/Downloads/nutchwax-0.12.9/bin/../logs/hadoop-mln-secondarynamenode-michael-nelsons-compu
    localhost: Exception in thread "main" java.lang.NullPointerException
    localhost: at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:130)
    localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:116)
    localhost: at org.apache.hadoop.dfs.NameNode.getAddress(NameNode.java:120)
    localhost: at org.apache.hadoop.dfs.SecondaryNameNode.initialize(SecondaryNameNode.java:12
    localhost: at org.apache.hadoop.dfs.SecondaryNameNode.<init>(SecondaryNameNode.java:108)
    localhost: at org.apache.hadoop.dfs.SecondaryNameNode.main(SecondaryNameNode.java:460)
    starting jobtracker, logging to /Users/mln/Downloads/nutchwax-0.12.9/bin/../logs/hadoop-mln-jobtracker-michael-nelsons-computer-2.lo
    Password:
    localhost: starting tasktracker, logging to /Users/mln/Downloads/nutchwax-0.12.9/bin/../logs/hadoop-mln-tasktracker-michael-nelsons-computer-2.
So the Import Did Not Work…
    % ./nutchwax import /Users/mln/Downloads/heritrix-2.0.1/jobs/
        completed-My first crawl/arcs/
        IAH-20100228051342-00000-michael-nelsons-computer-2.local.arc.gz
    Fatal error: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1113)
    at org.archive.nutchwax.Importer.run(Importer.java:663)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.archive.nutchwax.Importer.main(Importer.java:699)
Solr Input Files
    % ls
    books.csv       monitor2.xml  solr.xml
    hd.xml          mp500.xml     test_utf8.sh*
    ipod_other.xml  payload.xml   utf8-example.xml
    ipod_video.xml  post.jar      vidcard.xml
    mem.xml         post.sh*
    monitor.xml     sd500.xml
    % grep memory *xml
    mem.xml: <field name="cat">memory</field>
    mp500.xml: <field name="features">memory card: CompactFlash, Micro Drive, SmartMedia, Memory Stick Pro, SD Card, and MultiMediaCard</field>
    payload.xml: <field name="cat">memory</field>
    payload.xml: <field name="payloads">electronics|6.0 memory|3.0</field>
    payload.xml: <field name="cat">memory</field>
    payload.xml: <field name="payloads">electronics|4.0 memory|2.0</field>
    payload.xml: <field name="cat">memory</field>
    payload.xml: <field name="payloads">electronics|0.9 memory|0.1</field>
Solr Admin Page
Solr Search for "Memory"
POSTing an HTML File
note: REST-based approach of POSTing files to the "extract" resource
    % curl 'http://localhost:8983/solr/update/extract?literal.id=doc100&commit=true' -F "myfile=@../docs/tutorial.html"
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">87</int></lst>
    </response>
More info: http://wiki.apache.org/solr/ExtractingRequestHandler
Query for "Tutorial"
http://localhost:8983/solr/select?q=tutorial
Retaining the HTML Content
    % curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true' -F "myfile=@../docs/tutorial.html"
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">109</int></lst>
    </response>
More info: http://wiki.apache.org/solr/ExtractingRequestHandler
Query for "Tutorial" in Content
http://localhost:8983/solr/select?q=attr_content:tutorial
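Query URLs like the one above are easy to build programmatically. This is a small Python sketch using only the standard library; the helper name `solr_select_url` is hypothetical, and the base URL simply reuses the local Solr instance shown on the slides.

```python
from urllib.parse import urlencode


def solr_select_url(query, base="http://localhost:8983/solr/select"):
    """Return a select URL with the query string percent-encoded."""
    return base + "?" + urlencode({"q": query})


# A fielded query: the colon separates the field name from the term.
url = solr_select_url("attr_content:tutorial")
print(url)
```

`urlencode` percent-encodes the colon as `%3A`; the server decodes it back, so the query is equivalent to the hand-typed URL.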
Apache Tika for File Wrappers
see: http://incubator.apache.org/tika/
http://lucene.apache.org/tika/formats.html
    % curl "http://localhost:8983/solr/update/extract?&extractOnly=true" --data-binary @pdf55.pdf -H 'Content-type: application/pdf' | less
    % Total % Received % Xferd Average Speed Time Current
    Dload Upload Total Spent Left Speed
    100 314k 0 0 100 314k 0 335k --:--:-- 336k 101
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader"><int name="status">0</int><int name="QTime">920</int></lst><str><?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title>p46-bergmark.dvi</title>
    </head>
    <body>
    <div>
    <p>Collection Synthesis
    Donna Bergmark
    Cornell Digital Library Research Group
    Upson Hall
    Ithaca, NY 14853
    bergmark@cs.cornell.edu
    ABSTRACT
    The invention of the hyperlink and the HTTP transmission protocol caused an amazing new structure to appear on the
    [deletia]
The "Deep Web"
• "Deep Web" >> "Surface Web"
  – term created by M. Bergman in 2000
    • http://dx.doi.org/10.3998/3336451.0007.104
    • in 2000, estimated 100-500X greater than surface web (likely much less now)
    • see also: http://www.mkbergman.com/458/new-currents-in-the-deep-web/
• If you can't get to the content buried in the various deep web sites…
Metasearching (adapted from Gravano, 1997)
• 3 functions of a metasearcher
  – choosing the sources to query
    • the source-metadata problem
  – dispatching the query to those sources
    • the query language problem
  – merging the query results
    • the rank-merging problem
Source-Metadata Problem
• How do you choose which sources to query?
  – manual
    • applicable only for small #’s of sources
  – automatic
    • how does the metasearcher “know” the nature of the various sources? what if there are 1000s of different sources?
  – approaches
    • extract enough of their publicly available information and guess
    • have the source explicitly export a description of itself
Query-Language Problem
• Different remote sources use different search engines with different syntaxes
  – boolean vs. vector
    • even if all support boolean, still could have different syntax
  – different field names for fielded searching
  – stemming vs. no stemming
  – different stop lists
Rank-Merging Problem
• Each source ranks its own results, and that ranking is valid only locally
  – there is no global scale to rank against, so ranked results cannot be “shuffled” together meaningfully
  – example: my repository's 15th hit might be more relevant than your repository's 1st hit
• Issues:
  – proprietary ranking algorithms
  – even if the algorithms are known or even homogeneous, the collection that the document comes from impacts its ranking
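To make the difficulty concrete, here is a deliberately naive Python sketch of one common baseline: rescale each source's raw scores by that source's declared score range, then sort everything together. This is not a method from the slides, and the function names and sample data are hypothetical. It puts the scores on a common numeric scale, but it does nothing about the deeper issues listed above (different algorithms, different collections), so the merged order is only as meaningful as the sources' scores are comparable.

```python
def normalize(hits, score_range):
    """Rescale (doc, raw_score) pairs to 0..1 using the source's declared range."""
    low, high = score_range
    span = high - low
    return [(doc, (score - low) / span if span else 0.0) for doc, score in hits]


def merge(results):
    """results maps source name -> (hits, score_range); returns one ranked list."""
    merged = []
    for source, (hits, score_range) in results.items():
        for doc, score in normalize(hits, score_range):
            merged.append((score, source, doc))
    merged.sort(key=lambda item: item[0], reverse=True)
    return [(source, doc, round(score, 3)) for score, source, doc in merged]


# Hypothetical data: source A scores in 0..1, source B scores in 0..1000.
ranked = merge({
    "A": ([("a1", 0.9), ("a2", 0.4)], (0.0, 1.0)),
    "B": ([("b1", 800), ("b2", 100)], (0, 1000)),
})
for entry in ranked:
    print(entry)
```

Sorting on raw scores instead would put both of B's hits ahead of A's simply because B uses bigger numbers, which is the "no global scale" problem in miniature.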
Remote Indices
• Assuming you are providing access to data that is controlled by many heterogeneous sources, you can have:
  – control or influence over the remote providers; getting them to follow some conventions or protocols
    • STARTS, Z39.50
  – no control or influence over the remote providers; it's up to you to work around syntax and content issues
    • almost everything else…
STARTS
• Stanford Protocol Proposal for Internet Retrieval and Search - STARTS
• http://www-db.stanford.edu/~gravano/starts.html
  – Gravano, et al., 1997
• Stanford coordinated a protocol proposal effort with participation from the various search engine vendors
  – Infoseek, Fulcrum, PLS, Verity, etc.
• Key point: you now have influence over the remote sources
STARTS Goals
• Define a protocol that strikes a balance between being simple enough for the vendors to implement, but powerful enough to tap engines w/ advanced features
• Social as well as technical issues
  – the vendors are competitors, and details about the workings of the search engines are proprietary
STARTS Terminology and Assumptions
• Source = a collection of documents
• Resource = a collection of sources
• Assumptions
  – no nested documents
  – no non-textual documents
  – security and error conditions ignored
  – protocol is for machine-machine communication; users do not write STARTS queries
STARTS Query Language
• A query consists of:
  – filter expression
    • boolean construct instructing what to search for in the source collection
  – ranking expression
    • vector-type construct instructing how to rank the returned results
STARTS Sample Query
• (example 1 from paper):
• filter expression
    ((author “Ulman”) and (title “databases”))
• ranking expression
    list ((body-of-text “distributed”) (body-of-text “databases”))
• hits must have “Ulman” in the author field and “databases” in the title field; when ranking, the source should give preference to documents that have the keywords “distributed” and “databases” in the text
STARTS
• l-string: a string (“Michael Nelson”) or a qualified string [en-US “Michael Nelson”]
  – en-US is the language/country encoding
• a term is an l-string modified by an unordered list of attributes
• an attribute is either a field or a modifier
Fields
• Fields associate terms with a section of a document (cf. Dublin Core)
    Title (Req)
    Author
    Body-of-text
    Document-text (New) (For relevance feedback)
    Date/time-last-modified (Req) (Formatted according to the International Standard ISO 8601 (e.g., "1996-12-31"))
    Any (Req)
    Linkage (Req) (URL of the document)
    Linkage-type (MIME type of the document)
    Cross-reference-linkage (List of URLs in document)
    Language (The language(s) of the document, as a list of language tags as defined in RFC 1766.) (For example, a query term (language "en-US") matches a document with value for the language field "en-US es". This document has parts in American English and in Spanish.)
    Free-form-text (New) (A string, maybe representing a query in some query language not in the protocol, that the source somehow knows how to interpret)
from: http://www-db.stanford.edu/~gravano/starts.html
Modifiers
    <, <=, =, >, != (If applicable (e.g., for fields like "Date/time-last-modified"), default: =)
    Phonetic (soundex) (Default: no soundex)
    Stem (Default: no stemming)
    Thesaurus (New) (Default: no thesaurus expansion)
    Right-truncation (Default: the term "as is," without right-truncating it)
    Left-truncation (Default: the term "as is," without left-truncating it)
    Case-sensitive (New) (Default: case insensitive)
from: http://www-db.stanford.edu/~gravano/starts.html
Filter Operators
• and
• or
• and-not
  – note: no “not” -- “and-not” implies a positive component
• prox
  – proximity matching
Ranking Expression
• list ((“distributed” 0.7) (“databases” 0.3))
  – treat “distributed” as more important than “databases”
• “list” is the most common operator in ranking expressions, but sources can choose to weight the following rankings differently:
  – (“distributed” and “databases”)
  – list(“distributed” “databases”)
Summary Object Interchange Format
    @SQuery{
    Version{10}: STARTS 1.0
    FilterExpression{50}: ((author "Garcia Molina") and (title "databases"))
    RankingExpression{61}: list((body-of-text "distributed") (body-of-text "databases"))
    DropStopWords{1}: T
    DefaultAttributeSet{7}: basic-1
    DefaultLanguage{5}: en-US
    AnswerFields{12}: title author
    MinDocumentScore{3}: 0.5
    MaxNumberDocuments{2}: 10
    }
A complete STARTS query. The numbers in braces are the # of bytes (characters) in each value.
from: http://www-db.stanford.edu/~gravano/starts.html
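The byte counts in a SOIF object are tedious to get right by hand, so a client would normally compute them. Below is a minimal Python sketch of such a serializer; the function names `soif_attribute` and `soif_object` are hypothetical, and the sketch assumes values are counted as UTF-8 bytes and written one attribute per line.

```python
def soif_attribute(name, value):
    """Format one attribute as name{byte-count}: value."""
    size = len(value.encode("utf-8"))
    return f"{name}{{{size}}}: {value}"


def soif_object(template, attributes):
    """Format a whole SOIF object from a list of (name, value) pairs."""
    lines = [f"@{template}{{"]
    lines.extend(soif_attribute(name, value) for name, value in attributes)
    lines.append("}")
    return "\n".join(lines)


query = soif_object("SQuery", [
    ("Version", "STARTS 1.0"),
    ("FilterExpression", '((author "Garcia Molina") and (title "databases"))'),
    ("RankingExpression",
     'list((body-of-text "distributed") (body-of-text "databases"))'),
    ("MaxNumberDocuments", "10"),
])
print(query)
```

The explicit lengths let a parser read each value without any quoting or escaping rules, which keeps the format simple for vendors to implement.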
STARTS Query Result
  @SQResults{
  Version{10}: STARTS 1.0
  Sources{8}: Source-2
  ActualFilterExpression{50}: ((author "Garcia Molina") and (title "databases"))
  ActualRankingExpression{26}: (body-of-text "databases") /* maybe "distributed" was a stop word */
  NumDocSOIFs{1}: 4
  }
Note: the client sends query Q, and the server sends back Q’, the query it actually answered. This is because the server might not implement everything in Q.
This is a header; NumDocSOIFs is the # of hits.
from: http://www-db.stanford.edu/~gravano/starts.html

STARTS Query Result
  @SQRDocument{
  Version{10}: STARTS 1.0
  RawScore{4}: 0.82
  Sources{8}: Source-2
  linkage{51}: http://www-db.stanford.edu/pub/gravano/1995/vldb.ps
  title{44}: Generalizing GlOSS to Vector-Space Databases
  author{34}: Luis Gravano, Hector Garcia-Molina
  TermStats{89}: (body-of-text "distributed") 10 0.31 190
  (body-of-text "databases") 15 0.51 232
  DocSize{3}: 248 /*kilobytes*/
  DocCount{5}: 10213 /*tokens in doc*/
  }
  ...
  @SQRDocument{
  ...
  }
TermStats can be used by the metasearcher to re-rank the merged results.
from: http://www-db.stanford.edu/~gravano/starts.html

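As a rough illustration of re-ranking with TermStats, the sketch below merges hits from several sources and re-scores them with a simple normalized term frequency. Several things here are assumptions, not part of the slide: that the first number after each term in TermStats is the term's frequency in the document, the choice of scoring formula, the `Hit` class, and the second (made-up) hit from Source-1. A real metasearcher would pick its own scoring function.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    source: str
    linkage: str
    raw_score: float            # RawScore as reported by the source
    term_freqs: dict[str, int]  # assumed: first TermStats number per term
    doc_count: int              # DocCount: tokens in the document


def normalized_tf(hit: Hit) -> float:
    """Assumed re-scoring: query-term occurrences per document token."""
    if hit.doc_count == 0:
        return 0.0
    return sum(hit.term_freqs.values()) / hit.doc_count


def merge_and_rerank(result_lists: list[list[Hit]]) -> list[Hit]:
    """Flatten per-source result lists and sort by the common score."""
    merged = [hit for hits in result_lists for hit in hits]
    return sorted(merged, key=normalized_tf, reverse=True)


# The hit from the slide, plus one hypothetical hit from another source.
slide_hit = Hit("Source-2",
                "http://www-db.stanford.edu/pub/gravano/1995/vldb.ps",
                0.82, {"distributed": 10, "databases": 15}, 10213)
other_hit = Hit("Source-1", "http://example.org/hypothetical.ps",
                0.40, {"distributed": 4, "databases": 6}, 1200)

ranked = merge_and_rerank([[slide_hit], [other_hit]])
for hit in ranked:
    print(hit.source, round(normalized_tf(hit), 5), hit.raw_score)
```

The point of the sketch is that RawScore values from different sources are not directly comparable, whereas statistics such as term counts and document length let the metasearcher compute one score for all hits.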
STARTS Source Metadata
  @SMetaAttributes{
  Version{10}: STARTS 1.0
  SourceID{8}: Source-1
  FieldsSupported{17}: [basic-1 author]
  ModifiersSupported{19}: {basic-1 phonetics}
  FieldModifierCombinations{39}: ([basic-1 author] {basic-1 phonetics})
  QueryPartsSupported{2}: RF
  ScoreRange{7}: 0.0 1.0
  RankingAlgorithmID{8}: Acme-1
  ...
  DefaultMetaAttributeSet{6}: mbasic-1
  source-language{8}: en-US es
  source-name{18}: Stanford DB Group
  linkage{26}: http://www-db.stanford.edu/cgi-bin/query
  content-summary-linkage{38}: ftp://www-db.stanford.edu/cont_sum.txt
  date-changed{9}: 1996-03-31
  }
from: http://www-db.stanford.edu/~gravano/starts.html

STARTS Content Summary
  @SContentSummary{
  Version{10}: STARTS 1.0
  Stemming{1}: F
  StopWords{1}: F
  CaseSensitive{1}: F
  Fields{1}: T
  NumDocs{3}: 892
  Field{5}: title
  Language{5}: en-US
  TermDocFreq{11023}: "algorithm" 100 53
  "analysis" 50 23
  ...
  Field{5}: title
  Language{5}: en-US
  TermDocFreq{12020}: "databases" 89 21
  "distributed" 102 45
  ...
  Field{5}: title
  Language{2}: es
  TermDocFreq{1211}: "algoritmo" 23 11
  "datos" 59 12
  ...
  }
In-depth content summaries aid in the selection of sources (automated and manual). Field/Language pairs can be repeated for incremental updates.
from: http://www-db.stanford.edu/~gravano/starts.html

STARTS Resource Metadata
  @SResource{
  Version{10}: STARTS 1.0
  SourceList{83}: Source_1 ftp://www.stanford.edu/source_1 Stanford-1
  Source_2 ftp://www.stanford.edu/source_2 Stanford-1
  }
This lists all the sources that a Resource knows about. From the SourceList, we can retrieve SMetaAttributes SOIFs.
from: http://www-db.stanford.edu/~gravano/starts.html

STARTS Status
• Source-Metadata:
  – SResource, SMetaAttributes, SContentSummary SOIFs
• Query-Language:
  – separate filter and ranking expressions; sources can ignore non-implemented portions
• Rank-Merging:
  – pass in ranking expressions; the SQRDocument SOIF has enough info for the metasearcher to re-rank if it desires

STARTS Status
• STARTS was designed explicitly to address the 3 metasearcher issues
• However, it requires:
  – a lot of coordination and cooperation on the part of the vendors
  – clients/proxies that can issue STARTS requests
• The protocol was stable, but the vendors never really adopted it
  – it did play an important role in some of the subsequent systems that were developed

Z39.50
• Distributed searching protocol in use today; origins go back to the 1980s
  – good resources:
    • Lynch, D-Lib Magazine, 3(2), 1997, http://www.dlib.org/dlib/april97/04lynch.html
    • LC Maintenance Agency page: http://www.loc.gov/z3950/agency/
  – very different from “usual” Internet-based protocols:
    • stateful, session-oriented
    • results stored server-side
    • not text-based (cf. SMTP, NNTP, HTTP, etc.)
  – widely used in library environments for searching & distribution of MARC records

SRU
• Search / Retrieval via URL
  – an HTTP profile for Z39.50
  – SRU - the REST implementation
    • REST - URL in, XML out
  – http://www.loc.gov/standards/sru/
  – (the protocol formerly known as) SRW - the SOAP implementation
    • now known as "SRU via HTTP SOAP"
    • SOAP - XML in, XML out
  – http://www.loc.gov/standards/sru/specs/transport.html#soap

SRU Operations
• Explain
  – description of the database contents + server functionality
    • cf. OAI-PMH “Identify”
    • http://www.loc.gov/standards/sru/specs/explain.html
• Scan
  – describe term frequency in database
    • cf. STARTS “Content Summary”
    • http://www.loc.gov/standards/sru/specs/scan.html
• SearchRetrieve
  – submit request + retrieval options
    • cf. STARTS “Query” (w/o ranking filter)
    • http://www.loc.gov/standards/sru/specs/search-retrieve.html
demo: http://www.loc.gov/standards/sru/resources/servers.html

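"URL in, XML out" means a searchRetrieve request is just a URL with a handful of query parameters. The sketch below builds one with the Python standard library. The endpoint `http://sru.example.org/catalog` is a placeholder, not a real server, and the helper name `sru_search_url` is made up; the parameter names (`operation`, `version`, `query`, `startRecord`, `maximumRecords`) are the ones used by SRU 1.1, and the `query` value is a CQL expression.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def sru_search_url(base_url: str, cql_query: str, start_record: int = 1,
                   maximum_records: int = 10, version: str = "1.1") -> str:
    """Build an SRU searchRetrieve request URL (the response would be XML)."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,            # CQL; urlencode handles the escaping
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return f"{base_url}?{urlencode(params)}"


# Placeholder endpoint; substitute a real SRU server to try it.
url = sru_search_url("http://sru.example.org/catalog",
                     'dc.title = "databases"')
print(url)

# Round-trip check: decoding the URL gives back the original CQL string.
print(parse_qs(urlsplit(url).query)["query"][0])
```

Fetching the resulting URL from a real server (for example with `urllib.request.urlopen`) would return the XML response; that step is left out here because it needs a live endpoint.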
Common Query Language (CQL)
• Goal: make simple things simple, and complex things possible
  – hide the implementation details
• CQL examples
  – http://zing.z3950.org/cql/intro.html
  – http://www.loc.gov/standards/sru/specs/cql.html
  – many of the same issues that we saw earlier in STARTS…

OpenSearch
• Sharing search results
  – http://www.opensearch.org/Specifications/OpenSearch/1.1
• Description document: URL templates for how clients can invoke searches
• Response elements: extending formats like RSS & Atom with search information (e.g., pagination, specific kinds of queries: examples, "repeat this query", etc.)

OpenSearch in Atom

OpenSearch on YouTube
  <link rel="search" type="application/opensearchdescription+xml" href="http://www.youtube.com/opensearch?locale=en_US" title="YouTube Video Search">

YouTube OpenSearch Description
  <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
    <ShortName>YouTube Video Search</ShortName>
    <Description>Search for videos on YouTube</Description>
    <Tags>youtube video</Tags>
    <Image height="16" width="16" type="image/vnd.microsoft.icon">http://www.youtube.com/favicon.ico</Image>
    <Url type="text/html" template="http://www.youtube.com/results?search_query={searchTerms}&amp;page={startPage?}&amp;utm_source=opensearch" />
    <Query role="example" searchTerms="cat" />
  </OpenSearchDescription>

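A client invokes a search by filling in the URL template from the description document. The sketch below shows one way to do that; `fill_template` is a made-up helper, not part of any OpenSearch library. It handles only plain parameter names such as `{searchTerms}` and `{startPage?}` (no namespace prefixes), and it replaces an unsupplied optional parameter (one ending in `?`) with the empty string. The template string is the attribute value as it reads after XML parsing.

```python
import re
from urllib.parse import quote_plus

# Template from the YouTube description document.
TEMPLATE = ("http://www.youtube.com/results?search_query={searchTerms}"
            "&page={startPage?}&utm_source=opensearch")


def fill_template(template: str, **values) -> str:
    """Substitute {name} and {name?} placeholders in a URL template."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        optional = match.group(2) == "?"
        if name in values:
            return quote_plus(str(values[name]))  # URL-encode the value
        if optional:
            return ""                             # optional: leave empty
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{(\w+)(\??)\}", substitute, template)


print(fill_template(TEMPLATE, searchTerms="galaxie 500"))
print(fill_template(TEMPLATE, searchTerms="cat", startPage=2))
```

Omitting `searchTerms` raises `KeyError`, since that placeholder has no trailing `?` and is therefore required.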
Wikipedia OpenSearch Description
  <?xml version="1.0"?>
  <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/" xmlns:moz="http://www.mozilla.org/2006/browser/search/">
    <ShortName>Wikipedia (en)</ShortName>
    <Description>Wikipedia (en)</Description>
    <Image height="16" width="16" type="image/x-icon">http://en.wikipedia.org/favicon.ico</Image>
    <Url type="text/html" method="get" template="http://en.wikipedia.org/w/index.php?title=Special:Search&amp;search={searchTerms}" />
    <Url type="application/x-suggestions+json" method="get" template="http://en.wikipedia.org/w/api.php?action=opensearch&amp;search={searchTerms}&amp;namespace=0" />
    <Url type="application/x-suggestions+xml" method="get" template="http://en.wikipedia.org/w/api.php?action=opensearch&amp;format=xml&amp;search={searchTerms}&amp;namespace=0" />
    <moz:SearchForm>http://en.wikipedia.org/wiki/Special:Search</moz:SearchForm>
  </OpenSearchDescription>

  <link rel="search" type="application/opensearchdescription+xml" href="/w/opensearch_desc.php" title="Wikipedia (en)" />

Focused Crawling
• Focused crawling is crawling with a topical purpose
• Premise: search engines have only some of the desired content; the rest is undiscovered
From Bergmark, JCDL 2002

Building Collections
• Begin with high quality seeds
  – either by hand or SE query
• Generate a centroid that captures the "aboutness" of the seeds in aggregate
• Crawl lots of documents, measure distance of those documents from centroid, discard those that are too far away
From Bergmark, JCDL 2002

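The three steps above can be sketched in a few lines. This is an illustration under stated assumptions, not Bergmark's actual method: the slides do not say how terms are weighted or how distance is measured, so the sketch uses raw term counts, an averaged centroid, and cosine similarity (high similarity = small distance) with an arbitrary threshold of 0.3. All function names and the sample texts are made up.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words vector: lowercase alphanumeric tokens and their counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def centroid(vectors: list[Counter]) -> dict[str, float]:
    """Average the seed vectors term by term."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine_similarity(a: dict, b: dict) -> float:
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_on_topic(seeds: list[str], crawled: list[str],
                  threshold: float = 0.3) -> list[str]:
    """Keep crawled documents that are close enough to the seed centroid."""
    center = centroid([term_vector(seed) for seed in seeds])
    return [doc for doc in crawled
            if cosine_similarity(term_vector(doc), center) >= threshold]


seeds = ["distributed databases and query processing",
         "metasearch over distributed text databases"]
crawled = ["query processing in distributed databases",
           "recipes for chocolate cake"]
print(keep_on_topic(seeds, crawled))
```

With these sample texts only the first crawled document survives; the second shares no terms with the seeds, so its similarity to the centroid is zero. A real focused crawler would also use the score to decide which outgoing links to follow next.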
Search Engine APIs
• Or, you can skip the crawler step altogether if you believe SEs have sufficient coverage
• Yahoo Boss
  – http://developer.yahoo.com/search/boss_guide/
• Google Ajax Search API
  – http://code.google.com/apis/ajaxsearch/web.html
• Bing API
  – http://www.bing.com/developers/

Yahoo Boss
• Ford Galaxie (XML)
• Galaxie 500 Band (XML; keyterms)
• Galaxie 500 (XML; delicious top tags; long abstract)
  – next page of the above