5246859543dd711e4fa6379450e2c23c.ppt
- Number of slides: 100
Digital Libraries (DL): Awareness and Discovery. Ariel Frank, Dept. of Computer Science, Bar-Ilan University. Joint research with Nir Yom Tov, Alon Kadury & Elina Masevich. A. Frank
Presentation motivation • Ad hoc and unsound use of Search Engines (SEs) does not help retrieve quality information on the Web. • Digital Libraries (DLs), on the other hand, provide high-quality retrieval of authoritative results, especially for exploratory search. • However, awareness and discovery of DLs on the Web are still lacking. • So what can be done about it?
Contents • SEs vs. DLs?! • DL Definition/Types • How to tilt the balance of SE/DL use? • SELFDL Model/Architecture • RIDDLE Model/Architecture • Future directions
Google/SE Awareness
So how to overcome Googlism?!
Often heard sayings • “What – is there something to search with besides search engines?” • “Sure I know all about search engines – I always use Google.” • “Sure I know all about directories – I always use Yahoo!” • “Sorry, never heard about digital libraries.” • “Listen, I’m used to classical libraries.” • “I can find only E-books in a digital library, no?”
Digital Library Vision?!
Sample list of Digital Libraries • LOC - Library of Congress American Memory (http://memory.loc.gov/ammem/) • NSDL - National Science DL (http://nsdl.org) • IPL - Internet Public Library (http://www.ipl.org) • CDL - California DL (http://www.cdlib.org) • ADL - Alexandria DL (http://www.alexandria.ucsb.edu) • BL - British Library (http://www.bl.uk/) • NZDL - New Zealand DL (http://www.nzdl.org/) • Einstein Archives Online (http://www.alberteinstein.info/)
Web Search Engines (diagram labels: Index, Directory, Meta-Search Engine; General vs. Specialty) • Which kind to use? The right one.
When not to use SEs? • You know it all. • You prefer asking friends (or paid experts). • You know the Web site for it (and didn’t forget the exact URL, or have auto-completion or a bookmark, or can access it through another known site). • You already found a specific/relevant digital library or database (maybe in the Invisible Web). • Tired of paid inclusions, SE spamming, and sponsored commercial results. • Tired of chasing down useless URLs.
When to use an Index? • Need to search for a narrow piece of information. • Have a specific objective/site in mind. • Want to find/rank many related Web sites. • Want to factor quantity in (an index has crawler-based results). • Need to check/fix spelling (based on Web statistics).
When to use a Directory? • Clear about the exact topic of your query. • Need general information on a rather broad topic/category. • Want to amass knowledge on a fairly wide subject. • Would like to browse (and then search) a certain area. • Want to factor quality in (a directory has human-powered results), not quantity. • Need information that is usually carefully evaluated and even annotated.
When to use a Meta-SE? • When a single Basic-SE fails to provide good results. • One-stop shopping: prefer to search multiple SEs/sites at once to get blended ranked results (so as to save effort/time). • When the query is simple (complex fields/options don't usually work). • Searching for multi-faceted topics. • Want to get clustered results to focus search on the relevant keywords. • Looking for current events/news.
When to use a Specialty-SE? • When a general SE fails to provide good results. • When your target is very topic/technology specific. • Want to find more than just Web pages/sites. • Need more results from the Invisible Web. • Want your search terms to more likely have the meanings you intended them to have.
SE Quantity vs. DL Quality? (figure labels: SE, DL)
SE vs. DL Potential Coverage (figure labels: Resources, Relevant, SE, DL)
Classical (Analogical) Library
So What is a Digital Library? • There are scores of definitions. • Most are very general and verbose. • A managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network. (Arms, W. Y., Digital Libraries, MIT Press, Cambridge, 2000.)
Definition - A Digital Library is: 1. Collection of digital objects 2. Collection of knowledge structures 3. Collection of library services 4. Library Categories: Domain, Focus & Topic 5. Quality Control 6. Preservation/Persistence
1. Collection of Digital Objects • Documents (e.g., texts, HTML pages) • Books • Journals • Multimedia (images, audio, video, etc.) • Charts/Maps • Data objects available directly or indirectly
2. Collection of Knowledge Structures • Metadata: Standards, Markup • Indices, Catalogs, Guides • Taxonomies, Ontologies, Thesauri • Dictionaries, Glossaries, Concordances • Gazetteers • Abstracts/Summaries
3. Collection of Library Services • Management (computerization, communication) • Collections development • Search (query formulation) and Browse interfaces • Multi-access/use for varied users • Online Help, Reference, Consultation • Logging, statistics and Performance Measurement Evaluation (PME) • SDI: Selective Dissemination of Information (Push mode)
4. Library Categories: Domain, Focus & Topic • Domain: belongs to an area (DNS TLDs). – edu, com, org, gov, us, il, ac.il, co.il, … • Focus: created to serve a certain community of users/patrons. – Academic, Public, National, School, … • Topic: the subject of the collection; can be relatively fine-grained. – Law, Medicine, Music, Web, …
5. Quality Control • Selection criteria. • All material is assessed and authorized (“certified”). • Adhere to licensing and copyrights. • Use of Digital Rights Management (DRM). • Integrity enforced (proven quality). • Use of filtering. • Support for profiling/stereotyping.
6. Preservation/Persistence • Access and usage are long term • Serves as an archive • Scanning and digitization • Quality reproduction of material • Material persistency – paper vs. digital media – digital formats (software tools)
Need for a delicate balance
Web Repositories Hierarchy (diagram) • Directory (Catalog, Guide, Subject Gateway) • Search Engine (SE): Basic SE (BSE), Meta SE (MSE), Popularity SE (PSE) • Digital Library (DL): Stand-alone DL (SDL), Federated DL (FDL), Harvested DL (HDL)
Types of DLs • Stand-alone Digital Library (SDL) – also self-contained, several collections • Federated Digital Library (FDL) – also confederated, networked • Harvested Digital Library (HDL) – also distributed
Stand-alone Digital Library (SDL) • The regular (classical) DL. • Implemented locally in a fully computerized fashion, with networked access. • Self-contained material: – edited/generated – scanned/digitized – purchased • Single or several digital collections.
Federated Digital Library (FDL) • Contains several autonomous libraries. • Based on common focus and topic. • Usually heterogeneous repositories. • Connected via a network. • Forms a flat unified library. • Transparent user interface. • The major problem is interoperability.
Harvested Digital Library (HDL) • Virtual library providing metadata-based access to relevant items distributed over the network. • Objects harvested into metadata (protocol was Harvest/SOIF, nowadays OAI-PMH can be used). • Harvests digital objects, not full DLs. • But has regular DL characteristics.
SDL vs. HDL
Parallel Evolution of SEs and DLs • Search Engine generations: 1st Generation – Basic SE (BSE): includes robots, indices, directories, basic/advanced user interfaces. 2nd Generation – Meta SE (MSE): uses several basic SEs simultaneously (federated search), ranks gathered pages by relevancy. 3rd Generation – Popularity SE (PSE): uses link analysis and use-frequency measures to filter and rank Web pages. • Digital Library generations: 1st Generation – Stand-alone (SDL): local, classical, focused material, digitized or scanned. 2nd Generation – Federated (FDL): comprised of autonomous SDLs representing related, possibly heterogeneous, network repositories. 3rd Generation – Harvested (HDL): contains only summaries and metadata structures; domain focused, of fine granularity.
Why are SEs overused? • I always use Google/Yahoo! • It’s just a quick search! • The truth? – not sure what I’m looking for. • I’m too used to using SEs. • SEs are more general, no? • SEs always give me enough answers. • SEs don’t care what my topic/domain is!
SE vs. DL - Server Side
SE vs. DL - Client Side
So what was the message?
Qualitative IR from Digital Library?! • Fact: Quantity orientation in SE. • Fact: Quality orientation in DL. • Assumption: Accessible DLs in sought-after domain. • Assumption: Usable information retrieval interfaces for DLs. • Result: High quality information retrieval from digital libraries!
Why are DLs underused (social)? • Too used to classical libraries (fond memories). • No public awareness (an unknown entity). • No public relations (unlike for Portals/SEs). • No money in it (marketing, banners, services). • If it’s a library, you have to pay to use it, no? • Are DLs up-to-date at all (as much as SEs)? • No DLs in my language (localization).
Why are DLs underused (general)? • Portals don’t offer DLs (services). • Aren’t DLs part of the Invisible/Deep Web? • DLs are just for experts! • Many interests – will need to know many DLs. • How to find them at all (need a jump-start)? • How to find relevant ones (sounds like search)? • How to find the right one (too many around)? • Lack of domain coverage (no DL in my area).
Why are DLs underused (technical)? • SEs crawl/index DLs, no? • Aren’t directories enough? • Aren’t SSEs (Specialized SEs) enough? • Too focused/limited (too fine granularity). • Need know-how to use DLs (unlike for SEs). • Non-usable interfaces (not user-friendly). • Mostly textual, not multimedia (like SEs are).
DL Awareness & Discovery Problems • Lack of use of and familiarity with DLs. • Hard to locate and identify DLs scattered around the Web. • Not enough metadata kept for and on the DLs. • DLs' topic, focus and user interfaces are not always clear and usable.
So how to tilt the balance of SE/DL use?
Sample (Digital) Library Directories • Berkeley Lib. Web (Library Servers via Web) – http://sunsite.berkeley.edu/Libweb/ • Academic Info: Digital Libraries – http://www.academicinfo.net/digital.html • Google Directory: Digital Libraries – http://directory.google.com/Top/Reference/Libraries/Digital/ • Librarians’ Index to the Internet – http://lii.org/
Use General SEs and DL Directories? • Why not just use large general SEs? – noisy results, insufficient metadata, too many (re)tries to get relevant results. • Why not just use existing DL Directories? – messy categorization, unfriendly UI, not all listed libraries are DLs, not really DL Directories.
Some possible directions/solutions • Get SEs to better index, reference, and advertise DLs. • Provide specialized SEs for locating DLs. • Construct and enhance DL directories. • DL coverage of more topics/domains. • Employ SE-like interfaces in DLs: – user-friendly interface (Google-like) – easy-to-use site (usability like in an SE)
If more time... we could SEEk more
Theory vs. Practice?
SELFDL Goals • Search Engine Locator For Digital Libraries (SELFDL) • Discover/identify/classify/generate DL resources/sites in the (in)visible Web. • Supply search tools for users to find relevant DLs for their needs. • Provide better, usable (thin) interfaces for locating DLs. • Raise awareness, knowledge, discovery and use of DLs.
Naming
SELFDL Model/Architecture (diagram components: Index, Directory, Meta)
SELFDL – gateway to world of DLs
SELFDL techniques • Harness SE technologies to locate DLs on the Web using: – Extractors: extract DLs from DL directories. – Crawlers: focused crawl in search of DLs. – Scripts: interface with Google/Yahoo APIs. • Use site analysis (search for DL terms). • Support Extended DC (Dublin Core) metadata for each DL. • Provide SELFDL database indexing.
DLs Identification test • Manual collection of a list of 65 terms that could indicate that a Web site is a DL. • Check whether there is a statistically significant connection between each term and the fact that a Web site is a DL. • Initial statistical test included 100 manually identified DLs and 100 random Web sites. • The statistical measure used (in SPSS) was cross tabulation, tested with Chi-square, the phi coefficient and Cramer’s V.
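The per-term test described on this slide can be sketched for a single term as a 2x2 cross tabulation (term present/absent against DL/non-DL). The study itself used SPSS; the function name and the example counts below are hypothetical, chosen only to show how Chi-square, phi and Cramer's V relate.

```python
import math


def two_by_two_association(a, b, c, d):
    """Association measures for one term, from a 2x2 cross tabulation.

                     DL    non-DL
    term present      a      b
    term absent       c      d

    Returns (chi_square, phi, cramers_v); no continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("a row or column total is zero")
    cross = a * d - b * c
    chi_square = n * cross ** 2 / denom
    phi = cross / math.sqrt(denom)
    # With 2 rows and 2 columns, min(rows - 1, cols - 1) == 1,
    # so Cramer's V reduces to sqrt(chi_square / n) == |phi|.
    cramers_v = math.sqrt(chi_square / n)
    return chi_square, phi, cramers_v


# Hypothetical counts: a term found on 60 of 100 DLs and 10 of 100 random sites.
chi_square, phi, cramers_v = two_by_two_association(60, 10, 40, 90)
# With 1 degree of freedom, chi_square > 3.841 is significant at the 0.05 level.
print(round(chi_square, 2), round(phi, 3), round(cramers_v, 3))
```

For a 2x2 table the phi coefficient and Cramer's V carry the same magnitude; phi additionally keeps the sign of the association.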
Results of DLs Identification test • Terms that have been found to be statistically significant: 1. documents, book(s), journal(s), electronic/internet/web resource(s) 2. catalog(s)/catalogue(s) 3. ask a librarian, patron(s) 4. digital library, digital collection(s) 5. copyright(s) 6. preservation/preserve, digitization/digitize
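A minimal sketch of how the significant terms listed above might drive site analysis: count how many distinct term groups appear in a page's text and flag the page as a DL candidate above a cutoff. The patterns mirror the slide's terms; the function names and the `min_terms` threshold are assumptions for illustration, not the SELFDL implementation.

```python
import re

# One pattern per significant term from the slide (singular/plural variants).
DL_TERM_PATTERNS = [
    r"\bdocuments\b",
    r"\bbooks?\b",
    r"\bjournals?\b",
    r"\b(?:electronic|internet|web) resources?\b",
    r"\bcatalog(?:ue)?s?\b",
    r"\bask a librarian\b",
    r"\bpatrons?\b",
    r"\bdigital library\b",
    r"\bdigital collections?\b",
    r"\bcopyrights?\b",
    r"\bpreserv(?:ation|e)\b",
    r"\bdigitiz(?:ation|e)\b",
]


def matched_dl_terms(page_text):
    """Return the patterns that occur at least once in the page text."""
    text = page_text.lower()
    return [p for p in DL_TERM_PATTERNS if re.search(p, text)]


def looks_like_dl(page_text, min_terms=4):
    """Flag a page as a DL candidate; min_terms is a hypothetical cutoff."""
    return len(matched_dl_terms(page_text)) >= min_terms


sample = ("Welcome to our digital library. Browse the catalogue of journals "
          "or ask a librarian about copyright.")
print(len(matched_dl_terms(sample)), looks_like_dl(sample))
```

A real classifier would presumably weight terms by their measured association strength rather than count them equally.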
SELFDL Directory UI
SELFDL Directory classifications (diagram: a Digital Library is classified by Topic, Focus and Domain) • Topic (DDC): Life Science: DDC 570, Earth Science: DDC 550, Biology: DDC 574 • Focus (Breeding): Children, Academic, Professional • Domain (IANA): Countries - .IL, Commercial - .COM, Educational - .EDU
Example DDC topic’s tree
SELFDL Directory results example
Advantages of SELFDL Directory • Contains just DLs. • Better classification/perspective based on domain/focus/topic. • Provides a user-friendly interface, like Google Directory. • Additional metadata (based on DC).
SELFDL Index UI
SELFDL Index • Results from focused Web crawling. • Can be searched for specific DL criteria: – keywords – DL type (SDL, FDL, HDL) – DL media/content (audio, E-books, E-serials, theses, movies, etc.) – Protocol support (OAI-PMH)
SELFDL Index example queries • topic: biology domain: com • algebra domain: com source: crawler • focus: children type: SDL protocol: OAI • topic: math media: ebooks
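The example queries mix free keywords with `field: value` criteria. A small parser sketch shows one way such a query could be split. The field names are taken from the examples on the slide; the function name, the tolerance for a space after the colon, and the handling of unknown fields are assumptions.

```python
import re

# Field names that appear in the slide's example queries.
KNOWN_FIELDS = {"topic", "domain", "source", "focus", "type", "protocol", "media"}
FIELD_RE = re.compile(r"(\w+):\s*(\S+)")


def parse_query(query):
    """Split a query into ({field: value}, [free keywords])."""
    fields = {}

    def take(match):
        name = match.group(1).lower()
        if name not in KNOWN_FIELDS:
            return match.group(0)  # unknown prefix: keep it as plain text
        fields[name] = match.group(2)
        return " "  # remove the recognized criterion from the keyword text

    remaining = FIELD_RE.sub(take, query)
    return fields, remaining.split()


print(parse_query("algebra domain: com source: crawler"))
print(parse_query("focus:children type:SDL protocol:OAI"))
```

Anything that is not a recognized criterion is kept as a keyword, so a plain search-engine style query still works unchanged.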
SELFDL Index results example
Advantages of SELFDL Index • Built according to insights/techniques of various studies in the field. • Supports directory and crawler results. • Provides specialized SE for DLs. • Easy-to-use query interface. • Supports advanced keyword search.
SELFDL Meta
SELFDL Meta Engine • Can be searched for DL keywords as in an ordinary search engine. • Intersects SE (i.e., Google/Yahoo API) results with the SELFDL database to extract the current DLs to be returned as the query response. • Performs like a regular SE – convenient for public use.
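The intersection step can be sketched as filtering a ranked list of SE result URLs down to those whose host is already known as a DL. Matching by host name, the function names and the sample URLs are assumptions for illustration; the slide does not say how SELFDL matches results against its database.

```python
from urllib.parse import urlparse


def host_of(url):
    """Lower-cased host of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def filter_to_known_dls(se_result_urls, known_dl_hosts):
    """Keep SE results whose host is a known DL, preserving SE rank order.

    Only the first hit per host is kept, so each DL appears once.
    """
    seen = set()
    hits = []
    for url in se_result_urls:
        host = host_of(url)
        if host in known_dl_hosts and host not in seen:
            seen.add(host)
            hits.append(url)
    return hits


se_results = [
    "http://www.example.com/biology-news",   # hypothetical non-DL result
    "http://www.nzdl.org/",
    "http://nsdl.org/search",
    "http://nzdl.org/other-page",
]
print(filter_to_known_dls(se_results, {"nzdl.org", "nsdl.org"}))
```

Keeping the SE's own order means the result list inherits the engine's ranking while containing only DLs.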
SELFDL intersects with Google & Yahoo! results (Venn diagram: Yahoo!, Google and SELFDL; the intersection holds the relevant DLs)
SELFDL Meta results example
Google “Sponsored” DL Interface
Advantages of SELFDL Meta • Provides all the advantages of the SELFDL model (UI, metadata). • Supports query interface for terms, like existing SEs. • Supports intersection between SE results and relevant DLs. • Supports different orders of results.
SELFDL prototype testing methods • Efficiency measures were computed for Directory and Meta. • Satisfaction surveys were given to users before and after SELFDL use. • A check was carried out to find the best GUI for SELFDL (regular or Google-like).
Efficiency testing methods • A series of queries was evaluated for result relevancy. • The F-measure was used as the efficiency measure, where: P – precision of results; R – relative recall of results; F – harmonic mean of P and R (equal weights), F = 2PR/(P+R). • The two components tested were SELFDL Meta and SELFDL Directory.
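A short sketch of the measure above. Relative recall is computed here against a pool of relevant results gathered from all compared tools, which is a common reading of the term but an assumption about how the study defined it; the function names and sample data are hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall: F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_and_relative_recall(retrieved, relevant_pool):
    """P and relative R for one tool.

    retrieved     -- results returned by the tool under test
    relevant_pool -- relevant results found by any of the compared tools
    """
    retrieved = set(retrieved)
    relevant_pool = set(relevant_pool)
    relevant_retrieved = retrieved & relevant_pool
    p = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    r = len(relevant_retrieved) / len(relevant_pool) if relevant_pool else 0.0
    return p, r


# Hypothetical query: 4 results returned, 2 of them relevant, pool of 3 relevant.
p, r = precision_and_relative_recall({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f_measure(p, r))
```

Because F is a harmonic mean, a tool scores well only when both precision and recall are reasonably high.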
SELFDL Directory vs. DL Directories (chart of R and P)
SELFDL Meta vs. Google & Yahoo (chart of R and P)
Users’ satisfaction surveys 1. Usability of Web utilities. 2. Ease of locating DLs. 3. Ease of identifying if site is DL. 4. DL results relevance. 5. DL metadata readability.
Google DL Interface
RIDDLE Goals • Resource Inquiry and Discovery in a DL Environment (RIDDLE) • Enable creation of HDLs by harvesting (filtering) relevant SDLs using OAI-PMH. • Enable construction of HDLs based on composition of lower-level HDLs, so as to increase the coverage of DLs’ topics. • Enable information exchange with SELFDL. • Raise awareness, knowledge, discovery and use of DLs.
Example of topics’ composition (tree diagram) • University: Life Sciences, Exact Sciences, Social Sciences • Exact Sciences: Computer Science, Chemistry • Computer Science: Software, Hardware
OAI-PMH Protocol • OAI-PMH - Open Archives Initiative (OAI) Protocol for Metadata Harvesting • Tackles the lack of uniformity and interoperability between data repositories, which makes information sharing between repositories difficult. • Addresses these problems by defining the way queries are sent to repositories and the way answers are received. • Mandates at least one metadata format that every repository must support – Dublin Core (DC).
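To make "the way queries are sent" concrete: an OAI-PMH request is an HTTP request to a repository's base URL carrying a `verb` argument plus verb-specific arguments. The sketch below only builds request URLs; the base URL and set name are hypothetical, while `Identify`, `ListRecords`, `metadataPrefix`, `set`, `from`, `until` and the `oai_dc` prefix come from the OAI-PMH 2.0 specification.

```python
from urllib.parse import urlencode


def identify_url(base_url):
    """URL asking a repository to describe itself (Identify verb)."""
    return base_url + "?" + urlencode({"verb": "Identify"})


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, until_date=None):
    """URL for harvesting records, optionally restricted by set and date.

    'oai_dc' is the unqualified Dublin Core format that every
    OAI-PMH repository is required to support.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # selective harvesting by Set
    if from_date:
        params["from"] = from_date    # e.g. "2006-01-01"
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)


BASE = "http://repository.example.org/oai"  # hypothetical data provider
print(identify_url(BASE))
print(list_records_url(BASE, set_spec="biology", from_date="2006-01-01"))
```

The response to either request is an XML document; a harvester would fetch and parse it, and follow any `resumptionToken` the repository returns for long result lists.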
RIDDLE Model/Architecture • Layer 5 – Presentation: Web interfaces • Layer 4 – Aggregated Service Providers: aggregated HDLs (Enhanced OAI-PMH) • Layer 3 – Service Providers: HDLs (OAI-PMH) • Layer 2 – Data Providers: SDLs • Layer 1 – Internet, Web
Use of OAI-PMH for FDLs/HDLs • OAI-PMH was planned to support harvesting, as manifested in its name, and also in its design (i.e., selective harvesting using “Sets”). • However, the number of FDLs that use the protocol is relatively large, while there are very few HDLs that employ it. • Since HDLs, unlike FDLs, filter the information, and not just federate it, we investigate ways by which HDLs can filter information using the OAI-PMH protocol.
Levels of information filtering • There are 3 levels where information filtering can be done, though each level has its own problems, mostly caused by lack of uniformity between SDLs: 1. Item-level metadata – relates to the (well-known) problems with the use of DC entries. 2. Group-level metadata – the use of OAI-PMH Sets for selective harvesting is not well defined, so it cannot easily be used for relating to groups of items. 3. Library-level metadata – the description of metadata at this level is not well defined. • Creation of HDLs using OAI-PMH is not fully supported.
Suggested extensions to OAI-PMH • Lack of uniformity in SDLs using OAI-PMH prevents effective creation of HDLs. • Provide for better harvesting/filtering capabilities from SDLs, by (re-)use of standards, as follows: 1. Item-level metadata – use of extended DC for metadata description, instead of just DC. 2. Group-level metadata – use of a DDC topic as a defined Set identifier. 3. Library-level metadata – use of extended DC for the library description field in the OAI-PMH Identify verb.
The RIDDLE Prototype • Provides for regular creation of FDLs. • Enables creation of HDLs by harvesting/filtering the relevant SDLs. • Supports HDL aggregation based on DDC hierarchy. • The user search results return not only items matching the query but also HDLs and SDLs related to the indicated topic. • The user can search the HDLs hierarchy (by textual or directory search) for a specific HDL and further down the aggregated HDLs tree.
RIDDLE entry page
Sample results page, first entry an HDL
HDL aggregation • The HDL aggregation capability is based on: – use of the DDC topics hierarchy. – assigning each HDL a suitable DDC topic identifier. – providing it with an OAI-PMH interface, similar to what data providers have, thus enabling and supporting an HDL hierarchy. – supporting both offline and online construction and corresponding search.
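A sketch of how a DDC topic identifier can place each HDL in an aggregation hierarchy. It relies only on the three-digit DDC structure (main class, division, section), so 574 rolls up into 570 and then 500. The function names and the HDL names are hypothetical; the codes 550, 570 and 574 are the ones used earlier in the deck.

```python
def ddc_parent(code):
    """Parent of a three-digit DDC class: section -> division -> main class."""
    if len(code) != 3 or not code.isdigit():
        raise ValueError("expected a three-digit DDC class such as '574'")
    if code[2] != "0":
        return code[:2] + "0"   # section 574 -> division 570
    if code[1] != "0":
        return code[0] + "00"   # division 570 -> main class 500
    return None                 # a main class has no parent


def ddc_ancestors(code):
    """All ancestors of a DDC class, nearest first."""
    chain = []
    parent = ddc_parent(code)
    while parent is not None:
        chain.append(parent)
        parent = ddc_parent(parent)
    return chain


def aggregate_hdls(hdl_topics):
    """Map every DDC class to the HDLs at or below it.

    hdl_topics -- {HDL name: DDC code}; returns {DDC code: sorted HDL names}.
    """
    tree = {}
    for name, code in hdl_topics.items():
        for node in [code] + ddc_ancestors(code):
            tree.setdefault(node, []).append(name)
    return {code: sorted(names) for code, names in tree.items()}


hdls = {"biology-hdl": "574", "life-science-hdl": "570", "earth-science-hdl": "550"}
for ddc_code, members in sorted(aggregate_hdls(hdls).items()):
    print(ddc_code, members)
```

An aggregated HDL for class 500 would then harvest from every HDL listed under it, using the same OAI-PMH interface that data providers expose.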
Directory search with topics
RIDDLE Experimentation • Several tests were carried out, as follows: 1. The quality of information retrieval when using a specific HDL vs. use of several FDLs. 2. Ease of discovering and using the aggregated HDLs. 3. User preferences in searching several FDLs vs. use of aggregated HDLs. • Initial testing indicates that use of HDLs and aggregated HDLs is more efficient than use of separate FDLs.
Efficiency measures for RIDDLE
Future directions • Better locating, identification and ranking of DLs and their categories/types. • Conduct wider, more significant, tests using SELFDL and RIDDLE. • Publish a beta Web version of SELFDL and RIDDLE for public use/feedback. • Better integration between SELFDL and RIDDLE. • Investigate awareness and discovery of DLs on the Web.
References • Sharon, T. & Frank, A., “Digital Libraries on the Internet”, IFLA'00 66th IFLA Council and General Conference, 13-18, Jerusalem, Israel, August 2000, http://www.ifla.org/IV/ifla66/papers/029-142e.htm • Hanani, U. & Frank, A., “The Parallel Evolution of Search Engines and Digital Libraries: their Convergence to the Mega-Portal”, ICDL'00 Kyoto Intl. Conf. on Digital Libraries: Research and Practice, 269-276, Kyoto, Japan, November 2000, http://csdl.computer.org/comp/proceedings/kyotodl/2000/1022/00/10220211abs.htm • Yom Tov, N. & Frank, A., “Harnessing Search Engine Technologies to Increase Awareness and Discovery of Digital Libraries”, 4th IEEE Intl. Conf. on IT: Research and Education (ITRE), Tel-Aviv, October 2006. • Kadury, A. & Frank, A., “Harvesting and Aggregation of Digital Libraries in the OAI Framework”, WEBIST 2007, 3rd Intl. Conf. on Web Information Systems and Technologies, 441-446, Barcelona, Spain, March 2007.
Bibliography • Arms, W. Y., Digital Libraries, MIT Press, Cambridge, 2000. • Hill, L., Buchel, O., Janée, G. & Lei, Z. M., “Integration of Knowledge Organization Systems into Digital Library Architectures”, Position Paper for 13th ASIS&T SIG/CR Workshop, “Reconceptualizing Classification Research”, 62-68, Philadelphia, PA, 2002. • Pace, A. K., The Ultimate Digital Library, American Library Association, Chicago, 2003. • Lossau, N., “Search Engine Technology and Digital Libraries: Libraries Need to Discover the Academic Internet”, D-Lib Magazine, Vol. 10, No. 6, June 2004. • Summann, F. & Lossau, N., “Search Engine Technology and Digital Libraries: Moving from Theory to Practice”, D-Lib Magazine Online, Vol. 10, No. 9, September 2004. • Lippincott, J. K., “Net Generation Students and Libraries”, EDUCAUSE Review, Vol. 40, No. 2, March/April 2005.
Still around :-?)