Metadata Extraction & Web Archives: Automating the Record Creation Process Abbie Grotke / abgr@loc.gov Gina Jones / gjon@loc.gov Library of Congress Office of Strategic Initiatives Web Capture Team
Library of Congress Web Archives • Since 2000, 20+ thematic, event-based collections • 100+ TB of data collected • 12,500+ URLs http://www.loc.gov/lcwa
Web Archiving Tools • Crawling: – Heritrix – WARC • Access: – Wayback Machine – NutchWAX International Internet Preservation Consortium netpreserve.org
LC’s Web Archive Workflow • Identify & select URLs (LS or LAW) • Determine crawl strategy, create a seed list for crawling (OSI) • Sites harvested by Internet Archive or in-house crawlers (OSI) • Quality review (OSI & curators) • Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction
Describing the Archives • Collection-level MARC record in OPAC • Item-level MODS records in LCWA – One record per recommended URL for each distinct collection • With so many thousands of URLs to process, how do we streamline the process?
XML MODS Template
Metadata Extraction • For each URL that will be cataloged: – Get archived web site metadata – Combine with URL Nominations Database metadata – If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms) • Using XML template, we add collection and record level metadata • Create a single file for delivery
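The steps above can be sketched in Python as follows. This is a minimal illustration, not the Library's actual tooling: the dictionary layout, the sample values, and the helpers `merge_sources`, `build_mods_record`, and `build_delivery_file` are all hypothetical, and using the candidate's name as a subject term is an assumption. Only the MODS namespace and element names (`modsCollection`, `mods`, `titleInfo`, `title`, `abstract`, `subject`, `topic`, `location`, `url`, `relatedItem`) come from the MODS schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Return a namespace-qualified MODS tag name."""
    return f"{{{MODS_NS}}}{tag}"


def merge_sources(archived, nomination, candidate=None):
    """Combine the per-URL metadata sources into one dict (hypothetical layout)."""
    record = {**archived, **nomination}
    subjects = list(nomination.get("subjects", []))
    if candidate is not None:
        # Assumption: the candidate's name becomes an extra subject term.
        subjects.append(candidate["name"])
    record["subjects"] = subjects
    return record


def build_mods_record(record, collection_title):
    """Fill a simple MODS record from the merged dict."""
    mods = ET.Element(q("mods"))
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = record["title"]
    ET.SubElement(mods, q("abstract")).text = record.get("abstract", "")
    for term in record["subjects"]:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = record["url"]
    # Collection-level metadata, shared by every record in the collection.
    host = ET.SubElement(mods, q("relatedItem"), {"type": "host"})
    host_title = ET.SubElement(host, q("titleInfo"))
    ET.SubElement(host_title, q("title")).text = collection_title
    return mods


def build_delivery_file(records, collection_title):
    """Wrap all records in one modsCollection and serialize it as a single file."""
    collection = ET.Element(q("modsCollection"))
    for record in records:
        collection.append(build_mods_record(record, collection_title))
    return ET.tostring(collection, encoding="unicode")


# Made-up sample data, for illustration only.
archived = {"title": "Example Campaign Site", "abstract": "Official campaign site."}
nomination = {"url": "http://www.example.org/", "subjects": ["Elections"]}
candidate = {"name": "Doe, Jane", "party": "Independent", "state": "Ohio"}

merged = merge_sources(archived, nomination, candidate)
xml_text = build_delivery_file([merged], "Example Election Web Archive")
```

Merging into one dictionary before touching the XML keeps the source-specific logic (which database contributed which field) separate from the template-filling step, so a collection without candidate data simply omits the third argument.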
Data Sources for Metadata Extraction
URL Nominations Database • URL • Access Rights • Language(s) • Category • Subject Terms
Election Candidate Metadata • Name • URL • Party Affiliation • State • Race • District (House)
Archived Web Site Metadata From 1st capture: • Document Title • Keywords • Abstract • MIME Types From Wayback index: • Capture Dates (First & Last)
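As a rough sketch of how these fields might be pulled from a capture, the Python below reads the document title and `<meta>` tags from the first capture's HTML and derives first and last capture dates from a list of 14-digit Wayback timestamps (YYYYMMDDhhmmss). The class and function names are hypothetical, the sample HTML and timestamps are made up, and taking the abstract from the `description` meta tag is an assumption. MIME types are left out because they come from the capture records rather than the page markup.

```python
from datetime import datetime
from html.parser import HTMLParser


class PageMetadataParser(HTMLParser):
    """Collect the <title> text and named <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords, and abstract found in the page markup."""
    parser = PageMetadataParser()
    parser.feed(html)
    keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "keywords": [k.strip() for k in keywords.split(",") if k.strip()],
        # Assumption: the abstract is taken from the description meta tag.
        "abstract": parser.meta.get("description", ""),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates from 14-digit Wayback timestamps."""
    dates = sorted(datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps)
    return dates[0].date().isoformat(), dates[-1].date().isoformat()


# Made-up sample input, for illustration only.
html = (
    "<html><head><title> Jane Doe for Congress </title>"
    '<meta name="keywords" content="election, Ohio">'
    '<meta name="description" content="Official site.">'
    "</head><body></body></html>"
)
page = extract_page_metadata(html)
first, last = capture_date_range(["20041102120000", "20040915083000"])
```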
Combined Data in Template