

Metadata Extraction & Web Archives: Automating the Record Creation Process
Abbie Grotke / abgr@loc.gov
Gina Jones / gjon@loc.gov
Library of Congress, Office of Strategic Initiatives, Web Capture Team

Library of Congress Web Archives
• Since 2000, 20+ thematic, event-based collections
• 100 TB+ of data collected
• 12,500+ URLs
http://www.loc.gov/lcwa

Web Archiving Tools
• Crawling:
  – Heritrix
  – WARC
• Access:
  – Wayback Machine
  – NutchWAX
International Internet Preservation Consortium (netpreserve.org)

LC’s Web Archive Workflow
• Identify & select URLs (LS or LAW)
• Determine crawl strategy, create a seed list for crawling (OSI)
• Sites harvested by Internet Archive or in-house crawlers (OSI)
• Quality review (OSI & curators)
• Create “catalogers list” (OSI) and XML MODS template (LS) for metadata extraction

Describing the Archives
• Collection-level MARC record in OPAC
• Item-level MODS records in LCWA
  – One record per recommended URL for each distinct collection
• With so many thousands of URLs to process, how do we streamline the process?

XML MODS Template

Metadata Extraction
• For each URL that will be cataloged:
  – Get archived web site metadata
  – Combine with URL Nominations Database metadata
  – If elections/campaign web site, metadata also pulled from our candidate Access database (used to create subject terms)
• Using XML template, we add collection and record level metadata
• Create a single file for delivery
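The steps above can be sketched in Python. This is a minimal illustration, not LC's actual tooling: the dictionary keys (`url`, `language`, `subjects`, `title`, `abstract`, `first_capture`, `last_capture`) and the function names are assumptions, and the code builds MODS elements directly rather than filling LC's XML template, which is not reproduced here. Element names follow the public MODS v3 schema.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)


def q(tag):
    """Return the namespace-qualified name of a MODS element."""
    return f"{{{MODS_NS}}}{tag}"


def build_record(collection, nomination, archived):
    """Combine the data sources for one URL into one MODS record.

    All dictionary keys are hypothetical field names for this sketch.
    """
    mods = ET.Element(q("mods"))

    # Record-level metadata taken from the archived web site.
    title_info = ET.SubElement(mods, q("titleInfo"))
    ET.SubElement(title_info, q("title")).text = archived["title"]
    ET.SubElement(mods, q("abstract")).text = archived["abstract"]
    origin = ET.SubElement(mods, q("originInfo"))
    ET.SubElement(origin, q("dateCaptured"), point="start").text = archived["first_capture"]
    ET.SubElement(origin, q("dateCaptured"), point="end").text = archived["last_capture"]

    # Record-level metadata taken from the nominations data.
    language = ET.SubElement(mods, q("language"))
    ET.SubElement(language, q("languageTerm"), type="code").text = nomination["language"]
    for term in nomination["subjects"]:
        subject = ET.SubElement(mods, q("subject"))
        ET.SubElement(subject, q("topic")).text = term
    location = ET.SubElement(mods, q("location"))
    ET.SubElement(location, q("url")).text = nomination["url"]

    # Collection-level metadata, identical for every record in the collection.
    host = ET.SubElement(mods, q("relatedItem"), type="host")
    host_title = ET.SubElement(host, q("titleInfo"))
    ET.SubElement(host_title, q("title")).text = collection["title"]
    return mods


def build_delivery_file(collection, nominations, archived_by_url):
    """Wrap one record per cataloged URL in a single modsCollection document."""
    root = ET.Element(q("modsCollection"))
    for nomination in nominations:
        archived = archived_by_url.get(nomination["url"])
        if archived is None:
            continue  # no archived metadata for this URL, so no record
        root.append(build_record(collection, nomination, archived))
    return ET.tostring(root, encoding="unicode")
```

Calling `build_delivery_file` with one collection dictionary, the list of nominated URLs, and a lookup of archived metadata keyed by URL returns one XML string, which corresponds to the "single file for delivery" step.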

Data Sources for Metadata Extraction

URL Nominations Database
• URL
• Access Rights
• Language(s)
• Category
• Subject Terms

Election Candidate Metadata
• Name
• URL
• Party Affiliation
• State
• Race
• District (House)
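As a rough illustration of how candidate fields like these might be turned into subject terms, the sketch below assembles a short list of strings from one candidate row. The key names and the wording of the generated terms are assumptions made for this example; the slides do not show the subject term format LC actually uses.

```python
def candidate_subject_terms(candidate):
    """Build illustrative subject terms from one candidate row.

    Keys ("name", "party", "state", "race", "district") and the term
    wording are hypothetical, not LC's actual subject heading format.
    """
    terms = [candidate["name"], candidate["party"]]

    # Describe the contest; only House races carry a district number.
    contest = f"{candidate['state']} {candidate['race']}"
    if candidate["race"] == "House" and candidate.get("district"):
        contest += f", District {candidate['district']}"
    terms.append(contest)
    return terms
```

In a fuller pipeline these terms would be appended to the subject terms that already come from the nominations data before the record is written.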

Archived Web Site Metadata
From 1st capture:
• Document Title
• Keywords
• Abstract
• Mime Types
From Wayback index:
• Capture Dates (First & Last)
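One way to pull these fields from an archived page is sketched below with Python's standard library only. It assumes the title comes from the HTML `<title>` element, the keywords and abstract from `<meta name="keywords">` and `<meta name="description">`, and that capture dates are available as 14-digit Wayback-style timestamps (`YYYYMMDDhhmmss`). The slides do not say which page elements or index format LC reads, so treat these mappings and all function names as assumptions. Mime types are left out of the sketch.

```python
from datetime import datetime
from html.parser import HTMLParser


class HeadMetadataParser(HTMLParser):
    """Collect the <title> text and selected <meta> tags from a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("keywords", "description"):
                self.meta[name] = attributes.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_page_metadata(html):
    """Return title, keywords and abstract from the HTML of one capture."""
    parser = HeadMetadataParser()
    parser.feed(html)
    parser.close()
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "keywords": keywords,
        "abstract": parser.meta.get("description", "").strip(),
    }


def capture_date_range(timestamps):
    """Return (first, last) capture dates as ISO strings.

    Expects 14-digit timestamps; the list does not need to be sorted.
    """
    parsed = sorted(datetime.strptime(t, "%Y%m%d%H%M%S") for t in timestamps)
    return parsed[0].date().isoformat(), parsed[-1].date().isoformat()
```

Pages that lack a title or meta tags yield empty strings and an empty keyword list here, so a cataloger reviewing the output can see which records still need manual description.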

Combined Data in Template


Number of slides: 14