Publisher Name Authority Project: An Attempt to Enhance Data Mining for Collection Analysis & Comparison
Lynn Silipigni Connaway, Consulting Research Scientist III
Akeisha Heard, Technical Intern
XXV Annual Charleston Conference, 04 November 2005

Introduction

Research Goals
§ Develop a service to support advanced collection intelligence
§ Cluster collected objects based on their issuing entity
  • As can be determined via metadata about the objects
  • Gain intelligence about the nature of individual publishers
  • Collection intelligence
  • Acquisition patterns
  • User behavior

Research Objectives
§ Resolve
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form
§ Capture and make available for use various attributes of individual publishers
  • Location of publisher
  • Language(s) of materials published
  • Genre(s)/format(s) of materials published
  • Dominant subject domain(s) of the publisher's output
  • Parent company and subsidiaries

Theoretical Foundation: Authority Control
§ Adhere to authorized form
  • Personal names
  • Corporate entities
§ Why no authorized form for publishing entities?

Pragmatic Foundation: Collection Development
§ Identified publisher series
  • Retrospective conversion project (1984)
§ Family tree
  • Which publishers are related?
§ Approval plans
  • Which publishers publish which subjects?

Pragmatic Foundation: OCLC WorldCat Data Mining
§ Collection Analysis
  • Which libraries have the most items by a publisher in a particular subject area?
  • How do library holdings by publisher compare?
§ E-books for a particular STM publisher (2000)
  • Cataloged as reproductions
  • 2 publishers!

Pragmatic Foundation: Citation Analysis
§ Sweetland (1989)
  • Reader functions of citations
  • Information retrieval via citation databases
  • Document retrieval
  • Includes interlibrary loan verification
  • Bibliometrics
  • Faculty and researcher productivity measure
§ Other functions
  • Creation of references/bibliographies

Pragmatic Foundation: Education for Librarians
§ Collection development & acquisitions librarian education
  • Subject focuses of publishers
  • Parent and subsidiary relationships

Specialized Corporate Authority Files
§ ACOLIT (Ruggeri, 2004)
  • Names, uniform titles, Italian and international Catholic institutions, Catholic religious communities, and institutions
  • Related to the Catholic Church, Papal State, and Vatican City State
§ COPAR (Boddaert, 2004)
  • French official corporate bodies
  • Mainly national and preceding the French Revolution
§ CORELI (Boddaert, 2004)
  • Religious corporate bodies from 3 French ancient specialized catalogues

Specialized Corporate Authority Files
§ Chinese Modern Authority Database (Hu, Tam & Lo, 2004)
  • Chinese authors of expanded works and Chinese corporate bodies since 1912
§ Chinese Name Authority Database (Hu, Tam & Lo, 2004)
  • Mainly Taiwanese personal names with some Taiwanese corporate bodies

Specialized Corporate Authority Files
§ Case study by Elias & Fair (1983)
  • Standard Oil Co.'s Media Query File
  • No authority control
  • 3 professionals in 6 months averaged 12 telephone calls/day from reporters
  • Decided against canonical list for media names
  • Noted 20 unique variants for Wall Street Journal, including WSJ, Wall St. Jnl, Wall Street Jnl

Specialized Corporate Authority Files
§ Case study by French, Powell & Schulman (1997, 2000)
  • Smithsonian Astrophysical Observatory's Astrophysics Data System database
  • Programmatically identify author affiliations and map variant names to canonical name
  • Investigated various techniques separately and iteratively to bring variants together, including:
    • Lexical cleanup
    • Data clustering algorithms
    • Approximate string-matching
  • Reduced number of unique strings by 55%
  • Required manual review of clusters
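
The lexical cleanup and approximate string-matching steps named above can be sketched roughly as follows. This is a minimal illustration, not the method French, Powell & Schulman used: the function names, the greedy grouping strategy, and the 0.8 threshold are all assumptions, and the similarity measure is Python's standard-library `difflib.SequenceMatcher`.

```python
import re
from difflib import SequenceMatcher


def lexical_cleanup(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two names after lexical cleanup."""
    return SequenceMatcher(None, lexical_cleanup(a), lexical_cleanup(b)).ratio()


def group_variants(names, threshold=0.8):
    """Greedy grouping (hypothetical strategy): attach each name to the first
    group whose first member is similar enough, else start a new group."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


variants = [
    "Oxford University Press",
    "Oxford university press.",
    "OXFORD UNIVERSITY PRESS",
    "Springer",
]
for group in group_variants(variants):
    print(group)
```

A fixed threshold will both miss heavily abbreviated variants (such as "WSJ") and occasionally merge unrelated names, which is consistent with the slide's note that clusters still required manual review.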

Database Quality

Literature: Database Quality
§ Review by O'Neill & Vizine-Goetz (1988)
  • Busch (1981)
    • < 35% of 141 OCLC libraries routinely reported errors
  • Pollock & Zamora (1983)
    • Noted misspellings comprise 90-96% of errors & include:
      • Omission
      • Insertion
      • Substitution
      • Transposition
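
The four misspelling types listed above can each be described as a single edit to a correctly spelled word. The sketch below classifies a typed form against a known correct form; the function name `classify_typo` and the "other" fallback are illustrative assumptions, not part of Pollock & Zamora's study.

```python
def classify_typo(correct, typed):
    """Return which single-edit error turns `correct` into `typed`.

    Returns "omission", "insertion", "substitution", "transposition",
    "none" (identical strings), or "other" (more than one edit).
    """
    if correct == typed:
        return "none"
    if len(typed) == len(correct) - 1:
        # One character of the correct form was left out.
        if any(correct[:i] + correct[i + 1:] == typed for i in range(len(correct))):
            return "omission"
    if len(typed) == len(correct) + 1:
        # One extra character was typed.
        if any(typed[:i] + typed[i + 1:] == correct for i in range(len(typed))):
            return "insertion"
    if len(typed) == len(correct):
        diffs = [i for i in range(len(correct)) if correct[i] != typed[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and correct[diffs[0]] == typed[diffs[1]]
                and correct[diffs[1]] == typed[diffs[0]]):
            # Two adjacent characters were swapped.
            return "transposition"
    return "other"


print(classify_typo("Institute", "Institite"))
print(classify_typo("Tallahassee", "Tallahasee"))
```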

Literature: Database Quality
§ Intner (1989)
  • Reviewed 215 matching records in OCLC and RLIN
  • Errors relating to publishers:

  Error type                     OCLC count (total)   OCLC %   RLIN count (total)   RLIN %
  Application of AACR2 & LCRI    64 (205)             31.2     52 (191)             27.2
  MARC tagging in 260 field      4 (25)               16.0     3 (26)               11.5
  Typographic errors             4 (32)               12.5     6 (45)               13.3

Literature: Database Quality
§ Romero (1994)
  • Evaluated cataloging of library science students
  • Noted 221 errors (28.22%) in the publisher description area

Issues: Historical Practices
§ Different rules for abbreviations
  • LC Rule Interpretation B.14
    • State postal (2-letter) abbreviation if it appears in the item along with the place
  • Anglo-American Cataloguing Rules, Revised (2002)
    • Abbreviations included in Appendix B.14

Issues: Historical Practices
§ ALA Catalog Rules (1941)
  • Multiple places of publication and publishers and neither or first is prominent
    • Include first listed first, indicate omission
  • Multiple places of publication and publishers and first is not prominent
    • Include prominent first
    • Include first listed second
  • Unknown place of publication – [n.p.]

Issues: Historical Practices
§ Anglo-American Cataloging Rules (1967)
  • Multiple places of publication and publishers and neither or first is prominent
    • Include first listed only, omit others
  • Multiple places of publication and publishers and first is not prominent
    • Include prominent only, omit others
  • Unknown place of publication – [n.p.]

Issues: Historical Practices
§ Anglo-American Cataloguing Rules, Revised (2002)
  • Multiple places of publication and publishers and neither or first is prominent
    • Include first listed only, omit others
  • Multiple places of publication and publishers and first is not prominent
    • Include first listed first
    • Include prominent second
  • Unknown place of publication – [S.l.]

Issues: Historical and Local Practices
§ “u.a.”
  • At least one German institution uses “u.a.” as mark of omission
  • Means “et al.”
  • Not an AACR2r rule
  • Local practice?
  • Is local practice/policy an error?

Issues: Historical and Local Practices
§ WorldCat enhanced records
  • Eliminate or lessen the probability of these issues

Examining Quality of WorldCat

WorldCat: Publisher Name Selection Criteria
§ Fixed field lang = “eng”

WorldCat: ISBN Validation Errors
§ WorldCat records with ISBNs: 22.69%

WorldCat: ISBN Validation Errors

  Records            Valid                  Invalid
  English Language   7,561,445 (99.90%)     7,600 (0.10%)
  All Languages      13,147,325 (99.88%)    15,654 (0.12%)
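
The deck does not say which validation was run against the ISBNs. One common test, shown here only as an assumed example, is the ISBN-10 check digit: weight the ten characters 10 down to 1, treat a final "X" as 10, and require the sum to be divisible by 11. The helper `invalid_share` simply recomputes the percentages in the table from its counts; both function names are hypothetical.

```python
def is_valid_isbn10(raw):
    """Check an ISBN-10 by its check digit, ignoring hyphens and spaces."""
    chars = [c for c in raw.upper() if c not in "- "]
    if len(chars) != 10:
        return False
    total = 0
    for position, ch in enumerate(chars):
        if ch == "X" and position == 9:
            value = 10  # "X" is only legal as the final check character
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def invalid_share(valid, invalid):
    """Percentage of invalid ISBNs among all ISBNs examined, to 2 decimals."""
    return round(100 * invalid / (valid + invalid), 2)


print(is_valid_isbn10("0-19-852663-6"))
print(invalid_share(7_561_445, 7_600))
print(invalid_share(13_147_325, 15_654))
```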

WorldCat: MARC Tagging Errors
§ Examined English language records based on some known issues and manual evaluation
§ Total MARC tagging errors found: 11,874 (0.03%)

WorldCat: MARC Tagging Errors
§ MARC 260 vs 300 tagging
  • In 260 field, information from 300 field in $a, $b, $c and/or $e
§ Dates tagging
  • Date in $a or $b
  • Five digit year
  • “cm” follows year
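
The three date-tagging checks above lend themselves to simple pattern tests. The sketch below is an assumption about how such checks might look, not the project's actual code: it represents a 260 field as a plain dictionary from subfield code to text (a hypothetical structure), and it treats only 1500-2099 as plausible years.

```python
import re

# A plausible four-digit year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(?:1[5-9]|20)\d{2}(?!\d)")
# Exactly five consecutive digits, e.g. a mistyped year such as "19955".
FIVE_DIGITS = re.compile(r"(?<!\d)\d{5}(?!\d)")
# Four digits followed later by "cm": physical description (300) data
# that has leaked into the date subfield.
YEAR_THEN_CM = re.compile(r"\d{4}.*\bcm\b", re.IGNORECASE)


def check_260(subfields):
    """Return a list of suspected tagging problems for one 260 field.

    `subfields` maps subfield codes ("a", "b", "c") to their text.
    """
    problems = []
    for code in ("a", "b"):
        if YEAR.search(subfields.get(code, "")):
            problems.append("date in $" + code)
    date = subfields.get("c", "")
    if FIVE_DIGITS.search(date):
        problems.append("five digit year")
    if YEAR_THEN_CM.search(date):
        problems.append("cm follows year")
    return problems


print(check_260({"a": "Paris :", "b": "Gallimard,", "c": "1972. 21 cm."}))
```

A publisher name that legitimately contains a four-digit number would be flagged by the first check, so hits from patterns like these would still need manual evaluation.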

WorldCat: Typographical Errors
§ Used “Typographical Errors in Library Databases” to identify and quantify English language WorldCat errors (Ballard, 2005)
  • Total errors: 26,599 (0.08%)
  • Require manual examination to determine if actual errors
  • Searching for Institi*
    • Misspelled:
      • American Institite of Physics
      • British Standards Institition
    • Spelled correctly:
      • Institiúid Ard-Léinn Bhaile Átha Cliath (Dublin Institute for Advanced Studies)
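
The Institi* example shows why a truncation search only produces candidates. A minimal sketch of such a search over publisher strings follows; the function names are illustrative, and the translation of a trailing `*` into a word-prefix regular expression is an assumption about how the search behaves.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a trailing-* truncation search into a word-prefix regex."""
    stem = pattern.rstrip("*")
    return re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)


def find_candidates(pattern, names):
    """Return (name, matched word) pairs; every hit still needs manual review."""
    regex = wildcard_to_regex(pattern)
    hits = []
    for name in names:
        match = regex.search(name)
        if match:
            hits.append((name, match.group(0)))
    return hits


publishers = [
    "American Institite of Physics",
    "British Standards Institition",
    "Institiúid Ard-Léinn Bhaile Átha Cliath",
    "Institute of Physics",
]
for name, word in find_candidates("Institi*", publishers):
    print(word, "<-", name)
```

The search returns the Irish "Institiúid" alongside the two genuine misspellings, so the pattern alone cannot separate errors from correct forms.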

WorldCat: Typographical Errors
§ Top words (10.4%):

  Word                         Probability according to Ballard   Error type      WorldCat count
  Worchester                   Highest                            Insertion       398
  Metheun                      High                               Transposition   355
  Universt*                    Highest                            Omission        299
  Unives*                      Highest                            Omission        275
  Westminister [and] Press     Highest                            Insertion       266
  Niagr*                       High                               Omission        260
  Phildel*                     High                               Omission        235
  Tallahasee                   High                               Omission        234
  John Hopkins Press           Highest                            Omission        227
  Institi*                     High                               Substitution    226

WorldCat: Typographical Errors
§ “Westminister”
  • Only included on Ballard list in combination with other words
  • Total errors in WorldCat: 628 (2.36%)
  • Require manual review

Where are we now?

WorldCat: MARC 260 Evaluation
§ Top 10 terms in 260 $b in WorldCat

  Term           Count
  press          2,094,111
  co             1,664,005
  university     1,550,435
  dept           1,084,647
  pub            984,234
  research       853,954
  service        710,314
  institute      660,346
  office         649,794
  chu ban she    620,735

WorldCat: MARC 260 Evaluation
§ University Press names in 260 $b in WorldCat

  Term         Count
  oxford       35,804
  hopkins      22,564
  cambridge    21,951
  harvard      17,069
  cornell      11,305
  stanford     10,900
  purdue       5,468
  yale         5,076
  princeton    4,746
  rutgers      3,854

Clustering
§ Attempting programmatic clustering of publishers using ISBN prefixes
  • Data clustering (The Free Dictionary)
    • "The science of extracting useful information from large data sets or databases"
    • Classification of similar objects into different groups
    • Partitioning of a data set into subsets (clusters)
    • Data in each subset (ideally) share some common trait

WorldCat: Clustering Example
§ Used ISBN prefix 019 (Oxford University Press)
  • Total WorldCat records: 58,004,317
  • Records with ISBN prefix 019: 84,276 (0.15%)
  • Non-unique publisher names from ISBN prefix records: 91,528

                                            One or more 019 ISBN   All 019 ISBNs
  NACO normalized unique publisher names    1,550                  1,386
  Number of clusters                        919                    799
  Non-singleton clusters                    222 (24.16%)           205 (25.66%)
  Largest cluster                           82 text strings        81 text strings
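
To make the idea of grouping publisher strings under an ISBN prefix concrete, here is a rough sketch. It is not the project's algorithm: `rough_normalize` is a simplified stand-in for NACO normalization (the real rules are more detailed), the input records are invented for illustration, and the sketch only groups strings that share a prefix and a normalized form, whereas the figures above imply a further step that merges distinct normalized names.

```python
import re
import unicodedata
from collections import defaultdict


def rough_normalize(name):
    """Simplified stand-in for NACO normalization (an assumption):
    strip diacritics, turn punctuation into spaces, uppercase, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^A-Za-z0-9]+", " ", stripped)
    return " ".join(cleaned.upper().split())


def cluster_by_prefix(records):
    """Group (isbn_prefix, publisher_string) pairs.

    Returns {prefix: {normalized_name: set_of_original_strings}}.
    """
    clusters = defaultdict(lambda: defaultdict(set))
    for prefix, publisher in records:
        clusters[prefix][rough_normalize(publisher)].add(publisher)
    return clusters


def non_singleton_share(cluster_map):
    """Return (non-singleton clusters, all clusters, percentage to 2 decimals)."""
    total = len(cluster_map)
    multi = sum(1 for strings in cluster_map.values() if len(strings) > 1)
    share = round(100 * multi / total, 2) if total else 0.0
    return multi, total, share


# Invented sample records, for illustration only.
records = [
    ("019", "Oxford University Press"),
    ("019", "Oxford University Press,"),
    ("019", "OXFORD UNIVERSITY PRESS"),
    ("019", "Clarendon Press"),
]
clusters = cluster_by_prefix(records)
print(non_singleton_share(clusters["019"]))
```

The same percentage calculation applied to the table's counts reproduces its figures: 222 of 919 clusters is 24.16%, and 205 of 799 is 25.66%.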

Challenges: Publisher Name Authority File
§ Quality issue
  • Level of acceptance for cluster
  • What is acceptable?
§ Subsidiaries and Relationships
  • Oxford & Auckland
  • Examined manually to determine relationship
§ Form of name
  • What is acceptable?
  • Likely to use the most prominent form of name

Questions and Discussion
Contact Information: connawal@oclc.org, hearda@oclc.org
Project Web Site: http://www.oclc.org/research/projects/publisherns/