ddf1c6615e2c546463474bcdf8edb34c.ppt
- Number of slides: 40
FRBR: Algorithms and Applications. T. Hickey, J. Toves, D. Vizine-Goetz. Online Computer Library Center. CLA, November 2004
Outline § Algorithms • FRBR work matching • Handling author-title variants § Hardware • Beowulf cluster § Applications • Bookmarklets • FictionFinder § Future directions
Working with Group 1 Entities WEMI: Work Expression Manifestation Item § Strict expression-level determination is hard • We primarily divide by language § Manifestation is easier • We use the WorldCat master record
Work Identification § Algorithm goals: • Efficient • Understandable • Controllable by catalogers • Uses existing WorldCat records
The Algorithm § A key is generated for each record § Extract author, title • Look up in LC name authority file • Added entry information as needed § Form a key from bibliographic record • Author, title, added entry information • These can be sorted, compared
Example
  146  Smollett1721  Expedition of Humphry Clinker
   16  Smollett1721  Expedition of Humphrey Clinker
    8  Smollett1721  Humphry Clinker
    4  Smollett1721  Humphrey Clinker
    2  Smollett1721  Expedition of Humphry Clinker
    1  Smollett1721  Calatoriile lui Humphrey Clinker
    1  Smollet1721   Expedition of Humphry Clinker
    1  Smollett      Humphry Klinkers Reisen
Example (with authorities)
  156  Smollett1721  Expedition of Humphry Clinker
   16  Smollett1721  Expedition of Humphrey Clinker
    4  Smollett1721  Humphrey Clinker
    1  Smollett1721  Calatoriile lui Humphrey Clinker
    1  Smollet1721   Expedition of Humphry Clinker
    1  Smollett1721  Humphry Klinkers Reisen
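The effect of the authority lookup on the counts can be sketched as follows. This is a minimal illustration, not the authors' code: the two cross-reference tables (`AUTHOR_XREF`, `TITLE_XREF`) are hypothetical stand-ins for the LC name authority file, filled in only so that the counts from the first example collapse into the counts of the second.

```python
from collections import Counter

# (count, author key, title key) rows as listed on the first example slide.
# The slide lists the "Expedition of Humphry Clinker" key twice (146 and 2);
# whatever distinguished the two is not recoverable from the extracted text.
RAW_ROWS = [
    (146, "Smollett1721", "Expedition of Humphry Clinker"),
    (16, "Smollett1721", "Expedition of Humphrey Clinker"),
    (8, "Smollett1721", "Humphry Clinker"),
    (4, "Smollett1721", "Humphrey Clinker"),
    (2, "Smollett1721", "Expedition of Humphry Clinker"),
    (1, "Smollett1721", "Calatoriile lui Humphrey Clinker"),
    (1, "Smollet1721", "Expedition of Humphry Clinker"),
    (1, "Smollett", "Humphry Klinkers Reisen"),
]

# Hypothetical cross-references standing in for authority records.
AUTHOR_XREF = {"Smollett": "Smollett1721"}
TITLE_XREF = {
    ("Smollett1721", "Humphry Clinker"): "Expedition of Humphry Clinker",
}


def apply_authorities(rows):
    """Map each raw key to its established form and sum the counts."""
    merged = Counter()
    for count, author, title in rows:
        author = AUTHOR_XREF.get(author, author)
        title = TITLE_XREF.get((author, title), title)
        merged[(author, title)] += count
    return merged


merged = apply_authorities(RAW_ROWS)
for (author, title), count in merged.most_common():
    print(f"{count:>4}  {author:<13} {title}")
```

Variants that have no cross-reference, such as the misspelled author key `Smollet1721`, stay in their own group, which is the behaviour the second example shows.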
More Detail § Extract author names • Look up in authority file • Currently only personal names • Subfields $abcdq § Extract title • Always use uniform titles if present • Look up author/short title (~$a) • Look up author/long title (~$abfgnp) • Prefer alternative title for non-English § Create key from author/title • Always do NACO normalization (has limitations) • Add information for uncontrolled title-main-entry
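A rough sketch of the key-building step described above. The normalization here is a simplified approximation and does not implement the full NACO normalization rules; the function names, the `/` separator, and the argument layout are assumptions made for illustration only.

```python
import re
import unicodedata


def simple_normalize(text: str) -> str:
    """Simplified stand-in for NACO normalization (assumption, not the
    real rules): strip diacritics, lowercase, turn punctuation into
    spaces, and collapse runs of whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = re.sub(r"[^\w\s]", " ", no_marks.lower())
    return " ".join(cleaned.split())


def make_key(author: str, title: str, uniform_title: str = "") -> str:
    """Build an author/title key, preferring the uniform title if present."""
    chosen_title = uniform_title if uniform_title else title
    return f"{simple_normalize(author)}/{simple_normalize(chosen_title)}"


print(make_key("Smollett, Tobias George, 1721-1771",
               "The Expedition of Humphry Clinker"))
print(make_key("Smollett, Tobias George, 1721-1771",
               "Călătoriile lui Humphrey Clinker",
               uniform_title="Expedition of Humphry Clinker"))
```

Because the keys are plain strings, they can be sorted and compared with ordinary text tools, which is what makes the later sort and collapse steps cheap.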
Authority Files Rule! § Authors § Author/titles § Bring together variations § Allow override in difficult cases • Both splitting and joining groups • Especially important with xISBN matching § Especially important with non-English metadata
Limitations of the Authority File § What’s missing: • Many uniform titles • Many author variants • Many title variants • Language of heading § Partial solution • Create auxiliary files of mechanically generated matches
Results of FRBR Matching on WorldCat § 88% of manifestations are ‘singletons’ § 30% of manifestations are in 12% of the works § Average size of multiple matches: 3.1 manifestations/work § 43.1 million works in 54 million manifestations § 54% of holdings on a FRBR work with >1 manifestation § WorldCat manifestations average about 20 holdings § FRBR helps where help is most needed
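The figures on this slide can be cross-checked with a little arithmetic. The check below assumes that the 88% singleton figure is a share of works (each singleton work holding exactly one manifestation); that reading is an assumption, but under it the 30% and 3.1 figures both follow from the stated totals.

```python
# Totals from the slide, in millions.
works_m = 43.1
manifestations_m = 54.0

# Assumption: 88% of *works* are singletons (one manifestation each).
singleton_share_of_works = 0.88

singleton_works_m = singleton_share_of_works * works_m
multi_works_m = works_m - singleton_works_m
multi_manifestations_m = manifestations_m - singleton_works_m

# Share of all manifestations that sit in multi-manifestation works.
share_in_multi = multi_manifestations_m / manifestations_m
# Average number of manifestations per multi-manifestation work.
avg_size = multi_manifestations_m / multi_works_m

print(f"share of manifestations in the remaining 12% of works: {share_in_multi:.0%}")
print(f"average manifestations per multi-manifestation work: {avg_size:.1f}")
```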
More FRBR Results § 310,000 works have more than 5 manifestations § 1.7 million have more than 2 manifestations § Largest: 30,000+ for the Bible § 1,537 for Shakespeare’s Macbeth § 1,026 for Dickens’s Christmas Carol
The Top 10 Works by Holdings
      Work                                        Holdings  Manif’s
   1  US Census (various)                          403,252   10,164
   2  Bible (combined)                             271,534   36,738
   3  Mother Goose                                  66,543    1,997
   4  Dante, The Divine Comedy                      59,034    2,714
   5  Homer, The Odyssey                            43,871    2,009
   6  Homer, The Iliad                              42,756    2,388
   7  Twain, Huckleberry Finn                       39,310    1,093
   8  Shakespeare, Hamlet                           37,683    1,917
   9  Carroll, Alice’s Adventures in Wonderland     37,614    1,865
  10  Tolkien, Lord of the Rings                    37,461      643
The Top 10 Works Cataloged in 2003
      Work                                                  Libraries
   1  Rowling, Harry Potter and the Order of the Phoenix       2,406
   2  Clinton, Living History                                  36,738
   3  Rohmann, My Friend Rabbit                                 1,997
   4  Brown, The Da Vinci Code                                  2,714
   5  Gibaldi, MLA Handbook                                     2,009
Top 1000 Publication Dates
Top 1000 Languages
Our Beowulf Cluster § 24 Nodes • Each with 2 x 2.6 GHz processors • 4 GBytes memory (96 GBytes total) § One ‘head’ node, 23 ‘compute’ nodes § 46 x 40 GBytes disk (~2 Terabytes total) § Gigabit switch
What we are using it for § All our bibliographic processing • FRBR • Extractions • Searching • Matching
Ganglia load visualization
Starting point § FRBR key generation § 25 hours on a 3.00 GHz workstation with 2 GB of RAM § Generate two key files • sort by key, uniq by key, sort by occurrence • sort by key, post processing on keys, uniq by key, sort by occurrence § Merge key files
FRBR on the Cluster § 44 minutes on the cluster § 69 key builders & 23 sort buckets with hyperthreading ON § Generate 23 radix-sorted, post-processed key files § Collapse and sort by occurrence in parallel § Also outputs additional files used by other jobs
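The bucket-then-collapse idea can be sketched on a single machine as below. This is only an illustration of the technique under stated assumptions: the bucketing rule (leading UTF-8 byte scaled to 23 buckets) and the function names are invented here, and the real job's key builders, file layout, and post-processing are not shown on the slides.

```python
from itertools import groupby

N_BUCKETS = 23  # the slide mentions 23 sort buckets


def bucket_of(key: str, n: int = N_BUCKETS) -> int:
    """Radix-style bucket from the key's leading UTF-8 byte (assumed rule).
    The mapping is monotonic, so bucket order follows key order."""
    first_byte = key.encode("utf-8")[0] if key else 0
    return first_byte * n // 256


def collapse(keys, n: int = N_BUCKETS):
    """Distribute keys to buckets, sort each bucket, and collapse
    duplicates into (key, count) pairs. Each bucket is independent,
    so on a cluster each could be handled by a separate node."""
    buckets = [[] for _ in range(n)]
    for key in keys:
        buckets[bucket_of(key, n)].append(key)
    counted = []
    for bucket in buckets:
        bucket.sort()
        counted.extend((key, sum(1 for _ in group))
                       for key, group in groupby(bucket))
    return counted


def by_occurrence(counted):
    """Most frequent keys first; ties broken by key."""
    return sorted(counted, key=lambda pair: (-pair[1], pair[0]))


sample = ["z/q", "1/a", "b/x", "b/x"]
print(collapse(sample))
print(by_occurrence(collapse(sample)))
```

For scale, the two timings on the slides (25 hours on the workstation, 44 minutes on the cluster) correspond to a speedup of roughly 34 times.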
Application: Preservation § Identify ‘final copy’ items § Do it at the work level § Single-singles • Single manifestations with single holding • Found 18 million in WorldCat
Application: xISBN § A simple Web service § Given an ISBN: • Identify the workset it is in • Return all other ISBNs in that workset § Results should be symmetrical! • Same group retrieved for each ISBN in group § ISBNs sorted by number of library holdings
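The lookup described above can be sketched with two in-memory tables. This is a hedged illustration, not the service's implementation: the workset identifier and the holdings counts are made up, and returning the whole group (including the queried ISBN) is an assumption chosen because it makes the symmetry requirement hold trivially.

```python
def normalize_isbn(isbn: str) -> str:
    """Drop hyphens and spaces so 0-19-281664-0 and 0192816640 compare equal."""
    return isbn.replace("-", "").replace(" ", "").upper()


# Hypothetical workset table: workset id -> [(ISBN, library holdings)].
# The ISBNs appear in the xISBN example; the holdings counts are invented.
WORKSETS = {
    "w1": [("0192816640", 120), ("0140430210", 300), ("0393952835", 45)],
}

# Reverse index: ISBN -> workset id.
ISBN_INDEX = {
    normalize_isbn(isbn): workset_id
    for workset_id, members in WORKSETS.items()
    for isbn, _holdings in members
}


def xisbn(isbn: str):
    """Return every ISBN in the same workset, most widely held first."""
    workset_id = ISBN_INDEX.get(normalize_isbn(isbn))
    if workset_id is None:
        return []
    ranked = sorted(WORKSETS[workset_id], key=lambda m: (-m[1], m[0]))
    return [normalize_isbn(member_isbn) for member_isbn, _ in ranked]


print(xisbn("0-19-281664-0"))
```

Because the result depends only on the workset, every ISBN in a group retrieves the same list in the same order.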
xISBN Example
http://labs.oclc.org/xisbn/0-19-281664-0 returns:
<?xml version="1.0" encoding="UTF-8"?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
  <isbn>0393015920</isbn>
  <isbn>0393952274</isbn>
  <isbn>0393952835</isbn>
  <isbn>0140430210</isbn>
  <isbn>0192811320</isbn>
  <isbn>0192835947</isbn>
  <isbn>0460872885</isbn>
  <isbn>1853262706</isbn>
  <isbn>0874131219</isbn>
</idlist>
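A client could read a response of this shape with the Python standard library. The sketch below parses a shortened copy of the response as a byte string rather than calling the URL, since the availability of the service is not something the slides establish; `parse_idlist` is a name chosen here for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the response shown on the slide.
RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<idlist>
  <isbn>0192816640</isbn>
  <isbn>0820312037</isbn>
  <isbn>0820315370</isbn>
</idlist>"""


def parse_idlist(xml_bytes: bytes):
    """Return the ISBNs in document order, i.e. in the order the
    service ranked them."""
    root = ET.fromstring(xml_bytes)
    return [el.text.strip() for el in root.findall("isbn") if el.text]


print(parse_idlist(RESPONSE))
```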
Matching on ISBNs § ISBNs provide additional information beyond author/title • Allows relaxation of matching • Introduces possible errors § Offers the possibility of substantial improvement of work matching
Merging Worksets Using ISBN Matches § Pair ISBNs with FRBR keys (Starts with 10 million ISBNs) § Throw out ISBNs in single worksets § Throw out ISBNs in > 5 worksets (We now have 561,000 ISBNs left) § Are the titles similar enough? § Throw out large groups § Try to be very conservative § Authority file always overrides other matching
Matches from ISBN Matching § 74,000 author variants § ~200,000 title variants § These all create additional cross reference records § Automatically folded into FRBR matching § Kept separate from NACO file • Only used in research at this time
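The filtering steps above can be sketched as a small pipeline. Everything beyond the two cut-offs named on the slide is an assumption: the title-similarity test uses `difflib.SequenceMatcher` with an invented threshold, the key layout (`author/title`) is illustrative, and the large-group and authority-override steps are left out.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

MAX_WORKSETS = 5            # from the slide: drop ISBNs in > 5 worksets
MIN_TITLE_SIMILARITY = 0.8  # hypothetical threshold, not from the slides


def candidate_merges(isbn_key_pairs):
    """Return pairs of FRBR keys that share an ISBN and have similar titles.

    isbn_key_pairs: iterable of (isbn, "author/title" key) tuples.
    """
    keys_by_isbn = defaultdict(set)
    for isbn, key in isbn_key_pairs:
        keys_by_isbn[isbn].add(key)

    merges = set()
    for keys in keys_by_isbn.values():
        # An ISBN in a single workset says nothing new; one in too many
        # worksets is more likely an error than evidence.
        if len(keys) < 2 or len(keys) > MAX_WORKSETS:
            continue
        for key_a, key_b in combinations(sorted(keys), 2):
            title_a = key_a.split("/", 1)[-1]
            title_b = key_b.split("/", 1)[-1]
            similarity = SequenceMatcher(None, title_a, title_b).ratio()
            if similarity >= MIN_TITLE_SIMILARITY:
                merges.add((key_a, key_b))
    return sorted(merges)


pairs = [
    ("0192816640", "smollett1721/expedition of humphry clinker"),
    ("0192816640", "smollett1721/expedition of humphrey clinker"),
    ("0140430210", "smollett1721/expedition of humphry clinker"),
]
print(candidate_merges(pairs))
```

A conservative threshold matches the slide's stated intent: a missed merge leaves two worksets separate, while a wrong merge joins unrelated works.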
Examples of Possible Matches
§ /mcgraw hill encyclopedia of science & technology1aar aor
§ /mcgraw hill encyclopedia of science & technology2apa boo
§ /mcgraw hill encyclopedia of science & technology3bor cle
§ /mcgraw hill encyclopedia of science & technology4cli cyt
§ …
§ dickens, charles1812 1870/tale of two cities
§ dickens, charles1812 1870/hard times
§ dickens, charles1812 1870/sketches by boz
§ dickens, charles1812 1870/martin chuzzlewit
§ dickens, charles1812 1870/bleak house
§ dickens, charles1812 1870/little dorrit
§ dickens, charles1812 1870/oliver twist
§ …
Application: Bookmarklets
Clicking on Princeton
FictionFinder § Indexes fiction from WorldCat § Uses FRBR workset algorithm § Focused on fiction § Searching and browsing by • Genre • Fictitious Characters • Imaginary Places • Literary Forms § Links to • Google • Open WorldCat § Diane Vizine-Goetz’s project
‘Humphry Clinker’ Search
Work Display
Detail of Language Display
First Few English Manifestations
Manifestation Display
Open WorldCat Link
Additional Matches § Match variant titles: • When the wind blows: a novel § FictionFinder identified 10,000 similar variations • novela, novella, roman, … § Created auxiliary authority records § Now automatically used when FRBR algorithm is run
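One way to generate such variant matches mechanically is to strip a generic form designation from the end of a title. The sketch below is an assumption about how that could be done, not the FictionFinder code; the word list contains only the designations named on the slide plus "novel" and "a novel" from its example.

```python
import re

# Designations named on the slide, plus the forms from its example.
GENERIC_DESIGNATIONS = ("a novel", "novel", "novela", "novella", "roman")

# Match a separator followed by a designation at the very end of the title.
_SUFFIX = re.compile(
    r"\s*[:;,\-]\s*(?:"
    + "|".join(re.escape(word) for word in GENERIC_DESIGNATIONS)
    + r")\s*$",
    re.IGNORECASE,
)


def strip_generic_subtitle(title: str) -> str:
    """Remove a trailing ': a novel'-style designation, if any."""
    return _SUFFIX.sub("", title).strip()


print(strip_generic_subtitle("When the wind blows: a novel"))
print(strip_generic_subtitle("Roman holiday"))
```

Requiring a separator before the designation keeps titles that merely begin with one of the words, such as "Roman holiday", untouched.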
Future § Continued development of FictionFinder § Extending algorithm to serials? § FirstSearch displays § Additional matching criteria § Local authority files? § Integration of auxiliary files for production? § Exploring FRBRizing some European catalogs § Looking at extending beyond Roman characters
Links § IFLA FRBR - Final Report • http://www.ifla.org/VII/s13/frbr.htm § Article in D-Lib • http://www.dlib.org/dlib/september02/hickey/09hickey.html § OCLC Research Activities with FRBR • http://www.oclc.org/research/projects/frbr/ § FictionFinder • http://fictionfinder.oclc.org/ § Top 1000 • http://www.oclc.org/research/top1000/