- Number of slides: 40
Approaches to automated metadata extraction: FixRep Project Emma Tonkin e.tonkin@ukoln.ac.uk UKOLN is supported by: www.ukoln.ac.uk A centre of expertise in digital information management www.bath.ac.uk
Wouldn't it be nice if... • ...computers could author our metadata for us, thus saving a lot of hassle? • Mechanical metadata extraction vs manual metadata input
But... • Automated tools are fallible • There's never quite enough information available • Templates change, and different domains have different standards • In short, computers are often wrong – and so are people
The 'half a loaf' hypothesis • Hybrid approach: – Get what metadata you can – Ask the user to check and clean it if necessary • Philosophy: – If the computer gets it wrong, we can fix it later
Wouldn’t it be nice if… • …computers could fix our metadata for us? • Or, more realistically, help us do this work for ourselves.
• All about ‘fixing it later’, doing what we can with what we have • Automated metadata extraction + metadata consistency assessment • Metadata generation, evaluation, characterisation: enabling metadata triage
1) Challenges in automated metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input 6) Towards metadata triage
Whatever can go wrong... • PDFs can be: – Encrypted – Corrupted – Oddly encoded – An image file without embedded text – Occurrence: ~3-6%
Character sets • Ligatures, accents and symbols may not always be extractable from PDFs • Image © Daniel Ullrich
Document formats/layouts • Many possible formats • Some formats not widely supported • Document layouts vary widely, esp. by discipline
1) Challenges in metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input 6) Towards metadata triage
Whatever can go wrong... (II) • Function following form – interface • Model adapted to suit unique user needs • Data model incompletely supported • Input validation issues • Systematic error; typos; localisation; encoding; etc. • Lots of past work in characterising manual input errors
1) Challenges in metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input
Image segmentation, templating & OCR
Working from text • There are a number of possible states (i.e. title, author, email, affiliation, abstract) • Directed graph with probabilities – Markov chain: for example, Title → Author → Email → Affil.
Hidden Markov Model • We cannot directly see these states – only the words • But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented • This may be expressed in terms of an HMM • Bayesian statistics used across term appearance
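As a minimal sketch of the idea on the two slides above, the following toy HMM labels each word of a header with its most likely hidden state using the Viterbi algorithm. Every probability, the `feature` function and the four-state set are invented for illustration; they are not the FixRep model, which would estimate such statistics from a collection of papers.

```python
import math

STATES = ["title", "author", "email", "affiliation"]

# Hypothetical start and transition probabilities (each row sums to 1).
START = {"title": 0.85, "author": 0.05, "email": 0.05, "affiliation": 0.05}
TRANS = {
    "title":       {"title": 0.70, "author": 0.28, "email": 0.01, "affiliation": 0.01},
    "author":      {"title": 0.01, "author": 0.60, "email": 0.29, "affiliation": 0.10},
    "email":       {"title": 0.01, "author": 0.09, "email": 0.30, "affiliation": 0.60},
    "affiliation": {"title": 0.01, "author": 0.09, "email": 0.10, "affiliation": 0.80},
}

# Hypothetical emission probabilities over coarse word features.
EMIT = {
    "title":       {"has_at": 0.01, "all_caps": 0.09, "capitalised": 0.50, "lower": 0.40},
    "author":      {"has_at": 0.01, "all_caps": 0.60, "capitalised": 0.35, "lower": 0.04},
    "email":       {"has_at": 0.94, "all_caps": 0.02, "capitalised": 0.02, "lower": 0.02},
    "affiliation": {"has_at": 0.01, "all_caps": 0.19, "capitalised": 0.60, "lower": 0.20},
}


def feature(token):
    """Map a word to the coarse observable symbol the model emits."""
    if "@" in token:
        return "has_at"
    if token.isupper() and len(token) > 1:
        return "all_caps"
    if token[:1].isupper():
        return "capitalised"
    return "lower"


def viterbi(tokens):
    """Return the most probable state sequence for a non-empty token list."""
    obs = [feature(t) for t in tokens]
    # Log-probabilities avoid underflow on longer sequences.
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for o in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][o])
            pointers[s] = best
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return list(reversed(path))


if __name__ == "__main__":
    # The email address is made up; the title and author come from the example slide.
    words = "Confirmation-Guided Discovery of First-Order Rules PETER A. FLACH flach@example.org".split()
    for word, label in zip(words, viterbi(words)):
        print(f"{label:12} {word}")
```

The segmentation falls out of the transition structure: once the model has moved from title to author it is very unlikely to return, so a single capitalised word inside a run of author names does not flip the label back.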
Example parse • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE • ... • Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE • Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection
1) Challenges in metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input 6) Towards metadata triage
Aims • Adaptation of existing interfaces • Enhancing rather than rewriting • Cross-platform, accessible interface • Simple reusable REST API, metadata as DC/XML
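To illustrate the "metadata as DC/XML" aim, here is a minimal sketch that serialises extracted fields as simple Dublin Core elements using only the Python standard library. The `metadata` wrapper element and the `to_dc_xml` helper are assumptions made for this example; the slides do not specify the service's actual response format or endpoints.

```python
import xml.etree.ElementTree as ET

# The Dublin Core element set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def to_dc_xml(record):
    """Serialise {dc_element_name: [values]} as a flat DC/XML string."""
    root = ET.Element("metadata")  # hypothetical wrapper element
    for element, values in record.items():
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{element}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


# Field values taken from the example parse slide.
record = {
    "title": ["Confirmation-Guided Discovery of First-Order Rules"],
    "creator": ["PETER A. FLACH", "NICOLAS LACHICHE"],
}
xml_text = to_dc_xml(record)
print(xml_text)
```

Keeping the payload this flat means an existing deposit form can map each `dc:` element to one of its own fields without being rewritten.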
Sample interfaces
Sample interfaces
Architecture
Using what we know...
1) Challenges in metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input 6) Towards metadata triage
Question: • “Do people accept ‘hybrid’ interfaces?” • Here’s one we did earlier…
Hypotheses • Correcting extracted metadata is faster than entering or cutting-and-pasting metadata. • The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct. • User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails. • Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction
Results: Timing • Hybrid faster under both conditions • (Summary of median times)
Results: Accuracy • Tested against ground truth • Keyword accuracy: the first keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top five cover 81% of all desired keywords. • Manual metadata accuracy: – Few users use cut and paste – Capitalisation and punctuation frequently differ – Synonyms are accidentally substituted • Hybrid closer to ground truth, and more complete, but results not clear-cut.
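Because manual entries often differ from the ground truth only in capitalisation or punctuation, a scorer needs to separate surface differences from substantive ones. The sketch below shows one way to do that with the standard library; the category names and the 0.8 similarity cut-off are arbitrary choices for illustration, not the scoring used in the study.

```python
import difflib
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalise(value):
    """Lower-case, strip punctuation and collapse whitespace."""
    return " ".join(value.lower().translate(_PUNCT).split())


def compare(entered, truth):
    """Classify an entered field value against its ground-truth value."""
    if entered == truth:
        return "exact"
    a, b = normalise(entered), normalise(truth)
    if a == b:
        return "surface"  # differs only in case, punctuation or spacing
    ratio = difflib.SequenceMatcher(None, a, b).ratio()
    # 0.8 is an arbitrary cut-off chosen for this example.
    return "close" if ratio >= 0.8 else "different"


truth = "Confirmation-Guided Discovery of First-Order Rules"
print(compare("Confirmation-guided discovery of first-order rules", truth))
```

Substituted synonyms would still land in "close" or "different" under this scheme, so they need human review or a thesaurus-based check rather than string similarity alone.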
Qualitative results • Most users preferred the hybrid mode • Most perceived it to be faster than manual data entry • Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches • Both were of good quality
Discussion • Results support hypotheses • People prefer the hybrid interface, and found it more satisfying to use • Accessibility issues exist, but can be overcome • The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty – i.e. it did more than the subject wanted!
1) Challenges in metadata extraction 2) Manual metadata generation 3) Metadata extraction in brief 4) Practical use as part of a repository deposit workflow 5) A user study comparing manual and hybrid input 6) Towards metadata triage
MetRe prototype (2008) • Characteristic classes of individual/systematic error highlighted • NB: local and general best practice. Uses: ranking, browsing, correcting systematic error • Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
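One simple way to surface candidate systematic errors in harvested metadata is to group values that differ only in case or spacing and rank the minority spellings by how often they occur. The sketch below does this with a frequency count; it is an illustration of the general approach, not MetRe's actual algorithm, and the sample values are made up.

```python
from collections import Counter, defaultdict


def variant_report(values):
    """Return (preferred_value, [(variant, count), ...]) for each group of
    values that differ only in case or whitespace, most frequent variants first."""
    groups = defaultdict(Counter)
    for value in values:
        key = " ".join(value.lower().split())
        groups[key][value] += 1
    report = []
    for counts in groups.values():
        if len(counts) > 1:
            ranked = counts.most_common()
            # Treat the most frequent spelling as local best practice.
            report.append((ranked[0][0], ranked[1:]))
    # Groups with the most deviating records come first.
    report.sort(key=lambda item: -sum(n for _, n in item[1]))
    return report


# Made-up sample of harvested values for a single metadata field.
harvested = [
    "University of Bath",
    "University of Bath",
    "university of bath",
    "University  of Bath",
    "UKOLN",
]
for preferred, variants in variant_report(harvested):
    print(preferred, variants)
```

Ranking by the number of deviating records puts the errors that are cheapest to fix in bulk at the top, which fits the triage use described on the slide.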
Issues • Discipline/domain-specific issues • Lots of information required to do this right (see metadata schema/terminology registry) • Some APs present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)
Approach • Generally dependent on heuristics over available data • Powered by very specific functions (classifiers, validation, etc…) • Potentially expensive, not always domain-independent
Future work • More! – Data – Filters (input/output formats) – Methods – Evaluation – Service availability (mail me for announcements!)
Conclusion • Metadata creation can be supported through software • Specific problem sets in metadata triage • Work continues in the FixRep project
Conclusion (II) • Formal Metadata Extraction/evaluation • Metadata review process • Accessibility metadata • Entity extraction (named entities, geographical, temporal [k-int!]) • Repository integration
• Thanks! • Comments/Questions? • www.ukoln.ac.uk/projects/fixrep


