  • Number of slides: 40

Approaches to automated metadata extraction: FixRep Project
Emma Tonkin
e.tonkin@ukoln.ac.uk
UKOLN is supported by: www.ukoln.ac.uk A centre of expertise in digital information management www.bath.ac.uk

Wouldn't it be nice if...
• ...computers could author our metadata for us, thus saving a lot of hassle?
• Mechanical metadata extraction vs manual metadata input

But...
• Automated tools are fallible
• There's never quite enough information available
• Templates change, and different domains have different standards
• In short, computers are often wrong, and so are people

The 'half a loaf' hypothesis
• Hybrid approach:
  – Get what metadata you can
  – Ask the user to check and clean it if necessary
• Philosophy:
  – If the computer gets it wrong, we can fix it later

Wouldn’t it be nice if…
• …computers could fix our metadata for us?
• Or, more realistically, help us do this work for ourselves.

• All about ‘fixing it later’: doing what we can with what we have
• Automated metadata extraction + metadata consistency assessment
• Metadata generation, evaluation, characterisation: enabling metadata triage

1) Challenges in automated metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

Whatever can go wrong...
• PDFs can be:
  – Encrypted
  – Corrupted
  – Oddly encoded
  – An image file without embedded text
  – Occurrence: ~3-6%
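The failure modes listed above can be screened for cheaply before any extraction is attempted. The following is a minimal sketch, not the FixRep implementation: the function name `pdf_quick_check` is hypothetical, and the checks are byte-level heuristics that will miss cases (for example, font resources stored inside compressed object streams).

```python
def pdf_quick_check(data: bytes) -> list[str]:
    """Return a list of suspected problems found by crude byte-level checks."""
    problems = []
    # A well-formed PDF normally starts with a "%PDF-" version header.
    if not data.startswith(b"%PDF-"):
        problems.append("missing %PDF- header (corrupted or not a PDF)")
    # The end-of-file marker should appear near the end of the file.
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker (possibly truncated)")
    # Encrypted documents carry an /Encrypt entry in the trailer.
    if b"/Encrypt" in data:
        problems.append("contains /Encrypt entry (likely encrypted)")
    # Heuristic only: no font resources suggests scanned page images.
    if b"/Font" not in data:
        problems.append("no /Font resources found (possibly image-only)")
    return problems
```

A file that raises any of these flags could be routed straight to manual entry or OCR rather than to text-based extraction.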

Character sets
• Ligatures,
• Accents,
• Symbols
may not always be extractable from PDFs
Image © Daniel Ullrich
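When ligatures and accents do survive extraction, they often arrive as compatibility characters or as a base letter plus a combining mark. As a small illustration (an assumption about post-processing, not something the slides describe), Unicode NFKC normalisation from Python's standard library folds both cases into plain text; `normalise_extracted_text` is a hypothetical helper name.

```python
import unicodedata


def normalise_extracted_text(text: str) -> str:
    """Fold ligatures and decomposed accents into their usual forms."""
    # NFKC expands compatibility characters such as the "ffi" ligature
    # (U+FB03) and composes a base letter with a following combining
    # accent into a single code point.
    return unicodedata.normalize("NFKC", text)
```

This only helps when the characters were extracted at all; glyphs that the PDF maps to no Unicode value remain unrecoverable by this route.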

Document formats/layouts
• Many possible formats
• Some formats not widely supported
• Document layouts vary widely, esp. by discipline

1) Challenges in metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

Whatever can go wrong... (II)
• Function following form – interface
• Model adapted to suit unique user needs
• Data model incompletely supported
• Input validation issues
• Systematic error; typos; localisation; encoding; etc.
• Lots of past work in characterising manual input errors

1) Challenges in metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input

Image segmentation, templating & OCR

Working from text
• There are a number of possible states (i.e. title, author, email, affiliation, abstract)
• Directed graph with probabilities
  – Markov chain: for example, Title → Author → Email → Affil.
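The directed graph with probabilities can be written down as a table of transition probabilities between states. The sketch below is illustrative only: the numbers in `TRANSITIONS` are invented for the example, and the `start` state and the function name `path_probability` are assumptions, not part of the slides.

```python
# Hypothetical transition probabilities; each row sums to 1.
TRANSITIONS = {
    "start": {"title": 1.0},
    "title": {"title": 0.7, "author": 0.3},
    "author": {"author": 0.5, "email": 0.2, "affiliation": 0.3},
    "email": {"affiliation": 0.6, "abstract": 0.4},
    "affiliation": {"affiliation": 0.5, "abstract": 0.3, "author": 0.2},
    "abstract": {"abstract": 1.0},
}


def path_probability(path: list[str]) -> float:
    """Probability of visiting the given states in order, from 'start'."""
    probability = 1.0
    previous = "start"
    for state in path:
        # A missing edge in the graph means the transition is impossible.
        probability *= TRANSITIONS[previous].get(state, 0.0)
        previous = state
    return probability
```

With these made-up numbers, the chain Title → Author → Email is possible but a document that opens with an author line is not, because the graph has no edge from `start` to `author`.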

Hidden Markov Model
• We cannot directly see these states – only the words
• But we can gather statistics on the correlation between the words and the underlying states, to inform guesses as to how the data should be segmented
• This may be expressed in terms of an HMM
• Bayesian statistics used across term appearance
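To make the idea concrete, here is a toy Viterbi decoder that picks the most probable hidden label sequence for a list of tokens. It is a sketch under stated assumptions: only three states are modelled, every probability is hand-picked for illustration rather than estimated from labelled papers, and the emission model looks at crude token features (contains "@", all capitals) instead of real term statistics.

```python
import math

STATES = ["title", "author", "email"]
# All probabilities below are invented for illustration.
START = {"title": 0.9, "author": 0.05, "email": 0.05}
TRANS = {
    "title": {"title": 0.7, "author": 0.25, "email": 0.05},
    "author": {"title": 0.05, "author": 0.7, "email": 0.25},
    "email": {"title": 0.05, "author": 0.05, "email": 0.9},
}


def emission(state: str, token: str) -> float:
    """P(token feature class | state) for three crude feature classes."""
    if "@" in token:
        return {"title": 0.01, "author": 0.01, "email": 0.90}[state]
    if token.isupper():
        return {"title": 0.19, "author": 0.79, "email": 0.05}[state]
    return {"title": 0.80, "author": 0.20, "email": 0.05}[state]


def viterbi(tokens: list[str]) -> list[str]:
    """Return the most probable state sequence for a non-empty token list."""
    # best[s] = (log-probability, path) of the best labelling ending in s
    best = {
        s: (math.log(START[s]) + math.log(emission(s, tokens[0])), [s])
        for s in STATES
    }
    for token in tokens[1:]:
        new_best = {}
        for s in STATES:
            # Choose the predecessor that maximises the score of reaching s.
            score, path = max(
                (best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES
            )
            new_best[s] = (score + math.log(emission(s, token)), path + [s])
        best = new_best
    return max(best.values())[1]
```

For example, `viterbi(["Confirmation-Guided", "Discovery", "PETER", "FLACH", "author@example.org"])` labels the first two tokens as title, the next two as author, and the last as email; the address is a placeholder, not taken from the paper.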

Example parse
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• ...
• Confirmation-Guided Discovery of First-Order Rules, PETER A. FLACH, NICOLAS LACHICHE
• Self-correcting, to the extent that the knowledge base grows as new papers are added to the collection

1) Challenges in metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

Aims
• Adaptation of existing interfaces
• Enhancing rather than rewriting
• Cross-platform, accessible interface
• Simple reusable REST API, metadata as DC/XML
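As an illustration of what "metadata as DC/XML" might look like on the wire, the sketch below serialises extracted fields as Dublin Core elements using Python's standard library. The wrapper element name `metadata` and the function name `to_dc_xml` are assumptions for the example; the slides do not specify the service's actual response format.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_dc_xml(record: dict[str, list[str]]) -> str:
    """Serialise {element name: [values]} as simple Dublin Core XML."""
    ET.register_namespace("dc", DC_NS)
    # "metadata" is a hypothetical wrapper element for this sketch.
    root = ET.Element("metadata")
    for field, values in record.items():
        for value in values:
            # Repeatable elements (e.g. dc:creator) get one node per value.
            child = ET.SubElement(root, f"{{{DC_NS}}}{field}")
            child.text = value
    return ET.tostring(root, encoding="unicode")
```

A deposit form could request this document for an uploaded file and pre-fill its title and creator fields from the `dc:title` and `dc:creator` elements, leaving the user to check them.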

Sample interfaces

Sample interfaces

Architecture

Using what we know...

1) Challenges in metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

Question:
• “Do people accept ‘hybrid’ interfaces?”
• Here’s one we did earlier…

Hypotheses
• Correcting extracted metadata is faster than entering or cutting-and-pasting metadata.
• The resulting metadata has fewer errors when the user is provided with already extracted metadata to correct.
• User satisfaction with a system is higher if it 'tries' to extract metadata, even if it fails.
• Measured: speed and accuracy of manual versus hybrid entry and, qualitatively, user satisfaction

Results: Timing
• Hybrid faster under both conditions
• (Summary of median times)

Results: Accuracy
• Tested against ground truth
• Keyword accuracy: the first keyword listed was relevant for 46% of the publications. The top two were relevant in 66%; the top five cover 81% of all desired keywords.
• Manual metadata accuracy:
  – Few users use cut and paste
  – Capitalisation and punctuation frequently differ
  – Synonyms are accidentally substituted
• Hybrid closer to ground truth, and more complete, but results not clear-cut.
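The slide does not say exactly how the keyword figures were computed. One plausible reading of "relevant in the top k" is a hit rate: the share of publications for which at least one of the first k suggested keywords appears in the ground-truth set. The sketch below implements that reading only; the function name `hit_rate_at_k` and the idea of exact string matching are assumptions.

```python
def hit_rate_at_k(
    suggested: list[list[str]], relevant: list[set[str]], k: int
) -> float:
    """Share of publications with a relevant keyword among the top k.

    suggested[i] is the ranked keyword list for publication i and
    relevant[i] is its ground-truth keyword set.
    """
    hits = sum(
        1
        for ranked, truth in zip(suggested, relevant)
        if set(ranked[:k]) & truth  # non-empty overlap counts as a hit
    )
    return hits / len(suggested)
```

In practice keywords would probably need case and punctuation folding before comparison, for the same reasons the slide gives for manual metadata.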

Qualitative results
• Most users preferred the hybrid mode
• Most perceived it to be faster than manual data entry
• Few believed the hybrid approach to be more accurate; in practice, there was no significant difference in quality between the hybrid and manual approaches
• Both were of good quality

Discussion
• Results support hypotheses
• People preferred the hybrid interface, and found it more satisfying to use
• Accessibility issues exist, but can be overcome
• The punchline: one subject actually preferred manual entry because the hybrid system filled in metadata fields that he preferred to leave empty, i.e. it did more than the subject wanted!

1) Challenges in metadata extraction
2) Manual metadata generation
3) Metadata extraction in brief
4) Practical use as part of a repository deposit workflow
5) A user study comparing manual and hybrid input
6) Towards metadata triage

MetRe prototype (2008)
• Characteristic classes of individual/systematic error highlighted
• NB: local and general best practice. Uses: ranking, browsing, correcting systematic error
• Uses info from intra-/inter-repository harvested metadata to identify patterns, rank occurrences and co-occurrences
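One simple way to surface systematic error from harvested metadata is to count how often each value of a field occurs and group values that differ only in case or punctuation. The sketch below illustrates that idea; it is not the MetRe implementation, and the function name `variant_groups` and the folding rule are assumptions.

```python
import re
from collections import Counter, defaultdict


def variant_groups(values: list[str]) -> dict[str, list[str]]:
    """Map the most frequent spelling of a value to its rarer variants."""
    counts = Counter(values)
    groups = defaultdict(list)
    for value, n in counts.items():
        # Fold case and collapse punctuation/whitespace to build a key.
        key = re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()
        groups[key].append((n, value))
    result = {}
    for variants in groups.values():
        if len(variants) > 1:
            # Highest count first: treat it as the preferred form.
            variants.sort(reverse=True)
            result[variants[0][1]] = [v for _, v in variants[1:]]
    return result
```

The returned groups could feed a ranking or browsing view in which a curator confirms the preferred form and corrects the rarer variants in bulk.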


Issues
• Discipline/domain-specific issues
• Lots of information required to do this right (see metadata schema/terminology registry)
• Some application profiles (APs) present particular difficulties, such as SWAP (FRBR structure, linking objects by ‘Scholarly Work’)

Approach
• Generally dependent on heuristics over available data
• Powered by very specific functions (classifiers, validation, etc.)
• Potentially expensive, not always domain-independent

Future work
• More!
  – Data
  – Filters (input/output formats)
  – Methods
  – Evaluation
  – Service availability (mail me for announcements!)

Conclusion
• Metadata creation can be supported through software
• Specific problem sets in metadata triage
• Work continues in the FixRep project

Conclusion (II)
• Formal metadata extraction/evaluation
• Metadata review process
• Accessibility metadata
• Entity extraction (named entities, geographical, temporal [k-int!])
• Repository integration

• Thanks!
• Comments/Questions?
• www.ukoln.ac.uk/projects/fixrep