Introduction to Web Science
Harvesting the SW (Semantic Web)
Dr Alexiei Dingli
Six Challenges of the Knowledge Life Cycle
• Acquire
• Model
• Reuse
• Retrieve
• Publish
• Maintain
Information Extraction vs. Retrieval
[diagram contrasting IR and IE]
A Couple of Approaches
• Active learning to reduce the annotation burden
  – Supervised learning
  – Adaptive IE
  – The Melita methodology
• Automatic annotation of large repositories
  – Largely unsupervised
  – Armadillo
The Seminar Announcements Task
• Created by the Carnegie Mellon School of Computer Science
• How to retrieve
  – Speaker
  – Location
  – Start Time
  – End Time
• From seminar announcements received by email
Seminar Announcements Example

  Dr. Steals presents in Dean Hall at one am.

becomes

  <speaker>Dr. Steals</speaker> presents in <location>Dean Hall</location> at <stime>one am</stime>.
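To make the annotation format concrete, here is a toy Python sketch that wraps regular-expression matches in inline tags of this form. It is purely illustrative: the patterns and the annotate function are invented for this example, and the real CMU task is tackled with learned extraction rules (e.g. Amilcare) rather than hand-written regexes; a location tag would additionally need a gazetteer.

import re

# Toy pattern-based annotator (illustrative only; the patterns are invented and
# the real task is solved with learned extraction rules, not regexes).
PATTERNS = {
    "speaker": r"\b(?:Dr|Prof|Mr|Ms)\.\s+[A-Z][a-z]+",
    "stime":   r"\b(?:\d{1,2}(?::\d{2})?\s*(?:am|pm)|one\s+am)\b",
}

def annotate(text):
    """Wrap every pattern match in <tag>...</tag> markup."""
    for tag, pattern in PATTERNS.items():
        text = re.sub(pattern, lambda m: f"<{tag}>{m.group(0)}</{tag}>", text)
    return text

print(annotate("Dr. Steals presents in Dean Hall at one am."))
# <speaker>Dr. Steals</speaker> presents in Dean Hall at <stime>one am</stime>.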
Information Extraction Measures
• Precision: how many of the retrieved documents are relevant?
• Recall: how many of all the relevant documents were retrieved?
• F-measure: the weighted harmonic mean of precision and recall
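For reference, the standard formulas behind these three questions (not spelled out on the slide): precision P = |relevant ∩ retrieved| / |retrieved|, recall R = |relevant ∩ retrieved| / |relevant|, and the balanced F-measure F = 2PR / (P + R).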
IE Measures Example
• I ask the librarian to search for books on cars. There are 10 relevant books in the library; he finds 8, of which only 4 are relevant. What are his precision, recall and F-measure?
IE Measures Answers
• Same scenario: 10 relevant books in the library, 8 retrieved, 4 of them relevant
• Precision = 4/8 = 50%
• Recall = 4/10 = 40%
• F-measure = (2 × 50 × 40) / (50 + 40) = 44.4%
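A minimal Python check of the arithmetic above (my own sketch, not part of the slides):

def precision_recall_f(correct, retrieved, relevant):
    """Precision, recall and balanced F-measure from raw counts."""
    p = correct / retrieved
    r = correct / relevant
    return p, r, 2 * p * r / (p + r)

# Librarian example: 8 books retrieved, 4 of them relevant, 10 relevant in total.
p, r, f = precision_recall_f(4, 8, 10)
print(f"P={p:.1%}  R={r:.1%}  F={f:.1%}")   # P=50.0%  R=40.0%  F=44.4%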
Adaptive IE
• What is IE?
  – Automated extraction of structured information from unstructured or partially structured machine-readable documents
• What is AIE?
  – Performs the tasks of traditional IE
  – Exploits the power of Machine Learning in order to adapt to
    • complex domains with large amounts of domain-dependent data
    • different sub-language features
    • different text genres
  – Treats the usability and accessibility of the system as important
Amilcare
• Tool for adaptive IE from Web-related texts
  – Specifically designed for document annotation
  – Based on the (LP)² algorithm (Linguistic Patterns by Learning Patterns)
  – Covering algorithm based on lazy NLP
  – Trains with a limited amount of examples
  – Effective on different text types
    • free texts
    • semi-structured texts
  – Uses GATE and ANNIE for preprocessing
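As a rough idea of what a covering algorithm does (a generic toy sketch, not the actual (LP)² implementation): rules are learned one at a time, and the training examples a new rule already covers are removed before the next rule is induced.

from collections import Counter

# Toy covering loop (illustrative only, not (LP)2 itself). Each "example" is the
# word immediately before an annotated slot filler, and a "rule" is simply such
# a trigger word. Real (LP)2 rules generalise over words, POS tags and gazetteer
# classes, and are refined by correction and contextual rules.
examples = ["by", "speaker:", "by", "with", "speaker:", "by"]

def learn_covering_rules(examples):
    rules, uncovered = [], list(examples)
    while uncovered:
        best, _ = Counter(uncovered).most_common(1)[0]   # rule covering most examples
        rules.append(best)
        uncovered = [e for e in uncovered if e != best]  # drop what the rule covers
    return rules

print(learn_covering_rules(examples))   # ['by', 'speaker:', 'with']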
CMU: Detailed Results
1. Best overall accuracy
2. Best result on the speaker field
3. No results below 75%
GATE
• General Architecture for Text Engineering
  – Provides a software infrastructure for researchers and developers working in NLP
• Contains
  – Tokeniser
  – Gazetteers
  – Sentence Splitter
  – POS Tagger
  – Semantic Tagger (ANNIE)
  – Co-reference Resolution
  – Multilingual support
  – Protégé
  – WEKA
  – many more exist and can be added
• http://www.gate.ac.uk
Annotation
• Current practice of annotation for knowledge identification and extraction
  – is time consuming
  – needs annotation by experts
  – is complex
• Goal: reduce the burden of text annotation for Knowledge Management
Different Annotation Systems
• SGML, TEX
• Xanadu, CoNote, ComMentor
• JotBot, Third Voice, Annotate.net
• The Annotation Engine
• Alembic, The GATE Annotation Tool
• iMarkup, Yawas
• MnM, S-CREAM
Melita
• Tool for assisted automatic annotation
• Uses an Adaptive IE engine to learn how to annotate (no rule writing is needed to adapt the system)
• User: annotates document samples
• IE System:
  – Trains while users annotate
  – Generalises over seen cases
  – Provides preliminary annotation for new documents
• Performs smart ordering of documents
• Advantages
  – Annotates trivial or previously seen cases
  – Focuses slow/expensive user activity on unseen cases
  – The user mainly validates extracted information
    • Simpler and less error prone / speeds up corpus annotation
  – The system learns how to improve its capabilities
Methodology: Melita (Bootstrap Phase)
[diagram] The user annotates bare text; Amilcare learns in the background.
Methodology: Melita (Checking Phase)
[diagram] Amilcare annotates the bare text alongside the user; it keeps learning in the background from missing tags and mistakes.
Methodology: Melita (Support Phase)
[diagram] Amilcare annotates the bare text; the user corrects; the corrections are used to retrain.
Smart Ordering of Documents
[diagram] The user annotates; the system learns from the annotations, tries to annotate all the remaining documents, and selects for the user the documents with only partial annotations.
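A hedged sketch of what such a document-ordering heuristic could look like. The scoring rule below (prefer documents the current model can only partly annotate) and all names are assumptions made for illustration, not Melita's actual code.

# Assumed "smart ordering" heuristic (illustration only, not Melita's code):
# documents that the learned annotator fills only partially are the ones where
# a human correction is likely to teach the system the most.
def select_next_document(documents, annotator, expected_slots):
    """Pick the document whose automatic annotation is most partial."""
    def distance_from_half(doc):
        coverage = len(annotator(doc)) / expected_slots   # fraction of slots filled
        return abs(coverage - 0.5)                        # ~50% coverage = most informative
    return min(documents, key=distance_from_half)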
Intrusivity
• An evolving system is difficult to control
• Goal:
  – Avoid unwelcome/unreliable suggestions
  – Adapt proactivity to the user's needs
• Method:
  – Allow users to tune proactivity
  – Monitor user reactions to suggestions
Methodology: Melita (User Interface)
[screenshot] Control panel, ontology defining the concepts, and document panel.
Results
[table] Per-tag training results: columns Tag, Amount of texts needed for training, Prec, Rec; rows stime, etime, location, speaker (cell values: 20, 20, 84, 96, 63, 72, 60, 30, 82, 61, 100, 75, 70, 30)
Future Work
• Research better ways of annotating concepts in documents
• Optimise document ordering to maximise the discovery of new tags
• Allow users to edit the rules
• Learn to discover relationships!
• Not only suggest but also correct user annotations!
Annotation for the Semantic Web
• The Semantic Web requires document annotation
  – Current approaches: manual (e.g. OntoMat) or semi-automatic (MnM, S-CREAM, Melita)
• BUT: manual/semi-automatic annotation of
  – large, diverse repositories
  – containing different and sparse information
  is infeasible
  – e.g. a Web site of 1,600 pages
Redundancy
• Information on the Web (or in large repositories) is redundant
• The same information is repeated in different superficial formats
  – Databases/ontologies
  – Structured pages (e.g. produced by databases)
  – Largely structured pages (bibliography pages)
  – Unstructured pages (free texts)
The Idea
• Largely unsupervised annotation of documents
  – Based on Adaptive Information Extraction
  – Bootstrapped using the redundancy of information
• Method (see the sketch below)
  – Use the structured information (easier to extract) to bootstrap learning on the less structured sources (more difficult to extract)
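A toy, self-contained illustration of that bootstrapping loop, under simplifying assumptions of my own: Armadillo trains an adaptive IE engine, whereas this sketch just learns literal character contexts. Names known from a structured source seed the discovery of the contexts in which names occur in free text, and those contexts then harvest further names.

import re

# Toy redundancy-based bootstrapping (not Armadillo's actual code). Seeds come
# from a "structured source" (here a hard-coded list standing in for Citeseer
# or a database); the 11 characters preceding each known item are treated as a
# learned context, and new items matched by those contexts become new seeds.
structured_seeds = {"Fabio Ciravegna", "Yorick Wilks"}
free_text = ("Seminar by Fabio Ciravegna. Seminar by Yorick Wilks. "
             "Seminar by Alexiei Dingli. Talk by David Guthrie.")

def bootstrap(seeds, text):
    known = set(seeds)
    while True:
        # 1. Learn contexts: the characters immediately preceding each known item.
        contexts = {text[max(0, m.start() - 11):m.start()]
                    for name in known for m in re.finditer(re.escape(name), text)}
        # 2. Apply the learned contexts to harvest new items.
        new = {m.group(1)
               for ctx in contexts if ctx.strip()
               for m in re.finditer(re.escape(ctx) + r"([A-Z]\w+ [A-Z]\w+)", text)}
        if new <= known:
            return known                                  # nothing new learned: stop
        known |= new

print(bootstrap(structured_seeds, free_text))
# includes 'Alexiei Dingli', learned from the "Seminar by " context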
Example: Extracting Bibliographies
• Mines web sites to extract bibliographies from personal pages
• Tasks:
  – Finding people's names
  – Finding home pages
  – Finding personal bibliography pages
  – Extracting bibliography references
• Sources:
  – NE Recognition (GATE's ANNIE)
  – Citeseer/Unitrier (largely incomplete bibliographies)
  – Google
  – Homepagesearch
Mining Web Sites (1)
• Mines the site looking for people's names
• Uses
  – Generic patterns (NER)
  – Citeseer for likely bigrams
• Looks for structured lists of names
• Annotates known names
• Trains on the annotations to discover the HTML structure of the page (see the sketch below)
• Recovers all names and hyperlinks
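A minimal sketch of the "learn the HTML structure from a few known names" step. Everything here is invented for illustration: the helper induce_wrapper, the page content, and the fixed 2- and 4-character context windows; the real system trains an adaptive IE engine such as Amilcare on the annotated page rather than inducing a single regex template.

import re

# Toy wrapper induction (illustrative only; context-window sizes are arbitrary).
page = ("<ul><li><a href='/fc'>Fabio Ciravegna</a></li>"
        "<li><a href='/yw'>Yorick Wilks</a></li>"
        "<li><a href='/ad'>Alexiei Dingli</a></li></ul>")
known_names = ["Fabio Ciravegna", "Yorick Wilks"]          # e.g. found via NER/Citeseer

def induce_wrapper(page, examples):
    """Generalise the HTML immediately around the known examples into one pattern."""
    left, right = set(), set()
    for ex in examples:
        i = page.find(ex)
        left.add(re.escape(page[max(0, i - 2):i]))               # left HTML context
        right.add(re.escape(page[i + len(ex):i + len(ex) + 4]))  # right HTML context
    if len(left) == 1 and len(right) == 1:                       # consistent structure
        return left.pop() + r"([^<]+)" + right.pop()
    return None

pattern = induce_wrapper(page, known_names)
print(re.findall(pattern, page))    # recovers all three names, including the unseen one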
Experimental Results, II (Information Integration): Sheffield
• People: discovering who works in the department, using Information Integration
• Total present in site: 139
• Using generic patterns + online repositories
  – 35 correct, 5 wrong
  – Precision = 35 / 40 = 87.5%
  – Recall = 35 / 139 = 25.2%
  – F-measure = 39.1%
• Errors
  – A. Schriffin
  – Eugenio Moggi
  – Peter Gray
Experimental Results, IE (Information Extraction): Sheffield
• People: using Information Extraction
• Total present in site: 139
  – 116 correct, 8 wrong
  – Precision = 116 / 124 = 93.5%
  – Recall = 116 / 139 = 83.5%
  – F-measure = 88.2%
• Errors
  – Speech and Hearing
  – European Network
  – Department Of
  – Position Paper
  – The Network
  – To System
• Enhancements
  – Lists, postprocessor
Experimental Results: Edinburgh
• People, using Information Integration
  – Total present in site: 216
  – Using generic patterns + online repositories: 11 correct, 2 wrong
  – Precision = 11 / 13 = 84.6%
  – Recall = 11 / 216 = 5.1%
  – F-measure = 9.6%
• People, using Information Extraction
  – 153 correct, 10 wrong
  – Precision = 153 / 163 = 93.9%
  – Recall = 153 / 216 = 70.8%
  – F-measure = 80.7%
Experimental Results: Aberdeen
• People, using Information Integration
  – Total present in site: 70
  – Using generic patterns + online repositories: 21 correct, 1 wrong
  – Precision = 21 / 22 = 95.5%
  – Recall = 21 / 70 = 30.0%
  – F-measure = 45.7%
• People, using Information Extraction
  – 63 correct, 2 wrong
  – Precision = 63 / 65 = 96.9%
  – Recall = 63 / 70 = 90.0%
  – F-measure = 93.3%
Mining Web Sites (2)
• Annotates known papers
• Trains on the annotations to discover the HTML structure
• Recovers co-authoring information
Experimental Results (1)
• Papers: discovering publications in the department, using Information Integration
• Total present in site: 320
• Using generic patterns + online repositories
  – 151 correct, 1 wrong
  – Precision = 151 / 152 = 99%
  – Recall = 151 / 320 = 47%
  – F-measure = 64%
• Errors: garbage in the database!

  @misc{computer-mining,
    author = "Department Of Computer",
    title  = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
    url    = "citeseer.nj.nec.com/582939.html"
  }
Experimental Results (2)
• Papers, using Information Extraction
• Total present in site: 320
  – 214 correct, 3 wrong
  – Precision = 214 / 217 = 99%
  – Recall = 214 / 320 = 67%
  – F-measure = 80%
• Errors
  – Wrong boundaries in the detection of paper names
  – Names of workshops mistaken for paper names
Artists Domain
• Task
  – Given the name of an artist, find all the paintings of that artist
  – Created for the ArtEquAKT project
Artists Domain Evaluation

Artist      Method   Precision   Recall   F-Measure
Caravaggio  II       100.0%      61%      75.8%
            IE       100.0%      98.8%    99.4%
Cezanne     II       100.0%      27.1%    42.7%
            IE        91.0%      42.6%    58.0%
Manet       II       100.0%      29.7%    45.8%
            IE       100.0%      40.6%    57.8%
Monet       II       100.0%      14.6%    25.5%
            IE        86.3%      48.5%    62.1%
Raphael     II       100.0%      59.9%    74.9%
            IE        96.5%      86.4%    91.2%
Renoir      II        94.7%      40.0%    56.2%
            IE        96.4%      60.0%    74.0%

(II = Information Integration, IE = adaptive Information Extraction)
User Role
• Provides …
  – A URL
  – A list of services
    • Already wrapped (e.g. Google is in the default library)
    • Or trained as wrappers using examples
  – Examples of fillers (e.g. project names)
• In case of need …
  – Correcting intermediate results
  – Reactivating Armadillo when paused
Armadillo
• Library of known services (e.g. Google, Citeseer)
• Tools for training learners for other structured sources
• Tools for bootstrapping learning
  – From un/structured sources
  – No user annotation
  – Multi-strategy acquisition of information using redundancy
• User-driven revision of results
  – With re-learning after user correction
Rationale
• Armadillo learns how to extract information from large repositories by integrating information from diverse and distributed resources
• Uses:
  – Ontology population
  – Information highlighting
  – Document enrichment
  – Enhancing the user experience
Data Navigation (1) [screenshot]
Data Navigation (2) [screenshot]
Data Navigation (3) [screenshot]
IE for the SW: The Vision
• Automatic annotation services
  – For a specific ontology
  – Constantly re-indexing/re-annotating documents
  – Semantic search engine
• Effects:
  – No annotation stored in the document
    • Just as today's search indexes are not stored in the documents
  – No legacy with the past
    • Annotation with the latest version of the ontology
    • Multiple annotations for a single document
  – Simplifies maintenance
    • e.g. when a page has changed but has not yet been re-annotated
Questions?