- Number of slides: 52
GATE, SWAN and Semantic TV
http://gate.ac.uk/
Hamish Cunningham
Department of Computer Science, University of Sheffield

Contents
1. Language Technology and the Knowledge Economy
2. Information Extraction and Ontology-Based IE
3. GATE: infrastructure and IE in practice
4. Three examples: parallel data mining, digital libraries, video indexing
5. SWAN: OBIE meets the Web
6. Semantic TV

The Knowledge Economy and Human Language
Gartner, December 2002:
• taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
• through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction:
• to deal with the information deluge we need formal knowledge in semantics-based systems
• our communication culture is in informal and ambiguous natural language
The challenge: to reconcile these two phenomena

HLT: Closing the Loop
[Diagram: a loop between Human Language / Controlled Language and Formal Knowledge (ontologies and instance bases). OIE, (A)IE and CLIE go from language to formal knowledge; (M)NLG goes from formal knowledge back to language. Also labelled: Semantic Web; Semantic Grid; Semantic Web Services.]
KEY
• MNLG: Multilingual Natural Language Generation
• OIE: Ontology-aware Information Extraction
• AIE: Adaptive IE
• CLIE: Controlled Language IE

Information Extraction
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
• MUC: Message Understanding Conferences
• ACE: Automatic Content Extraction
• CoNLL: Conference on Natural Language Learning
• Pascal (2005): ontology-based IE

Conventional IE Example
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
• NE (named entities): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO (coreference): "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE (template elements): the rocket is "shiny red" and Head's "brainchild"
• TR (template relations): Dr. Head works for We Build Rockets Inc.
• ST (scenario template): rocket launch event with various participants

Ontology-based IE
“XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”
[Diagram: the text linked to an ontology and KB. Classes: Company, Location, City, Country. Instances: XYZ, London, UK, Bulgaria, “03/11/1978”. Edge labels: type, HQ, partOf, establOn.]

Ontology-Based IE (OBIE)
• Conventional IE tags selected segments of text whenever that text represents the name of an entity
• OBIE: view entities as mentions of the underlying instances from the ontology
• Identify which mentions in the text refer to which instances in the ontology
• Add new instances if needed
• Identify instances of attributes and relations – take into account what is allowed given the ontology, using domain and range as constraints

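The domain and range constraint in the last bullet can be sketched in a few lines. This is only an illustration under stated assumptions: the toy ontology, the relation signatures and the function names below are invented for the example (loosely following the XYZ slide) and are not GATE or KIM code.

```python
# Hypothetical toy ontology; class and relation names are illustrative only.
SUBCLASS_OF = {"City": "Location", "Country": "Location"}  # child -> parent
INSTANCE_TYPE = {
    "XYZ": "Company",
    "London": "City",
    "UK": "Country",
    "Bulgaria": "Country",
}
# relation -> (domain class, range class); assumed signatures
RELATIONS = {"HQ": ("Company", "City"), "partOf": ("City", "Country")}


def is_a(cls, ancestor):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def allowed(subject, relation, obj):
    """Accept a candidate relation instance only if both ends are known
    instances whose classes fit the relation's domain and range."""
    if relation not in RELATIONS:
        return False
    if subject not in INSTANCE_TYPE or obj not in INSTANCE_TYPE:
        return False
    domain, range_ = RELATIONS[relation]
    return is_a(INSTANCE_TYPE[subject], domain) and is_a(INSTANCE_TYPE[obj], range_)


if __name__ == "__main__":
    print(allowed("XYZ", "HQ", "London"))    # candidate fits Company -> City
    print(allowed("XYZ", "HQ", "Bulgaria"))  # rejected: Bulgaria is not a City
```

A filter of this kind would sit after the step that proposes candidate relations, so that the ontology prunes extractions that cannot be valid.
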
Classes, instances & metadata
“Gordon Brown met George Bush during his two-day visit.”

Classes, instances & metadata (2)
“Gordon Brown met Tony Blair to discuss the university tuition fees.”

Challenges for IE for SemWeb
• Portability – different and changing ontologies
• Different text types – structured, free, etc.
• Utilise ontology information where available
• Train from a small amount of annotated text
• Output results with respect to the given ontology – bridge the gap demonstrated in S-CREAM
• Learn/Model at the right level – ontologies are hierarchical and data will get sparser the lower we go

Deploying IE
Domain specificity vs. task complexity: a necessary trade-off
[Chart: performance level (marks at 100%, 90%, 80% and 30%, with an "acceptable accuracy" level) against task complexity (simple to complex: bag-of-words, entities, relations, events) and specificity (general to domain specific).]

Software lifecycle in collaborative research
1. Project Proposal: We love each other. We can work so well together. We can hold workshops on Santorini together. We will solve all the problems of AI that our predecessors were too stupid to.
2. Analysis and Design: Stop work entirely, for a period of reflection and recuperation following the stress of attending the kick-off meeting in Luxembourg.
3. Implementation: Each developer partner tries to convince the others that program X that they just happen to have lying around on a dusty disk-drive meets the project objectives exactly and should form the centrepiece of the demonstrator.
4. Integration and Testing: The lead partner gets desperate and decides to hard-code the results for a small set of examples into the demonstrator, and have a fail-safe crash facility for unknown input ("well, you know, it's still a prototype...").
5. Evaluation: Everyone says how nice it is, how it solves all sorts of terribly hard problems, and how if we had another grant we could go on to transform information processing the world over (or at least the European business travel industry).

Infrastructure and Science
• Physicists have supercolliders; medics have MRI scanners; HLT researchers have... Perl?
• Other relevant trends:
– EU funds multi-site collaborative projects
– Realisation of the role of engineering in scalability, reusability, and portability
– Support for large data, in multiple media, languages, formats, and locations
– Promotion of quantitative evaluation metrics
• Hence GATE, a General Architecture for Text Engineering (est. 1995)

GATE, a General Architecture for Text Engineering is...
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.
GATE comes with...
• Free components, and wrappers for other people's
• Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites

A bit of a nuisance (GATE users)
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• DERI, Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...
GATE team projects
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• ACE / TIDES: Arabic, Chinese NE
• JHU summer workshop on semtagging
• EMILLE: South Asian languages corpus
• hTechSight: chemical engineering knowledge portal
Present:
• Advanced Knowledge Technologies: €12m UK five-site collaborative project
• SEKT: Semantic Knowledge Technology
• PrestoSpace: MM Preservation/Access
• KnowledgeWeb: Semantic Web
• ETCSL: Sumerian Digital Library
• ENIRAF, MMKM networks
Future:
• New eContent project LIRICS

GATE – components and services
• HLT systems composed of components
• GATE versions:
– v1: dynamic loading of shared object libraries with Tcl wrappers
– v2, v3: Java beans with URL loading, XML metadata, produce web services externally
– v4: core web services (both produce and consume), new LIRICS project out of ISO TC 37/SC 4 (link up with SWS in SDK?)

GATE – infrastructure for semantic metadata extraction
• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications: Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German
• Used in OntoText KIM, SDK, Text2Onto, ...

Example 1: Massively Parallel Clustering and Classification
• D2K (Data 2 Knowledge): data mining / machine learning with visual programming development tool
• T2K: library of text processing modules built on D2K
• Integrates data mining methods for prediction, discovery, and deviation detection, with information visualization tools
• Offers a visual programming environment
• Distributed computing / parallel processing facilities
• From NCSA: http://alg.ncsa.uiuc.edu/do/tools/t2k

T2K demos; EmailClassification_GATE
Email classification results
GATE_IE_VIZ demo with component information
Entities extracted from news corpus
Document clustering using GATE feature extraction
Dendrogram of the clustering results

Example 2: Digital Libraries
• Greenstone:
– Digital Library with automated ingestion, structuring and indexing
– Full text and fielded search (Dublin Core)
– GATE-based entity tagging
– From Maori to Arabic, Russian to Chinese
– UNESCO’s Information for All Programme
• Perseus:
– One of the oldest and biggest humanities DLs
– Provides rich interlinking of related resources
– Models time and space via materials' dates and locations
– GATE-based automated hyperlinking etc.

Greenstone
Perseus: time-line and geographic visualisation, http://www.perseus.tufts.edu/

Example 3: the MUMIS project
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for the same events can improve extraction quality
– PrestoSpace applications in news and sports archiving

Semantic Query
Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)
Instead: “goal events with scorer David Beckham”

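The contrast between the two queries can be made concrete with a small sketch. Everything here is assumed for illustration: the `Event` record, its field names and the three sample sentences are invented, and the real MUMIS index is not described on the slide.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str    # event type assigned by IE, e.g. "goal" or "missed_goal"
    player: str  # main participant of the event
    text: str    # source sentence the event was extracted from


# Hypothetical extraction output; the sentences are made up.
EVENTS = [
    Event("goal", "David Beckham", "Beckham scores from the free kick, a superb goal."),
    Event("missed_goal", "David Beckham", "Beckham shoots wide and misses the goal."),
    Event("goal", "Player B", "Player B taps in the goal after Beckham's cross."),
]


def keyword_search(events, *words):
    """Bag-of-words query: every word must occur somewhere in the text."""
    return [e for e in events if all(w.lower() in e.text.lower() for w in words)]


def semantic_search(events, kind, player):
    """Structured query over the extracted event type and participant."""
    return [e for e in events if e.kind == kind and e.player == player]


if __name__ == "__main__":
    print(len(keyword_search(EVENTS, "goal", "Beckham")))
    print(len(semantic_search(EVENTS, "goal", "David Beckham")))
```

In this toy data the keyword query also matches the missed shot and another player's goal, while the structured query returns only the event in which Beckham is the scorer.
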
The results: England win!
SWAN: a Semantic Web Annotator
• Collaboration between DERI/NUIG, OntoText and USFD, hosted at DERI
• Large heap of IBM hardware in your server room
• Objective: make the cooling fans run flat-out
• Conceptual indexing of news or other web fractions
• Quantitative media reporting
• Annotated web workbench service
• Custom knowledge services
• Demo and poster at ESWS

SWAN Scenarios (1)
Financial Analysts
• Indications of how a company is viewed:
– How many instances predicting strong performance for a particular company are out there? Over the past year how has the profile of predictions for this company changed? How many positive/negative sentiments were expressed for the company?
Marketing Strategists
• Support campaign tuning today based on yesterday's results:
– In this morning's IT press 7% of articles discussed your company. The average proportion of the article directly relating to your company was 33%. The figures for the other key players in your sector are summarised in the following table.
• Extent of media coverage relative to spend events:
– Company Y exhibited at Comdex. In the week following the exhibition 20% of the press that covered Comdex mentioned Y.

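The two marketing figures (share of articles that discuss a company, and the average proportion of each such article that relates to it) could be computed from entity annotations roughly as follows. This is a sketch under assumptions: the input layout (each article as a list of sentences, each sentence as the set of company names recognised in it) and the function name are invented here, and SWAN's actual metrics are not specified on the slide.

```python
def coverage(articles, company):
    """Return (share of articles mentioning the company,
    mean fraction of sentences mentioning it within those articles).

    articles: list of articles; each article is a list of sentences;
    each sentence is the set of entity names an IE step found in it.
    """
    mentioning = [a for a in articles if any(company in s for s in a)]
    share = len(mentioning) / len(articles) if articles else 0.0
    if not mentioning:
        return share, 0.0
    per_article = [sum(company in s for s in a) / len(a) for a in mentioning]
    return share, sum(per_article) / len(per_article)


# Hypothetical annotated press sample.
ARTICLES = [
    [{"Y"}, {"IBM"}, {"Y", "IBM"}, set()],  # 2 of 4 sentences mention Y
    [{"IBM"}],
    [{"Y"}, set()],                         # 1 of 2 sentences mention Y
    [set(), {"IBM"}],
]

if __name__ == "__main__":
    share, depth = coverage(ARTICLES, "Y")
    print(f"{share:.0%} of articles mention Y; average proportion {depth:.0%}")
```

Counting at sentence level is one possible reading of "proportion of the article"; token or paragraph counts would work the same way.
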
SWAN Scenarios (2)
PR Workers
• Identify negative reporting events (to issue denials, obfuscations, bribes etc.):
– The table below summarises 12 negative reporting events concerning your company in the last 24 hours of IT news.
Media Analysts
• A range of media metrics, e.g. the "media distance" between concepts and products/companies:
– The media distance between your company and the subject of XML is 0.09; for IBM the value is 0.2.

SWAN Scenarios (3)
Sales
• Generate "black books": lists of contacts in the organisations for sales staff.
• Business structures are continually changing and reported in the news.
• Track works-for and joining and leaving reporting events
Public Interest Services
• In order to generate interest and to prototype the system we may wish to provide a free public service, for example about sport, or theatre and cinema alerts.

KIM
• Ontology (KIMO) + 200K instances KB (5m statements)
• Lookup phase marks mentions from the ontology
• Combined with rule-based IE system to recognise new instances of concepts and relations
• High ambiguity of instances with the same label – uses disambiguation step
• Special KB enrichment stage where some of these new instances are added to the KB
• Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC’03

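The idea of ordering same-label entities by corpus statistics can be illustrated with a frequency count. This is a minimal sketch, not KIM's algorithm: the slide does not say which statistics KIM uses, so raw referent frequency is an assumption, and the instance identifiers and function names are invented.

```python
from collections import Counter


def build_ranking(annotated_mentions):
    """Count how often each instance was the referent in an annotated corpus.

    annotated_mentions: iterable of (label, instance_id) pairs.
    """
    return Counter(instance for _label, instance in annotated_mentions)


def rank_candidates(label, candidates_by_label, ranking):
    """Candidates for a label, most frequent first (ties broken by name)."""
    candidates = candidates_by_label.get(label, [])
    return sorted(candidates, key=lambda inst: (-ranking[inst], inst))


# Hypothetical KB lookup table and corpus counts.
CANDIDATES = {"Paris": ["Paris_Texas", "Paris_France"]}
RANKING = build_ranking(
    [("Paris", "Paris_France")] * 3 + [("Paris", "Paris_Texas")]
)

if __name__ == "__main__":
    print(rank_candidates("Paris", CANDIDATES, RANKING))
```

A fuller disambiguation step would presumably combine such a prior with the context of the mention; the sketch covers only the priority ordering.
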
OBIE in KIM
Popov et al. KIM. ISWC’03

SWAN Logical Architecture
[Diagram: components shown are Web; Focussed crawling (three instances); IE (64 bit); IE crawling (32 bit); Annotation (Oracle); Knowledge base (Sesame); Web UI, Web services; Service Users; UI Users.]

Cluster Controller

SWAN: Status, Future
Now
• Hardware working, crawling and annotating news sites
• IE tuning and evaluation in progress
Next steps
• Public demonstration service
• More news, sports domain
• More languages (parallel corpus, align, project markup, learn recogniser for new language)
• Negative reporting events

Trend 1: DRM: end of civilisation as we know it
• Digital Rights Management (DRM) controls how you consume media you buy
• Has the potential to be linked with censorship and with invasive behaviour logging
• You can't make digital objects behave like physical objects - unless you totally control the hardware and the operating system
• If someone does gain control, then we may end up finding that someone has given the contract for news and culture to Halliburton, for example

Seconds out, round 5: file sharing is about to go social
• Round 1: Napster's explosion
• Round 2: Napster's demise
• Round 3: P2P, Kazaa, BitTorrent
• Round 4: RIAA sues the punters
• Round 5: OSN + P2P, trust as referral

Trend 2: the Biggest Innovation in Conversation Since the Table
• Social software hits the mainstream:
– Friendster, LinkedIn, Orkut (On-line Social Networking, OSN)
– Blogs, Wikis, chat/IM, RSS/Atom
• How to run a better teleconference: add Wiki and IM

Trend 3: Wintel vs. Consumer Electronics in the Home
• The TV, cable and satellite, DVD, hi-fi, radio and TiVo of several years' time will probably run from a single PC (which will also do web, email, ...)
• There will be a battle between Wintel, offering high-quality gaming and full-blown Windows, and more conventional consumer electronics approaches based on Linux and cheap hardware
• The latter can probably capture some significant market share, having advantages such as: no viruses; better stability; cheap hardware; multi-user functions; fast boot; quiet running...

What if...?
• What if these three trends combine? What if we get widespread open platform consumer electronics + OSN + P2P file sharing?
– Ubiquitous on-line communities centred on shared content, with a working model of trust as referral
• What if semantic technology provides the means of organising and interlinking the cross-over between TV and the web?
– Killer application for OSN
– Bandwidth sales for cable companies
– Antidote to DRM

Semantic TV: Features
• NGTivo will not record what you ask, but will record a week of 60 channels and allow you to browse! Then we're starting to have DL type problems.
• DIY SInS:
– structured information spaces
– technology trickle-down effect: everyone's a systems analyst
– semantic wiki: the simplest shared semantics that could possibly work
• Coordinated web/desktop search; temporal search
• KIM/SWAN (pre-announced shows; text metrics / shows watched; semantic search)
• SWAN re-defined as web mining for TV-related facts (OBIE for scraping)
• Celebrity-based indexing
• How do you say "record all the programmes with actor X after 2001 not set in Europe"? SemTV answer: "record all the programmes with actor X after 2001 not set in Europe" (This is done by CLIE and example-based authoring, not dialogue processing, QA or NLDB.)
• Social networking based on TV preferences
• Recommender
• Video OCR
• Retro and multi-player/low graphics games

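To give a feel for the controlled-language idea in the "record all the programmes..." bullet, the sketch below maps that one sentence shape onto a structured filter. It is a toy under stated assumptions: a single hard-coded regular expression stands in for a controlled language, and the `Programme` record, its fields and the sample listings are invented. It does not reproduce GATE's CLIE.

```python
import re
from dataclasses import dataclass


@dataclass
class Programme:
    title: str
    actors: set
    year: int
    setting: str  # region in which the programme is set


# One fixed sentence pattern; a real controlled language would have a grammar.
PATTERN = re.compile(
    r"record all the programmes with actor (?P<actor>.+?) "
    r"after (?P<year>\d{4}) not set in (?P<place>.+)$"
)


def parse(command):
    """Turn a command in the fixed pattern into (actor, year, place)."""
    m = PATTERN.match(command.strip())
    if m is None:
        raise ValueError("command is outside the controlled language")
    return m["actor"], int(m["year"]), m["place"]


def select(programmes, command):
    """Titles of programmes matching the parsed constraints."""
    actor, year, place = parse(command)
    return [
        p.title
        for p in programmes
        if actor in p.actors and p.year > year and p.setting != place
    ]


# Hypothetical listings.
PROGRAMMES = [
    Programme("A", {"X"}, 2003, "Asia"),
    Programme("B", {"X"}, 1999, "Asia"),
    Programme("C", {"X"}, 2004, "Europe"),
    Programme("D", {"Y"}, 2005, "Asia"),
]

if __name__ == "__main__":
    cmd = "record all the programmes with actor X after 2001 not set in Europe"
    print(select(PROGRAMMES, cmd))
```

Because the sentence must fit the pattern exactly, anything outside it is rejected rather than guessed at, which is the trade-off a controlled language makes against free-form dialogue processing.
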
Summary
• GATE:
– de facto standard for HLT R&D
– emerging substrate for HLT in knowledge technologies
– world-class IE
• SWAN: scaling up OBIE
• Semantic TV: coming soon, act now!
More information: http://gate.ac.uk/


