0cbb0ef3794a05a3a0443b71eb027577.ppt
- Number of slides: 50
Combining Ontologies and Natural Language Processing for Knowledge Retrieval
Franz J. Kurfess
Computer Science Department, California Polytechnic State University
San Luis Obispo, CA, U.S.A.
Natural Language Processing
NLP Influences
❖ Linguistics
  - computational linguistics
❖ Computer Science
  - theory: formal grammars, languages, models
  - implementation
❖ Cognitive Science
❖ Information Science
❖ Mathematics
  - statistics
http://isquared.wordpress.com/2011/03/31/the-role-of-natural-language-processing-in-information-retrieval/
© Franz J. Kurfess
NLP Approaches
❖ symbolic
  - complex models based on grammatical structures
  - linguistic foundations
  - often built using rule-based systems
  - very knowledge-intensive
❖ statistical
  - derived from analyses of large natural language corpora
  - relies strongly on the likelihood of words appearing in close proximity
  - very data-intensive
❖ hybrid
  - combination of symbolic and statistical approaches
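The statistical approach can be illustrated with a minimal sketch that estimates how likely one word is to follow another from raw counts. The function name `bigram_probabilities` and the toy corpus are assumptions made for illustration; real systems use far larger corpora and smoothing.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Maximum-likelihood estimate of P(second | first) for adjacent words."""
    # Count every word that has a successor, and every adjacent pair.
    first_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    return {
        (first, second): count / first_counts[first]
        for (first, second), count in pair_counts.items()
    }


# Toy corpus, invented for this example.
corpus = "the cat sat on the mat the cat slept".split()
probs = bigram_probabilities(corpus)
print(probs[("the", "cat")])  # "cat" follows "the" in 2 of 3 cases
```

Without smoothing, any word pair that never occurs in the corpus gets no entry at all, which is one reason this approach is described as very data-intensive.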
NLP Levels
❖ syntactic analysis (parsing)
  - analyzing statements with respect to a grammar
  - consideration of probabilities with respect to a corpus
  - => probabilistic context-free grammar (PCFG)
❖ semantic interpretation
  - extension of context-free grammars
  - => lexicalized PCFGs
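In a PCFG, each grammar rule carries a probability, and the probability of a parse is the product of the probabilities of the rules it uses. The sketch below assumes a made-up toy grammar (the rule set and its numbers are not from the slides) and leaves out lexical rules, so part-of-speech leaves count as probability 1.

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# The probabilities for each left-hand side sum to 1.
PCFG = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
}


def tree_probability(tree):
    """Probability of a parse tree given as (label, child, child, ...).

    Leaves are part-of-speech tags written as plain strings.
    """
    if isinstance(tree, str):
        return 1.0  # lexical rules are omitted in this sketch
    label, *children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_probability = PCFG[(label, child_labels)]
    return rule_probability * prod(tree_probability(c) for c in children)


# Parse of a sentence shaped like "the dog chases cats".
parse = ("S", ("NP", "Det", "N"), ("VP", "V", ("NP", "N")))
print(tree_probability(parse))  # 1.0 * 0.6 * 0.7 * 0.4
```

When a sentence has several possible parses, a parser built on this idea would prefer the one with the highest product.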
Word Level
❖ lexical analysis (tokenization)
  - identification of words
❖ stop word removal
  - sometimes stop words aren’t stop words
    - The Who, The
    - To be or not to be
❖ stemming
  - eliminate variations of the same word
    - singular, plural, tenses
❖ morphology
  - prefixes, postfixes
  - compound words
    - Bayerischer Tiefbaugenossenschaftsangestelltengewerkschaftspräsident
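The word-level steps can be sketched in a few lines. This is a minimal illustration only: the stop word list is a small assumed subset, and `naive_stem` is a crude suffix stripper written for this example, not an established stemming algorithm.

```python
import re

# Illustrative subset; real stop word lists are much longer.
STOP_WORDS = {"the", "a", "an", "to", "be", "or", "not", "of", "and"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# The band name loses its article, and the Hamlet quote disappears entirely.
print(remove_stop_words(tokenize("The Who")))             # ['who']
print(remove_stop_words(tokenize("To be or not to be")))  # []
print(naive_stem("parsing"), naive_stem("parsed"))        # pars pars
```

The two stop word examples show why blind removal is risky: the quote is reduced to nothing, so a query for it would match no index terms.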
Sentence Level
❖ part of speech (POS) tagging
  - mapping of words to roles in a sentence
  - noun, verb, determiner
❖ sentence boundaries
  - usually punctuation, but plenty of exceptions
  - Mr. Jones shouted “Stop!” - but to no avail...
❖ parsing
  - possible syntactical mappings of a sentence to a grammar
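The sentence boundary problem shows up as soon as punctuation alone is used to split text. The following sketch compares a naive splitter with one that consults a small abbreviation list; the list and both function names are assumptions made for this example, and real boundary detectors handle many more cases.

```python
import re

# Illustrative abbreviation list; a real one would be far longer.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}


def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_sentences(text):
    """Like naive_split, but re-join pieces that end in a known abbreviation."""
    sentences, buffer = [], ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        if buffer.split()[-1] not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


text = 'Mr. Jones shouted "Stop!" - but to no avail.'
print(len(naive_split(text)))      # 2: wrongly split after "Mr."
print(len(split_sentences(text)))  # 1
```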
Ambiguity
❖ may appear on several levels
  - word level
    - homonyms: different words spelled or pronounced the same way
    - synonyms: different words with the same or similar meanings
NLP Models
❖ acoustic model
  - speech patterns
  - likelihood of a particular sequence of sounds
❖ language model
  - likelihood of a certain string of words
❖ mental model
  - intention of the speaker to communicate a fact to the listener
❖ world model
  - corresponding proposition in the real world
NLP Problems
❖ Synonymy
  - one concept => many words
❖ Polysemy
  - one word => many concepts
❖ Word grouping
  - order of words
❖ Flexibility
  - many ways to express an idea
❖ Context
  - time, place, intent, background knowledge, assumptions
❖ Language variations
  - evolution over time, local dialects, domain-specific terms and phrases
❖ Figures of speech
  - metaphors, analogies
http://isquared.wordpress.com/2011/03/31/the-role-of-natural-language-processing-in-information-retrieval/
Text Analytics
❖ extraction of structure and meaning from text documents
  - most text documents are unstructured or only loosely structured
❖ techniques used
  - linguistic
  - analytical
  - predictive
Named Entity Recognition (NER)
Named Entity Recognition
❖ identifies references to entities with names in text documents
  - people, locations, events
  - special data types like ZIP codes, dates, times
❖ distinction between different types of terms
  - noise
    - stop words and utterances that are not meaningful
  - index terms
    - meaningful words
  - entities
    - objects or concepts
    - typically denoted by nouns or placeholders
  - named entities
    - labelled to make them distinguishable
Named Entity Extraction
❖ conversion of information about named entities in text documents into a format suitable for computers
  - data base, knowledge base, RDF repository
❖ often combined with named entity recognition
Named Entity Extraction Techniques
❖ patterns
  - regular expressions
❖ dictionary
❖ context
❖ corpus
❖ document structure
  - headings, captions
❖ formatting
  - style, font, CSS
❖ appearance
http://www.enterprisesearchblog.com/2008/06/13-powerful-ent.html
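Two of the techniques listed, patterns and dictionaries, are simple enough to sketch together. The regular expressions, the gazetteer entries, and the sample sentence below are all assumptions chosen for illustration; production extractors add context, corpus statistics, and document structure on top of this.

```python
import re

# Pattern-based extraction for special data types (illustrative formats only).
PATTERNS = {
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),   # US ZIP or ZIP+4
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO-style date
}

# Dictionary (gazetteer) lookup for known names; a hypothetical two-entry list.
GAZETTEER = {
    "San Luis Obispo": "LOCATION",
    "Franz J. Kurfess": "PERSON",
}


def extract_entities(text):
    """Return (surface form, label, offset) tuples in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        found += [(m.group(), label, m.start()) for m in pattern.finditer(text)]
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.group(), label, m.start()))
    return sorted(found, key=lambda entity: entity[2])


# Made-up sample sentence.
sample = "Franz J. Kurfess teaches in San Luis Obispo, CA 93407; slides dated 2011-05-02."
for surface, label, offset in extract_entities(sample):
    print(label, surface, offset)
```

Keeping the character offset with each match makes it possible to write the result back as an annotation on the source document, or to convert it into records for a repository.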
Named Entity Consolidation
❖ unification of multiple references to the same named entity
  - within a single document
  - across multiple documents
❖ references within sentences
❖ spelling variations
❖ synonyms
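Consolidation of spelling variations can be approximated by normalizing each mention and then mapping known aliases to one canonical key. The alias table and the function names in this sketch are hypothetical; resolving references within sentences (such as pronouns) is a harder problem that it does not attempt.

```python
import re
import unicodedata

# Hypothetical alias table: normalized variant -> canonical key.
ALIASES = {
    "f kurfess": "franz j kurfess",
    "franz kurfess": "franz j kurfess",
}


def normalize(mention):
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", mention)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def consolidate(mentions):
    """Group mentions under the canonical key they resolve to."""
    groups = {}
    for mention in mentions:
        key = normalize(mention)
        key = ALIASES.get(key, key)
        groups.setdefault(key, []).append(mention)
    return groups


groups = consolidate(["Franz J. Kurfess", "Franz Kurfess", "F. Kurfess", "York Sure"])
for key, members in groups.items():
    print(key, members)
```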
NER Performance
❖ systems are often domain-specific
  - limited vocabulary
  - reduced ambiguity
❖ in some contexts, close to human performance
  - 93.39% F-measure for the best system at MUC-7
  - human annotators around 97%
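The F-measure quoted here combines precision and recall into a single score. As a reminder of how it is computed, here is a short sketch using the standard definition; the counts in the example are invented and do not correspond to the MUC-7 results.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # avoids division by zero when nothing was found correctly
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_squared = beta ** 2
    return (1 + beta_squared) * precision * recall / (beta_squared * precision + recall)


# Invented counts: 90 entities found correctly, 10 spurious, 10 missed.
print(f_measure(90, 10, 10))
```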
NER Tools and Systems
❖ OpenCalais (http://www.opencalais.com/)
  - NER system available from Thomson Reuters
  - applications
    - semantic analysis of Web sites and blog posts
    - CBS Interactive/CNET
    - Wordpress, Drupal, Zemanta
  - API for content submission
    - http://www.opencalais.com/documentation/calais-web-service-api
❖ Newssift - Financial Times
  - short-lived, apparently not commercially viable
❖ OpenText (http://www.opentext.com/)
  - framework for industry-specific enterprise content management solutions
  - acquired Nstein’s Text Mining Engine (TME)
  - http://www.nstein.com/en/resources/product-documentation/
Repositories
Repository Types
❖ index
  - list of occurrences of strings that point to the original documents
❖ file system
❖ (relational) data base
  - set of records based on one or more tables
❖ non-relational data base (NoSQL)
  - more flexible internal structure
  - often used for Web-scale shallow objects
  - examples: Hadoop, HBase, Cassandra
❖ RDF repository (triple store)
  - low-level storage facility
  - relies on simple statements that connect two entities through a relation
    - object, attribute, value
    - subject, predicate, object
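The idea behind a triple store, simple statements that connect two entities through a relation, can be shown with an in-memory sketch. `TripleStore` and its methods are hypothetical names invented here and do not mirror the API of any real RDF repository; the sample statements are loosely modelled on the ontology slides.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None acts as a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            triple
            for triple in self.triples
            if all(p is None or p == value for p, value in zip(pattern, triple))
        )


store = TripleStore()
store.add("Student", "is_a", "Person")
store.add("Researcher", "is_a", "Person")
store.add("Person", "is_a", "Object")

# Which concepts are direct specializations of Person?
print(store.match(predicate="is_a", obj="Person"))
```

Pattern matching with wildcards is the basic query operation; real stores add indexes over subject, predicate, and object so that it does not require a full scan.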
Repository Examples
❖ Wikipedia
❖ DBpedia
❖ Freebase
❖ Cyc
RDF Repositories
❖ also known as “triple stores”
❖ often combined with ontology managers
❖ see W3C Web site for an overview
  - http://www.w3.org/wiki/LargeTripleStores
  - http://www.w3.org/wiki/SemanticWebTools#RDF_Triple_Store_Systems
❖ openRDF.org
❖ OpenLink Virtuoso
❖ BigOWLIM
❖ AllegroGraph (Franz Inc)
❖ Oracle Spatial 11g
Triple Store Evaluation
❖ functional evaluation
❖ performance evaluation
❖ see Triple Store Evaluation Analysis Report by Revelytix, Inc.
  - http://www.revelytix.com/sites/default/files/TripleStoreEvaluationAnalysisResults.pdf
  - report states “Confidential, do not distribute without permission of Revelytix” but is available on their Web site
Ontologies
Ontology
❖ examines the relationships between words, and the corresponding concepts and objects
  - in practice, it often combines aspects of
    - thesaurus, dictionary, taxonomy, concept map, topic map
  - frequently uses a graph-based visual representation to indicate relationships between words
❖ used to identify and specify a vocabulary for a particular subject or task
From Taxonomies to Ontologies
❖ Taxonomy
  - strict hierarchy
❖ Thesaurus
  - hierarchy plus synonyms and other relations between words
❖ Topic Map
  - additional relations between concepts
    - across the hierarchy
  - properties of concepts
❖ Ontology
  - rules specifying the structure of the concept space
  - instances of concepts
Taxonomy
[Figure: hierarchy over the concepts Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic]
Taxonomy := segmentation, classification and ordering of elements into a classification system according to their relationships between each other
[Hotho, Sure, 2003]
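A strict hierarchy of this kind can be held as a child-to-parent map, which makes "is this a kind of that?" a simple walk up the tree. The sketch below is only an illustration: the parent assigned to each concept is an assumption read off the flattened figure, and the function names are invented.

```python
# Child -> parent; each concept has exactly one parent (strict hierarchy).
# The placement of each concept is an assumed reading of the figure.
PARENT = {
    "Person": "Object",
    "Topic": "Object",
    "Document": "Object",
    "Student": "Person",
    "Researcher": "Person",
    "Doctoral Student": "Student",
}


def ancestors(concept):
    """Return the chain of broader concepts, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def is_a(concept, broader):
    """True if `broader` lies somewhere above `concept` in the hierarchy."""
    return broader in ancestors(concept)


print(ancestors("Doctoral Student"))  # ['Student', 'Person', 'Object']
print(is_a("Researcher", "Topic"))    # False
```

A single-parent map cannot express synonyms or cross links, which is the limitation that the thesaurus and topic map address.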
Thesaurus
[Figure: the concept hierarchy (Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic) with added "synonym" and "similar" links]
• Terminology for a specific domain
• Graph with primitives, 2 fixed relationships (similar, synonym), sometimes additional relationships (antonym, homonym, ...)
• originated from bibliography
[Hotho, Sure, 2003]
Topic Map
[Figure: the concept graph with the relations knows, described_in, writes, synonym and similar, and the properties Tel and Affiliation]
• Topics (nodes), relationships and occurrences (to documents)
• ISO standard
• typically for navigation and visualisation
[Hotho, Sure, 2003]
Ontology
[Figure: the concept graph with the relations is_a, knows, described_in, writes, subTopicOf, similar and instance_of; instance York Sure with Tel +49 721 608 6592 and Affiliation AIFB]
Rules
  - T described_in D => D is_about T
  - P writes D and D is_about T => P knows T
• Representation Language: Predicate Logic (F-Logic)
• Standards: RDF(S); upcoming standard: OWL
[Hotho, Sure, 2003]
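The rule box on this slide appears to say two things: a topic described in a document is what that document is about, and a person who writes a document about a topic knows that topic. Under that reading, which is an assumption because the figure's arrows are not preserved, rule application can be sketched as simple forward chaining over triples. The document name "paper1" is a hypothetical placeholder.

```python
def infer(facts):
    """Apply the two assumed rules until no new facts appear."""
    facts = set(facts)
    while True:
        new = set()
        # Rule 1: T described_in D  =>  D is_about T
        for subject, predicate, obj in facts:
            if predicate == "described_in":
                new.add((obj, "is_about", subject))
        # Rule 2: P writes D and D is_about T  =>  P knows T
        for person, predicate, doc in facts:
            if predicate != "writes":
                continue
            for doc2, predicate2, topic in facts:
                if predicate2 == "is_about" and doc2 == doc:
                    new.add((person, "knows", topic))
        if new <= facts:  # fixed point reached
            return facts
        facts |= new


# "paper1" is a made-up document identifier.
closure = infer({
    ("Ontology", "described_in", "paper1"),
    ("York Sure", "writes", "paper1"),
})
print(("York Sure", "knows", "Ontology") in closure)  # True
```

The derived "knows" fact needs two passes, because rule 2 depends on the "is_about" fact that rule 1 produces first.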
Combining Ontologies and NLP
Benefits
❖ combination of “rich” knowledge representation with (semi-)automated knowledge acquisition
  - ontology provides the domain model
  - NLP facilitates extraction of knowledge from text documents
❖ improved NLP performance
  - resolution of ambiguities
  - richer context models
Problems
❖ lack of ontologies
❖ computational overhead
❖ semantic gap
❖ compatibility between different approaches
  - e.g. domain model <=> statistical NLP
Examples
Research
❖ AKT - Advanced Knowledge Technologies
❖ SACOT Knowledge Environment Framework
ANNIE - Open Source Information Extraction
❖ AKT - Advanced Knowledge Technologies
  - http://www.aktors.org/technologies/retrieval/
  - http://www.aktors.org/technologies/annie/intro-image.png
❖ SACOT Knowledge Environment Framework
❖ https://analysis.mitre.org/proceedings/Final_Papers_Files/359_Camera_Ready_Paper.pdf
Artequakt
❖ [Alani et al., 2003] Automatic ontology-based knowledge extraction from Web documents. Intelligent Systems, IEEE, 18(1):14–21.
Artequakt Architecture
[Alani et al., 2003]
Artequakt Knowledge Extraction
(a) extraction results
(b) knowledge triples
[Alani et al., 2003]
Artequakt Ontology
❖ automatic ontology population
(a) XML file of extracted information
(b) corresponding instances and relationships
[Alani et al., 2003]
Artequakt Query Templates
❖ common queries that are resolved into the final text
[Alani et al., 2003]
Artequakt Biography Result
❖ final rendered biography
  - based on the previously extracted information
[Alani et al., 2003]
Textrunner
❖ demo system for Open Information Extraction
  - http://ai.cs.washington.edu/projects/open-information-extraction
Textrunner Search
❖ Query: “What kills bacteria?”
[Etzioni et al., 2008]
Commercial Products
❖ status and background often unclear
  - lack of information
Conclusions
References
References 2011-05-02
[Alani et al., 2003] Alani, H., Kim, S., Millard, D., Weal, M., Hall, W., Lewis, P., and Shadbolt, N. (2003). Automatic ontology-based knowledge extraction from web documents. Intelligent Systems, IEEE, 18(1):14–21.
[Aguado de Cea et al., 2002] Aguado de Cea, G., Álvarez-de Mon, I., Pareja-Lora, A., and Plaza-Arteche, R. (2002). Rdf(s)/xml linguistic annotation of semantic web pages. In Proceedings of the 2nd workshop on NLP and XML - Volume 17, NLPXML ’02, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Apache, 2010] Apache (2010). Apache UIMA. http://uima.apache.org.
[Assal et al., 2010] Assal, H., Seng, J., Kurfess, F., Schwarz, E., and Pohl, K. (2010). Partnering enhanced-nlp with semantic analysis in support of information extraction. In 2nd International Workshop on Ontology-Driven Software Engineering (ODiSE 2010) at the ACM SPLASH 2010 Conference, Reno, Nevada, U.S.A.
[Banko et al., 2007] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In IJCAI’07: Proceedings of the 20th international joint conference on Artificial intelligence, pages 2670–2676, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Bender et al., 2003] Bender, O., Och, F. J., and Ney, H. (2003). Maximum entropy models for named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 148–151, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Borthwick et al., 1998] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 152–160.
[Bradford, 2006] Bradford, R. (2006). Relationship Discovery in Large Text Collections Using Latent Semantic Indexing. In SIAM conference on Data Mining, workshop on Link Analysis, Counterterrorism and Security. DOI= http://www.siam.org/meetings/sdm06/workproceed/Link%20Analysis/15.pdf.
[Chernov et al., 2006] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. (2006). Extracting semantics relationships between wikipedia categories. In [Völkel and Schaffert, 2006].
[Chiticariu et al., 2010] Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., and Vaithyanathan, S. (2010). Domain adaptation of rule-based annotators for named-entity recognition tasks. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 1002–1012, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Cohen and Sarawagi, 2004] Cohen, W. W. and Sarawagi, S. (2004). Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 89–98, New York, NY, USA. ACM.
[Durville and Gandon, 2009] Durville, P. and Gandon, F. (2009). Filling the gap between web 2.0 technologies and natural language processing pipelines with semantic web. Advances in Semantic Processing, International Conference on, pages 109–112.
[Ekbal and Bandyopadhyay, 2009] Ekbal, A. and Bandyopadhyay, S. (2009). Improving the performance of a ner system by post-processing, context patterns and voting. In Li, W. and Mollá-Aliod, D., editors, Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy, volume 5459 of Lecture Notes in Computer Science, pages 45–56. Springer Berlin / Heidelberg.
[Esuli and Sebastiani, 2010] Esuli, A. and Sebastiani, F. (2010). Evaluating information extraction. In Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum, CLEF’10, pages 100–111, Berlin, Heidelberg. Springer-Verlag.
[Etzioni et al., 2008] Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web. Commun. ACM, 51:68–74.
[Gruhl et al., 2009] Gruhl, D., Nagarajan, M., Pieper, J., Robson, C., and Sheth, A. (2009). Context and domain knowledge enhanced entity spotting in informal text. In Bernstein, A., Karger, D., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., and Thirunarayan, K., editors, The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 260–276. Springer Berlin / Heidelberg.
[Kim and Sengupta, 2007] Kim, H. M. and Sengupta, A. (2007). Extracting knowledge from xml document repository: a semantic web-based approach. Inf. Technol. and Management, 8:205–221.
References cont. 2011-05-02
[Lawson et al., 2010] Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen-Yildiz, M. (2010). Annotating large email datasets for named entity recognition with mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 71–79, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Li and Tsai, 2010] Li, S.-T. and Tsai, F.-C. (2010). Constructing tree-based knowledge structures from text corpus. Applied Intelligence, pages 1–12. 10.1007/s10489-010-0243-2.
[Lin et al., 2010] Lin, C. X., Zhao, B., Weninger, T., Han, J., and Liu, B. (2010). Entity relation discovery from web tables and links. In WWW ’10: Proceedings of the 19th international conference on World wide web, pages 1145–1146, New York, NY, USA. ACM.
[Lin et al., 2009] Lin, T., Etzioni, O., and Fogarty, J. (2009). Identifying interesting assertions from the web. In CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 1787–1790, New York, NY, USA. ACM.
[Marrero et al., 2010] Marrero, M., Urbano, J., Morato, J., and Sánchez-Cuadrado, S. (2010). On the definition of patterns for semantic annotation. In Proceedings of the third workshop on Exploiting semantic annotations in information retrieval, ESAIR ’10, pages 15–16, New York, NY, USA. ACM.
[OpenCalais, 2010] OpenCalais (2010). Opencalais. http://www.opencalais.com.
[OpenNLP, 2010] OpenNLP (2010). OpenNLP. http://opennlp.sourceforge.net/.
[Preda et al., 2010] Preda, N., Kasneci, G., Suchanek, F. M., Neumann, T., Yuan, W., and Weikum, G. (2010). Active knowledge: dynamically enriching rdf knowledge bases by web services. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data, pages 399–410, New York, NY, USA. ACM.
[Reeve and Han, 2005] Reeve, L. and Han, H. (2005). Survey of semantic annotation platforms. In Proceedings of the 2005 ACM symposium on Applied computing, SAC ’05, pages 1634–1638, New York, NY, USA. ACM.
[Roberson and Dicheva, 2007] Roberson, S. and Dicheva, D. (2007). Semi-automatic ontology extraction to create draft topic maps. In Proceedings of the 45th annual southeast regional conference, ACM-SE 45, pages 100–105, New York, NY, USA. ACM.
[Russell and Norvig, 2009] Russell, S. and Norvig, P. (2009). Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition.
[Sekine, 2007] Sekine, S. (January 2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30:3–26.
[Seon et al., 2001] Seon, C., Ko, Y., Kim, J., and Seo, J. (2001). Named Entity Recognition using Machine Learning Methods and Pattern-Selection Rules. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, pages 229–236. Citeseer.
[Stanford NER, 2010] Stanford NER (2010). Stanford named entity recognizer. http://nlp.stanford.edu/software/CRF-NER.shtml.
[Tao et al., 2008] Tao, X., Li, Y., Zhong, N., and Nayak, R. (2008). An ontology-based framework for knowledge retrieval. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT ’08. IEEE/WIC/ACM International Conference on, volume 1, pages 510–517.
[Vargas-Vera and Celjuska, 2004] Vargas-Vera, M. and Celjuska, D. (2004). Event recognition on news stories and semi-automatic population of an ontology. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, WI ’04, pages 615–618, Washington, DC, USA. IEEE Computer Society.
[Völkel and Schaffert, 2006] Völkel, M. and Schaffert, S., editors (2006). Proceedings of the First Workshop on Semantic Wikis – From Wiki To Semantics, Workshop on Semantic Wikis. ESWC 2006.
[Vrandecic and Krötzsch, 2006] Vrandecic, D. and Krötzsch, M. (2006). Reusing ontological background knowledge in semantic wikis. In [Völkel and Schaffert, 2006].
[Wang, 2003] Wang, Z.-H. (2003). Name entity recognition using language models. In Automatic Speech Recognition and Understanding, 2003. ASRU ’03. 2003 IEEE Workshop on, pages 554–559.
[Weikum and Theobald, 2010] Weikum, G. and Theobald, M. (2010). From information to knowledge: harvesting entities and relationships from web sources. In PODS ’10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data, pages 65–76, New York, NY, USA. ACM.
[Wimalasuriya and Dou, 2010] Wimalasuriya, D. C. and Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science, 36(3):306–323.
[WordNet, 2010] WordNet (2010). Wordnet. http://wordnet.princeton.edu.
[Yates et al., 2007] Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M., and Soderland, S. (2007). Textrunner: open information extraction on the web. In NAACL ’07: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations on XX, pages 25–26, Morristown, NJ, USA. Association for Computational Linguistics.
[Zhu et al., 2009] Zhu, J., Nie, Z., Liu, X., Zhang, B., and rong Wen, J. (2009). Statsnowball: a statistical approach to extracting entity relationships. In WWW 2009 MADRID - Track: Data Mining / Session: Statistical Meth