Combining Ontologies and Natural Language Processing for Knowledge Retrieval

Franz J. Kurfess, Computer Science Department, California Polytechnic State University, San Luis Obispo, CA, U.S.A.

Natural Language Processing

NLP Influences ❖ Linguistics v computational linguistics ❖ Computer Science v theory: formal grammars, languages, models v implementation ❖ Cognitive Science ❖ Information Science ❖ Mathematics v statistics http://isquared.wordpress.com/2011/03/31/the-role-of-natural-language-processing-in-information-retrieval/ © Franz J. Kurfess

NLP Approaches ❖ symbolic v complex models based on grammatical structures v linguistic foundations v often built using rule-based systems v very knowledge-intensive ❖ statistical v derived from analyses of large natural language corpora v strongly relies on likelihoods of words appearing in close proximity v very data-intensive ❖ hybrid v combination of symbolic and statistical approaches
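To make the statistical approach concrete, the following minimal sketch estimates how likely one word is to follow another from adjacent word pairs in a corpus. The toy corpus and the function name `bigram_probabilities` are illustrative assumptions, not part of the original slides; real systems use far larger corpora and smoothing.

```python
from collections import Counter

def bigram_probabilities(tokens):
    """Maximum-likelihood estimate of P(w2 | w1) from adjacent word pairs."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count each word only in positions where it has a successor.
    first_counts = Counter(tokens[:-1])
    return {
        (w1, w2): count / first_counts[w1]
        for (w1, w2), count in pair_counts.items()
    }

# Toy corpus (assumption for illustration only).
corpus = "the cat sat on the mat the cat slept".split()
probs = bigram_probabilities(corpus)
print(probs[("the", "cat")])  # "the" is followed by "cat" in 2 of its 3 occurrences
```

The data-intensive nature of the approach shows up directly: with so little text, most word pairs never occur and receive no estimate at all.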

NLP Levels ❖ syntactic analysis (parsing) v analyzing statements with respect to a grammar v consideration of probabilities with respect to a corpus v => probabilistic context-free grammar (PCFG) ❖ semantic interpretation v extension of context-free grammars v => lexicalized PCFGs
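As a rough illustration of how a PCFG attaches probabilities to a parse, the sketch below multiplies the probabilities of all rules used in a parse tree. The grammar, its probabilities, and the names `RULES` and `tree_probability` are hypothetical; in practice the rule probabilities would be estimated from a parsed corpus.

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# Probabilities for each left-hand side sum to 1.
RULES = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("VP", ("V", "NP")): 0.7,
    ("VP", ("V",)): 0.3,
    ("Det", ("the",)): 1.0,
    ("N", ("dog",)): 0.5,
    ("N", ("cat",)): 0.5,
    ("V", ("chased",)): 1.0,
}

def tree_probability(tree):
    """Probability of a parse tree: the product of its rule probabilities.

    A tree is a tuple (label, child, ...); a leaf child is a plain string.
    """
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rule_p = RULES[(label, rhs)]
    subtrees = [c for c in children if not isinstance(c, str)]
    return rule_p * prod(tree_probability(c) for c in subtrees)

parse = ("S",
         ("NP", ("Det", "the"), ("N", "dog")),
         ("VP", ("V", "chased"),
                ("NP", ("Det", "the"), ("N", "cat"))))
print(tree_probability(parse))
```

When a sentence has several possible parses, a parser can compute this value for each and prefer the most probable one.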

Word Level ❖ lexical analysis (tokenization) v identification of words ❖ stop word removal v sometimes stop words aren’t stop words v The Who, The v To be or not to be ❖ stemming v eliminate variations of the same word v singular, plural, tenses ❖ morphology v prefixes, postfixes v compound words v Bayerischer Tiefbaugenossenschaftsangestelltengewerkschaftspräsident
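The word-level steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the stop word list is a tiny hand-picked set, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm; all names are hypothetical.

```python
import re

# Tiny illustrative stop word list (assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "of", "to", "be", "or", "not", "and", "is", "are"}

def tokenize(text):
    """Lexical analysis: lowercase the text and pick out alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def naive_stem(token):
    """Strip a few common English suffixes, keeping at least three letters."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def index_terms(text):
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]

print(index_terms("The parsers are parsing the parsed documents"))
print(index_terms("To be or not to be"))
```

The second call returns an empty list: every word of the quotation is on the stop list, which is exactly the case where stop words are not really stop words.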

Sentence Level ❖ part of speech (POS) tagging v mapping of words to roles in a sentence v noun, verb, determiner ❖ sentence boundaries v usually punctuation, but plenty of exceptions v Mr. Jones shouted “Stop!” - but to no avail... ❖ parsing v possible syntactical mappings of a sentence to a grammar
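The sentence boundary problem can be illustrated with a simple heuristic splitter. This is only a sketch under assumptions: the abbreviation list is a small hypothetical sample, and the rule (split after ., ! or ? when followed by whitespace and a capital letter) will still fail on many real texts.

```python
import re

# Illustrative abbreviation list (assumption); real systems use much larger ones.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}

def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace and a capital letter,
    unless the token ending in the punctuation is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        candidate = text[start:match.end()].strip()
        last_token = candidate.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # "Mr. Jones": not a sentence boundary
        sentences.append(candidate)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = 'Mr. Jones shouted "Stop!" - but to no avail. He left.'
for sentence in split_sentences(text):
    print(sentence)
```

The period after "Mr" is skipped because of the abbreviation list, and the exclamation mark inside the quotation is skipped because it is not followed by whitespace and a capital letter.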

Ambiguity ❖ may appear on several levels v word level v homonyms v different words pronounced the same way v synonyms v different words with the same or similar meanings

NLP Models ❖ acoustic model v speech patterns v likelihood of a particular sequence of sounds ❖ language model v likelihood of a certain string of words ❖ mental model v intention of the speaker to communicate a fact to the listener ❖ world model v corresponding proposition in the real world

NLP Problems ❖ Synonymy v one concept => many words ❖ Polysemy v one word => many concepts ❖ Word grouping v order of words ❖ Flexibility v many ways to express an idea ❖ Context v time, place, intent, background knowledge, assumptions ❖ Language variations v evolution over time, local dialects, domain-specific terms and phrases ❖ Figures of speech v metaphors, analogies http://isquared.wordpress.com/2011/03/31/the-role-of-natural-language-processing-in-information-retrieval/

Text Analytics ❖ extraction of structure and meaning from text documents v most text documents are unstructured or only loosely structured ❖ techniques used v linguistic v analytical v predictive

Named Entity Recognition (NER)

Named Entity Recognition ❖ identifies references to entities with names in text documents v people, locations, events v special data types like ZIP codes, dates, times ❖ distinction between different types of terms v noise v stop words and utterances that are not meaningful v index terms v meaningful words v entities v objects or concepts v typically denoted by nouns or placeholders v named entities v labelled to make them distinguishable

Named Entity Extraction ❖ conversion of information about named entities in text documents into a format suitable for computers v database, knowledge base, RDF repository ❖ often combined with named entity recognition

Named Entity Extraction Techniques ❖ patterns v regular expressions ❖ dictionary ❖ context ❖ corpus ❖ document structure v headings, captions ❖ formatting v style, font, CSS ❖ appearance http://www.enterprisesearchblog.com/2008/06/13-powerful-ent.html
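Two of the techniques listed above, patterns and dictionaries, can be sketched together in a few lines. The regular expressions, the gazetteer entries, and the labels are illustrative assumptions chosen for the example; they are not taken from any of the tools discussed in these slides.

```python
import re

# Pattern-based extraction: illustrative regular expressions (assumptions).
PATTERNS = {
    "ZIP": re.compile(r"\b\d{5}(?:-\d{4})?\b"),      # US ZIP or ZIP+4
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),    # ISO-style date
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),        # hh:mm
}

# Dictionary-based extraction: a tiny hypothetical gazetteer.
GAZETTEER = {
    "San Luis Obispo": "LOCATION",
    "Thomson Reuters": "ORGANIZATION",
}

def extract_entities(text):
    """Return (surface form, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        found += [(m.start(), m.group(), label) for m in pattern.finditer(text)]
    for name, label in GAZETTEER.items():
        found += [(m.start(), name, label)
                  for m in re.finditer(re.escape(name), text)]
    return [(surface, label) for _, surface, label in sorted(found)]

text = "The talk in San Luis Obispo, CA 93407 starts at 10:30 on 2011-05-02."
print(extract_entities(text))
```

The output pairs are already close to the "format suitable for computers" mentioned for named entity extraction, for example as rows of a database table.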

Named Entity Consolidation ❖ unification of multiple references to the same named entity v within a single document v across multiple documents ❖ references within sentences ❖ spelling variations ❖ synonyms
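A minimal sketch of consolidation, assuming that spelling variations can be handled by normalizing case and punctuation and that synonyms are listed in an alias table. The alias table, the document identifiers, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical alias table mapping normalized synonyms to a canonical name.
ALIASES = {"ibm": "international business machines"}

def canonical_key(mention):
    """Normalize spelling variations, then resolve known synonyms."""
    key = re.sub(r"[^a-z0-9 ]", "", mention.lower())  # drop punctuation
    key = re.sub(r"\s+", " ", key).strip()            # collapse whitespace
    return ALIASES.get(key, key)

def consolidate(mentions):
    """Group (document id, mention) pairs that refer to the same entity."""
    groups = defaultdict(list)
    for doc_id, mention in mentions:
        groups[canonical_key(mention)].append((doc_id, mention))
    return dict(groups)

mentions = [
    ("doc1", "I.B.M."),
    ("doc1", "IBM"),
    ("doc2", "International  Business Machines"),
    ("doc2", "Thomson Reuters"),
]
for entity, refs in consolidate(mentions).items():
    print(entity, refs)
```

References within sentences, such as pronouns, need coreference resolution and are beyond what this kind of string normalization can do.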

NER Performance ❖ systems are often domain-specific v limited vocabulary v reduced ambiguity ❖ in some contexts, close to human performance v 93.39% F-measure for the best system at MUC-7 v human annotators around 97%
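For readers unfamiliar with the F-measure used in these figures, the sketch below computes it from counts of correct and incorrect entity decisions. The counts in the example are made up for illustration and are not MUC-7 data; the function assumes at least one true positive so that no division by zero occurs.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta is 1)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 90 entities found correctly, 10 spurious, 10 missed.
print(f_measure(90, 10, 10))
```

Because it is a harmonic mean, the F-measure is pulled toward the lower of precision and recall, so a system cannot score well by optimizing only one of them.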

NER Tools and Systems ❖ OpenCalais (http://www.opencalais.com/) v NER system available from Thomson Reuters v applications v semantic analysis of Web sites and blog posts v CBS Interactive/CNET v Wordpress, Drupal, Zemanta v API for content submission v http://www.opencalais.com/documentation/calais-web-service-api ❖ Newssift - Financial Times v short-lived, apparently not commercially viable ❖ OpenText (http://www.opentext.com/) v framework for industry-specific enterprise content management solution v acquired Nstein’s Text Mining Engine (TME) v http://www.nstein.com/en/resources/product-documentation/

Repositories

Repository Types ❖ Index v list of occurrences of strings that point to the original documents ❖ file system ❖ (relational) database v set of records based on one or more tables ❖ non-relational database (NoSQL) v more flexible internal structure v often used for Web-scale shallow objects v examples: Hadoop, HBase, Cassandra ❖ RDF repository (triple store) v low-level storage facility v relies on simple statements that connect two entities through a relation v object, attribute, value v subject, predicate, object
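The idea behind a triple store can be shown with a very small in-memory sketch: statements are (subject, predicate, object) tuples, and a query is a pattern in which unspecified positions act as wildcards. The class name `TripleStore`, its methods, and the sample statements are illustrative assumptions and do not reflect the API of any product named in these slides.

```python
class TripleStore:
    """Minimal in-memory store of (subject, predicate, object) statements."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return all triples matching the pattern; None is a wildcard."""
        pattern = (subject, predicate, obj)
        return sorted(
            t for t in self.triples
            if all(p is None or p == v for p, v in zip(pattern, t))
        )

store = TripleStore()
# Sample statements (assumptions for illustration).
store.add("Student", "is_a", "Person")
store.add("Researcher", "is_a", "Person")
store.add("Person", "writes", "Document")

print(store.match(predicate="is_a", obj="Person"))
```

A production triple store adds indexes over the three positions and a query language such as SPARQL, but the underlying data model is the same.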

Repository Examples ❖ Wikipedia ❖ DBpedia ❖ Freebase ❖ Cyc

RDF Repositories ❖ also known as “triple stores” ❖ often combined with ontology managers ❖ see W3C Web site for an overview v http://www.w3.org/wiki/LargeTripleStores v http://www.w3.org/wiki/SemanticWebTools#RDF_Triple_Store_Systems ❖ openRDF.org ❖ OpenLink Virtuoso ❖ BigOWLIM ❖ AllegroGraph (Franz Inc) ❖ Oracle Spatial 11g

Triple Store Evaluation ❖ functional evaluation ❖ performance evaluation ❖ see Triple Store Evaluation Analysis Report by Revelytix, Inc. v http://www.revelytix.com/sites/default/files/TripleStoreEvaluationAnalysisResults.pdf v report states “Confidential, do not distribute without permission of Revelytix” but is available on their Web site

Ontologies

Ontology ❖ examines the relationships between words, and the corresponding concepts and objects v in practice, it often combines aspects of v thesaurus, dictionary, taxonomy, concept map, topic map v frequently uses a graph-based visual representation to indicate relationships between words ❖ used to identify and specify a vocabulary for a particular subject or task

From Taxonomies to Ontologies ❖ Taxonomy v strict hierarchy ❖ Thesaurus v hierarchy plus synonyms and other relations between words ❖ Topic Map v additional relations between concepts across the hierarchy v properties of concepts ❖ Ontology v rules specifying the structure of the concept space v instances of concepts

Taxonomy [figure: hierarchy with the nodes Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic] ❖ Taxonomy := segmentation, classification and ordering of elements into a classification system according to their relationships between each other [Hotho, Sure, 2003]
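A strict hierarchy of this kind can be represented by giving each concept exactly one parent. The sketch below uses node names from the taxonomy figure, but the parent assignments are an assumed reading of the flattened figure, and the names `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
# Assumed parent links; the original figure layout was lost in extraction.
PARENT = {
    "Person": "Object",
    "Topic": "Object",
    "Document": "Object",
    "Student": "Person",
    "Researcher": "Person",
    "Doctoral Student": "Student",
    "PhD Student": "Student",
}

def ancestors(concept):
    """Walk up the hierarchy and return all more general concepts."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def is_a(concept, candidate):
    """True if candidate equals concept or is one of its ancestors."""
    return candidate == concept or candidate in ancestors(concept)

print(ancestors("Doctoral Student"))
print(is_a("Researcher", "Student"))
```

Since every concept has a single parent, the structure cannot express that "Doctoral Student" and "PhD Student" mean the same thing; that is what the synonym relation of a thesaurus adds.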

Thesaurus [figure: the taxonomy nodes Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic, with added "synonym" and "similar" links] ❖ terminology for a specific domain ❖ graph with primitives, 2 fixed relationships (similar, synonym), sometimes additional relationships (antonym, homonym, ...) ❖ originated from bibliography [Hotho, Sure, 2003]

Topic Map [figure: nodes Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic; relations knows, described_in, writes, synonym, similar; properties Tel, Affiliation] ❖ topics (nodes), relationships and occurrences (to documents) ❖ ISO standard ❖ typically for navigation and visualisation [Hotho, Sure, 2003]

Ontology [figure: nodes Object, Person, Topic, Document, Student, Researcher, Doctoral Student, PhD Student, Semantics, Ontology, F-Logic; relations is_a, knows, described_in, writes, similar, subTopicOf, instance_of; properties Tel, Affiliation; instance York Sure with Tel +49 721 608 6592 and Affiliation AIFB; rules over T described_in D, D is_about T, P writes D, P knows T] ❖ representation language: predicate logic (F-Logic) ❖ standards: RDF(S); upcoming standard: OWL [Hotho, Sure, 2003]
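What distinguishes an ontology from a topic map is the presence of rules that derive new statements. The sketch below applies two rules by forward chaining until nothing new is derived. The rules are one plausible reading of the flattened figure (T described_in D implies D is_about T; P writes D and D is_about T implies P knows T) and should be treated as an assumption, as should the sample facts and the function name `apply_rules`.

```python
def apply_rules(facts):
    """Forward chaining over (subject, relation, object) triples.

    Assumed rule 1: T described_in D          =>  D is_about T
    Assumed rule 2: P writes D, D is_about T  =>  P knows T
    """
    facts = set(facts)
    while True:
        derived = set()
        for topic, rel, doc in facts:
            if rel == "described_in":
                derived.add((doc, "is_about", topic))
        for person, rel1, doc in facts:
            if rel1 != "writes":
                continue
            for doc2, rel2, topic in facts:
                if rel2 == "is_about" and doc2 == doc:
                    derived.add((person, "knows", topic))
        if derived <= facts:       # fixed point: nothing new was derived
            return facts
        facts |= derived

# Sample facts (hypothetical document identifier "doc1").
facts = {
    ("F-Logic", "described_in", "doc1"),
    ("York Sure", "writes", "doc1"),
}
for triple in sorted(apply_rules(facts)):
    print(triple)
```

Neither derived statement is stored explicitly; both follow from the rules, which is how an ontology lets a system answer questions that no single document states.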

Combining Ontologies and NLP

Benefits ❖ combination of “rich” knowledge representation with (semi-)automated knowledge acquisition v ontology provides the domain model v NLP facilitates extraction of knowledge from text documents ❖ improved NLP performance v resolution of ambiguities v richer context models

Problems ❖ lack of ontologies ❖ computational overhead ❖ semantic gap ❖ compatibility between different approaches v e.g. domain model <=> statistical NLP

Examples

Research ❖ AKT - Advanced Knowledge Technologies ❖ SACOT Knowledge Environment Framework

ANNIE - Open Source Information Extraction ❖ AKT - Advanced Knowledge Technologies v http://www.aktors.org/technologies/retrieval/ v http://www.aktors.org/technologies/annie/intro-image.png

❖ SACOT Knowledge Environment Framework ❖ https://analysis.mitre.org/proceedings/Final_Papers_Files/359_Camera_Ready_Paper.pdf

Artequakt ❖ [Alani et al., 2003] Automatic ontology-based knowledge extraction from Web documents. Intelligent Systems, IEEE, 18(1):14–21.

Artequakt Architecture [Alani et al., 2003]

Artequakt Knowledge Extraction (a) extraction results (b) knowledge triples

Artequakt Ontology ❖ automatic ontology population (a) XML file of extracted information (b) corresponding instances and relationships [Alani et al., 2003]

Artequakt Query Templates ❖ common queries that are resolved into the final text [Alani et al., 2003]

Artequakt Biography Result ❖ final rendered biography v based on the previously extracted information [Alani et al., 2003]

Textrunner ❖ demo system for Open Information Extraction v http://ai.cs.washington.edu/projects/open-information-extraction

Textrunner Search ❖ Query: “What kills bacteria?” [Etzioni et

Commercial Products ❖ status and background often unclear v lack of information

Conclusions

References

References 2011-05-02
[Alani et al., 2003] Alani, H., Kim, S., Millard, D., Weal, M., Hall, W., Lewis, P., and Shadbolt, N. (2003). Automatic ontology-based knowledge extraction from web documents. Intelligent Systems, IEEE, 18(1):14–21.
[Aguado de Cea et al., 2002] Aguado de Cea, G., Álvarez-de Mon, I., Pareja-Lora, A., and Plaza-Arteche, R. (2002). Rdf(s)/xml linguistic annotation of semantic web pages. In Proceedings of the 2nd workshop on NLP and XML - Volume 17, NLPXML ’02, pages 1–8, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Apache, 2010] Apache (2010). Apache UIMA. http://uima.apache.org.
[Assal et al., 2010] Assal, H., Seng, J., Kurfess, F., Schwarz, E., and Pohl, K. (2010). Partnering enhanced-nlp with semantic analysis in support of information extraction. In 2nd International Workshop on Ontology-Driven Software Engineering (ODiSE 2010) at the ACM SPLASH 2010 Conference, Reno, Nevada, U.S.A.
[Banko et al., 2007] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In IJCAI’07: Proceedings of the 20th international joint conference on Artificial intelligence, pages 2670–2676, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.
[Bender et al., 2003] Bender, O., Och, F. J., and Ney, H. (2003). Maximum entropy models for named entity recognition. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 148–151, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Borthwick et al., 1998] Borthwick, A., Sterling, J., Agichtein, E., and Grishman, R. (1998). Exploiting diverse knowledge sources via maximum entropy in named entity recognition. In Proceedings of the Sixth Workshop on Very Large Corpora, pages 152–160.
[Bradford, 2006] Bradford, R. (2006). Relationship Discovery in Large Text Collections Using Latent Semantic Indexing. In SIAM conference on Data Mining, workshop on Link Analysis, Counterterrorism and Security. DOI= http://www.siam.org/meetings/sdm06/workproceed/Link%20Analysis/15.pdf.
[Chernov et al., 2006] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. (2006). Extracting semantics relationships between wikipedia categories. In [Völkel and Schaffert, 2006].
[Chiticariu et al., 2010] Chiticariu, L., Krishnamurthy, R., Li, Y., Reiss, F., and Vaithyanathan, S. (2010). Domain adaptation of rule-based annotators for named-entity recognition tasks. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, EMNLP ’10, pages 1002–1012, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Cohen and Sarawagi, 2004] Cohen, W. W. and Sarawagi, S. (2004). Exploiting dictionaries in named entity extraction: combining semi-markov extraction processes and data integration methods. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’04, pages 89–98, New York, NY, USA. ACM.
[Durville and Gandon, 2009] Durville, P. and Gandon, F. (2009). Filling the gap between web 2.0 technologies and natural language processing pipelines with semantic web. Advances in Semantic Processing, International Conference on, pages 109–112.
[Ekbal and Bandyopadhyay, 2009] Ekbal, A. and Bandyopadhyay, S. (2009). Improving the performance of a ner system by post-processing, context patterns and voting. In Li, W. and Mollá-Aliod, D., editors, Computer Processing of Oriental Languages. Language Technology for the Knowledge-based Economy, volume 5459 of Lecture Notes in Computer Science, pages 45–56. Springer Berlin / Heidelberg.
[Esuli and Sebastiani, 2010] Esuli, A. and Sebastiani, F. (2010). Evaluating information extraction. In Proceedings of the 2010 international conference on Multilingual and multimodal information access evaluation: cross-language evaluation forum, CLEF’10, pages 100–111, Berlin, Heidelberg. Springer-Verlag.
[Etzioni et al., 2008] Etzioni, O., Banko, M., Soderland, S., and Weld, D. S. (2008). Open information extraction from the web. Commun. ACM, 51:68–74.
[Gruhl et al., 2009] Gruhl, D., Nagarajan, M., Pieper, J., Robson, C., and Sheth, A. (2009). Context and domain knowledge enhanced entity spotting in informal text. In Bernstein, A., Karger, D., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., and Thirunarayan, K., editors, The Semantic Web - ISWC 2009, volume 5823 of Lecture Notes in Computer Science, pages 260–276. Springer Berlin / Heidelberg.
[Kim and Sengupta, 2007] Kim, H. M. and Sengupta, A. (2007). Extracting knowledge from xml document repository: a semantic web-based approach. Inf. Technol. and Management, 8:205–221.

References cont. 2011-05-02
[Lawson et al., 2010] Lawson, N., Eustice, K., Perkowitz, M., and Yetisgen-Yildiz, M. (2010). Annotating large email datasets for named entity recognition with mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, CSLDAMT ’10, pages 71–79, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Li and Tsai, 2010] Li, S.-T. and Tsai, F.-C. (2010). Constructing tree-based knowledge structures from text corpus. Applied Intelligence, pages 1–12. 10.1007/s10489-010-0243-2.
[Lin et al., 2010] Lin, C. X., Zhao, B., Weninger, T., Han, J., and Liu, B. (2010). Entity relation discovery from web tables and links. In WWW ’10: Proceedings of the 19th international conference on World wide web, pages 1145–1146, New York, NY, USA. ACM.
[Lin et al., 2009] Lin, T., Etzioni, O., and Fogarty, J. (2009). Identifying interesting assertions from the web. In CIKM ’09: Proceeding of the 18th ACM conference on Information and knowledge management, pages 1787–1790, New York, NY, USA. ACM.
[Marrero et al., 2010] Marrero, M., Urbano, J., Morato, J., and Sánchez-Cuadrado, S. (2010). On the definition of patterns for semantic annotation. In Proceedings of the third workshop on Exploiting semantic annotations in information retrieval, ESAIR ’10, pages 15–16, New York, NY, USA. ACM.
[OpenCalais, 2010] OpenCalais (2010). Opencalais. http://www.opencalais.com.
[OpenNLP, 2010] OpenNLP (2010). OpenNLP. http://opennlp.sourceforge.net/.
[Preda et al., 2010] Preda, N., Kasneci, G., Suchanek, F. M., Neumann, T., Yuan, W., and Weikum, G. (2010). Active knowledge: dynamically enriching rdf knowledge bases by web services. In SIGMOD ’10: Proceedings of the 2010 international conference on Management of data, pages 399–410, New York, NY, USA. ACM.
[Reeve and Han, 2005] Reeve, L. and Han, H. (2005). Survey of semantic annotation platforms. In Proceedings of the 2005 ACM symposium on Applied computing, SAC ’05, pages 1634–1638, New York, NY, USA. ACM.
[Roberson and Dicheva, 2007] Roberson, S. and Dicheva, D. (2007). Semi-automatic ontology extraction to create draft topic maps. In Proceedings of the 45th annual southeast regional conference, ACM-SE 45, pages 100–105, New York, NY, USA. ACM.
[Russell and Norvig, 2009] Russell, S. and Norvig, P. (2009). Artificial Intelligence: A Modern Approach. Prentice Hall Press, Upper Saddle River, NJ, USA, 3rd edition.
[Sekine, 2007] Sekine, S. (January 2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30:3–26.
[Seon et al., 2001] Seon, C., Ko, Y., Kim, J., and Seo, J. (2001). Named Entity Recognition using Machine Learning Methods and Pattern-Selection Rules. In Proceedings of the Sixth Natural Language Processing Pacific Rim Symposium, pages 229–236. Citeseer.
[Stanford NER, 2010] Stanford NER (2010). Stanford named entity recognizer. http://nlp.stanford.edu/software/CRF-NER.shtml.
[Tao et al., 2008] Tao, X., Li, Y., Zhong, N., and Nayak, R. (2008). An ontology-based framework for knowledge retrieval. In Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT ’08. IEEE/WIC/ACM International Conference on, volume 1, pages 510–517.
[Vargas-Vera and Celjuska, 2004] Vargas-Vera, M. and Celjuska, D. (2004). Event recognition on news stories and semi-automatic population of an ontology. In Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, WI ’04, pages 615–618, Washington, DC, USA. IEEE Computer Society.
[Völkel and Schaffert, 2006] Völkel, M. and Schaffert, S., editors (2006). Proceedings of the First Workshop on Semantic Wikis – From Wiki To Semantics, Workshop on Semantic Wikis. ESWC 2006.
[Vrandecic and Krötzsch, 2006] Vrandecic, D. and Krötzsch, M. (2006). Reusing ontological background knowledge in semantic wikis. In [Völkel and Schaffert, 2006].
[Wang, 2003] Wang, Z.-H. (2003). Name entity recognition using language models. In Automatic Speech Recognition and Understanding, 2003. ASRU ’03. 2003 IEEE Workshop on, pages 554–559.
[Weikum and Theobald, 2010] Weikum, G. and Theobald, M. (2010). From information to knowledge: harvesting entities and relationships from web sources. In PODS ’10: Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data, pages 65–76, New York, NY, USA. ACM.
[Wimalasuriya and Dou, 2010] Wimalasuriya, D. C. and Dou, D. (2010). Ontology-based information extraction: An introduction and a survey of current approaches. Journal of Information Science, 36(3):306–323.
[WordNet, 2010] WordNet (2010). Wordnet. http://wordnet.princeton.edu.
[Yates et al., 2007] Yates, A., Cafarella, M., Banko, M., Etzioni, O., Broadhead, M., and Soderland, S. (2007). Textrunner: open information extraction on the web. In NAACL ’07: Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations on XX, pages 25–26, Morristown, NJ, USA. Association for Computational Linguistics.
[Zhu et al., 2009] Zhu, J., Nie, Z., Liu, X., Zhang, B., and rong Wen, J. (2009). Statsnowball: a statistical approach to extracting entity relationships. In WWW 2009 MADRID - Track: Data Mining / Session: Statistical Meth