  • Number of slides: 48

Text Mining Applications and Technologies
Sergei Ananyan
Megaputer Intelligence, Inc.
www.megaputer.com
© 2001 Megaputer Intelligence, Inc.

Outline
  • Definitions and application fields
  • Text mining functionality
  • Case study
  • Technology
  • Future developments

Text Mining is a process of
  • extracting new, valid, and actionable knowledge dispersed throughout text documents, and
  • utilizing this knowledge to better organize information for future reference.

Tasks addressed by TM
  • Search and retrieval
  • Semantic analysis
  • Clustering
  • Categorization
  • Feature extraction
  • Ontology building
  • Dynamic focusing

DM and TM comparison
  • Object of investigation
      - Data Mining: Numerical and categorical data
      - Text Mining: Texts
  • Object structure
      - Data Mining: Relational databases
      - Text Mining: Free form texts
  • Goal
      - Data Mining: Predict outcomes of future situations
      - Text Mining: Retrieve relevant information, distill the meaning, categorize and target-deliver
  • Methods
      - Data Mining: Machine learning: SKAT, DT, NN, GA, MBR, MBA
      - Text Mining: Indexing, special neural network processing, linguistics, ontologies
  • Current market size
      - Data Mining: 100,000 analysts at large and midsize companies
      - Text Mining: 100,000 corporate workers and individual users
  • Maturity
      - Data Mining: Broad implementation since 1994
      - Text Mining: Broad implementation starting 2000

TM tasks in detail
  • Information search and retrieval
      - Index-based: Excite, AltaVista
      - Ontology-based: Yahoo, Lycos; Megaputer (ontology building)
      - Boolean search + stemming: HotBot, dtSearch
      - Semantics and linguistics enhanced: Megaputer
  • Dynamic focusing: Megaputer
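The "Boolean search + stemming" entry can be illustrated with a small sketch. It is not any vendor's implementation: the suffix list, the function names (`stem`, `build_index`, `search_and`, `search_or`) and the sample documents are all assumptions made for illustration, and the suffix stripping is a crude stand-in for a real stemmer.

```python
from collections import defaultdict

# Toy suffix list (an assumption); real stemmers use far richer rules.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word: str) -> str:
    """Strip one common English suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: map each stem to the ids of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token.strip(".,;:!?"))].add(doc_id)
    return index


def search_and(index: dict[str, set[str]], *terms: str) -> set[str]:
    """Boolean AND: documents containing every query term after stemming."""
    postings = [index.get(stem(t), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def search_or(index: dict[str, set[str]], *terms: str) -> set[str]:
    """Boolean OR: documents containing at least one query term."""
    return set().union(*(index.get(stem(t), set()) for t in terms))


docs = {
    "a": "Mining texts for patterns",
    "b": "The miner mined coal",
    "c": "Text patterns",
}
index = build_index(docs)
```

With these sample documents, the query `search_and(index, "mining", "text")` matches only document "a", because "Mining" and "mined" reduce to the same stem while "texts" and "Text" both reduce to "text".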

TM tasks in detail (continued)
  • Semantic analysis
      - Neural network and customized dictionaries: Megaputer, Microsystems
      - Bayesian inference: Autonomy
      - Linguistics: Megaputer
  • Clustering and categorization: Megaputer
  • Feature extraction: SRA, Megaputer, IBM

Possible applications
  • Search engines
  • Enterprise portals
  • Knowledge management systems
  • e-Business systems
  • Vertical applications:
      - e-mail categorization and routing
      - Call center notes categorization
      - CRM systems

Typical setups
  • Venture capitalist
      - Search and retrieval
      - Estimation of relevance
      - Summarization and navigation
  • Investment or insurance company
      - Categorization of incoming messages
      - Target-sharing information with employees
      - Structured fragments extraction (numbers)
      - Feature extraction (who owns whom)

Typical setups (continued)
  • Government agency
      - Intelligent information retrieval
      - Chain of events tracing
      - Supplement documents with their summaries for more efficient reference
  • e-Business
      - Match resource description to a user query
      - Learn visitor interests by analyzing the content browsed
      - Match interests to available resources

Text and the Web
  • 99% of analytical information on the Web exists in the form of texts
  • The Web is the place where users routinely encounter new texts
  • 99% of e-Businesses today do not leverage the competitive advantage provided by their content-rich websites because they do not utilize text mining to the extent they should

Example: nytimes.com
  • Extremely rich content
  • Large audience: 10+ million e-mails
  • Generates revenue from advertisers
  • Uses an anonymous survey for login
  • Does a very good job tracking individual pages accessed
  • For any page can furnish demographic profile of its visitors
  • But does not utilize text mining. Cannot see customer-centered view.

Example: nytimes.com (continued)
  • Could significantly increase the value of each visitor to advertisers by doing individualized marketing
  • Rich content and high visitor loyalty are ideal for learning visitors’ interests through text mining
  • This silent surveying is done unobtrusively
  • Privacy is preserved
  • Potential result: increased revenue

Megaputer text mining
  • TextAnalyst*
      - Tech: combination of n-grams and Neural Networks
      - Scope: Analyst’s desktop solution
  • Textractor
      - Tech: Morphological analysis, Semantic analysis (WordNet and its extensions), Statistical and Fuzzy Logic analysis
      - Scope: Enterprise solution
* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

TextAnalyst* Overview
* Microsystems Ltd., a Megaputer business partner. Megaputer has exclusive distribution rights for TextAnalyst.

TextAnalyst
  • TextAnalyst is a tool for semantic analysis, navigation, and search of unstructured texts.
  • TextAnalyst is available as
      - Standalone application
      - SDK of COM components for easy integration

TextAnalyst functionality
  • Distilling the meaning (Semantic Network)
  • Navigation
  • Summarization
  • Topic explication
  • Clustering
  • Dynamic focusing
  • Categorization (TextAnalyst COM)

TextAnalyst
  • Customer base: 300+ installations
  • Sample customers:
      - Ask Jeeves (USA)
      - Pfizer (USA)
      - IMS Health (USA)
      - TRW (USA)
      - The Gallup Organization (USA)
      - McKinsey & Company (USA)
      - Centers for Disease Control (USA)
      - Liberty Mutual (USA)
      - Best Buy (USA)
      - Logicon (USA)
      - France Telecom (France)
      - Net Shepherd (Canada)
      - Skila.com (USA)
      - Dept of Environmental Protection (Australia)
      - US Navy (USA)
      - KPN Research (Netherlands)
      - Dow Chemical (USA)
      - Talkie.com (USA)
      - Clontech (USA)
      - NICE Systems (Israel)

TextAnalyst Underlying Technology

Text image
  • Semantic Network: a list of the most important concepts (words and word combinations) and the relations between them
[Figure: example semantic network. Concept weights: temperature (95), nuclear reactions (98), fusion (100), papers (86), nuclear (100), heat (99), cell (98), Peterson (96). Relation weights shown on the links: 70, 59, 78, 52, 37, 29, 46, 28, 63.]

Semantic network creation
  • Text is a string of characters: letters, spaces, punctuation marks
  • Steps for building the Semantic Network
      - Break the text into words and sentences
      - Push it through an n-character window
      - Feed the patterns to a Recurrent Hierarchical Neural Network and record frequencies
      - Identify relations between concepts (joint occurrence in a sentence)
      - Carry out preliminary semantic network renormalization (Hopfield-like Neural Network) and assign semantic weights
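The steps above can be sketched in a few lines, under strong simplifying assumptions. The sketch works on whole words rather than an n-character window, and it replaces the Recurrent Hierarchical Neural Network with plain frequency counting, so it illustrates only the "record frequencies" and "joint occurrence in a sentence" ideas. The function name `build_semantic_network`, the `min_count` parameter and the 0..100 scaling are hypothetical choices, not TextAnalyst's.

```python
import re
from collections import Counter
from itertools import combinations


def build_semantic_network(text: str, min_count: int = 2):
    """Return (concept_weights, relation_weights), both scaled to 0..100."""
    # Break the text into sentences, then each sentence into words.
    sentences = [s for s in re.split(r"[.!?]+", text.lower()) if s.strip()]
    tokenized = [re.findall(r"[a-z]+", s) for s in sentences]

    # Record frequencies: plain counts stand in for the neural network.
    counts = Counter(word for words in tokenized for word in words)
    concepts = {w for w, c in counts.items() if c >= min_count}

    # Identify relations: two concepts occurring in the same sentence.
    pair_counts: Counter = Counter()
    for words in tokenized:
        present = sorted(set(words) & concepts)
        pair_counts.update(combinations(present, 2))

    # Scale both kinds of weight so that the largest value is 100.
    top_concept = max((counts[w] for w in concepts), default=1)
    top_pair = max(pair_counts.values(), default=1)
    concept_weights = {w: round(100 * counts[w] / top_concept) for w in concepts}
    relation_weights = {p: round(100 * n / top_pair) for p, n in pair_counts.items()}
    return concept_weights, relation_weights


sample = (
    "Fusion releases heat. Fusion needs temperature. "
    "Heat raises temperature. Fusion heat."
)
concept_weights, relation_weights = build_semantic_network(sample)
```

No auxiliary-word filtering is done here, so on real text frequent function words would also surface as "concepts"; the deck handles that with a separate grammatical filter.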

General & Text-specific tasks
  • Parse and reorganize input into sequences of words joined by concatenation and separation signs
  • Recognize and remove auxiliary words and flective morphemes
  • Recognize, count and store stem morphemes
  • Identify words sharing stem morphemes

Hierarchical Recurrent NN

Hierarchical Recurrent NN

General & Text-specific tasks
  • Identify relationships
      - Text: joint occurrence in sentences
  • Preliminary SN renormalization: optimization task similar to Hopfield network
  • Association of concepts in SN with sentences and context in original text
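The slides do not give the update rule used for renormalization. As a rough intuition only, the sketch below swaps in a plain iterative relaxation (a power-iteration style update, not a Hopfield network): each concept's weight is repeatedly recomputed from the weights of the concepts it is related to, so that well-connected concepts end up with higher weights. The function name `renormalize`, the input format and the 0..100 scale are assumptions.

```python
def renormalize(relations: dict[tuple[str, str], float], iterations: int = 20) -> dict[str, int]:
    """Iteratively rescale concept weights from symmetric relation strengths.

    relations maps an unordered concept pair to its co-occurrence strength.
    Returns a weight in 0..100 for every concept that appears in a pair.
    """
    concepts = sorted({c for pair in relations for c in pair})
    if not concepts:
        return {}
    weights = {c: 1.0 for c in concepts}
    for _ in range(iterations):
        updated = {c: 0.0 for c in concepts}
        # Each relation passes weight in both directions.
        for (a, b), strength in relations.items():
            updated[a] += strength * weights[b]
            updated[b] += strength * weights[a]
        # Normalize so the strongest concept always has weight 1.0.
        top = max(updated.values()) or 1.0
        weights = {c: v / top for c, v in updated.items()}
    return {c: round(100 * w) for c, w in weights.items()}


weights = renormalize({
    ("fusion", "heat"): 2,
    ("fusion", "temperature"): 1,
    ("heat", "temperature"): 1,
})
```

In this toy input "fusion" and "heat" are interchangeable and tie for the top weight, while "temperature", which has weaker links, settles at a lower value.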

Case study
  • IRLP provides R&D assistance and information services to Indiana’s small businesses and governmental units
  • IRLP searches SBIR and the Commerce Business Daily to identify research funding opportunities for its clients.
“TextAnalyst was able to find the necessary matches even for those clients where existing search program was incompatible.”
-- Cindy Moore, Marketing Coordinator, IRLP

Customer quotes

TextAnalyst supports medical research at Centers for Disease Control
Eleanor McLellan
Data Manager / Analyst
Centers for Disease Control
Atlanta, GA
"TextAnalyst is able to efficiently handle numerous and often large (90+ pages apiece) text files without any problem. Furthermore, the program is extremely user-friendly."

TextAnalyst helps process texts at Clontech
Nikolai Kalnin, Ph.D.
Team Leader, Bioinformatics Group
CLONTECH Laboratories, Inc.
Palo Alto, CA
"TextAnalyst has been selected as the only text analysis tool capable of establishing relations between terms. It is reasonably priced, easy to install and operate."

TextAnalyst saves time and resources for CaseBank
Kalyan Gupta, Ph.D.
Director, Research
CaseBank Technologies Inc.
Brampton, Ontario
"TextAnalyst is used at CaseBank to identify and assess the contents of electronic repositories of troubleshooting and maintenance information. It saves case preparation time and allows CaseBank to be more responsive to its customer's knowledge retrieval needs."

Future developments
  • Text categorization (now implemented in TextAnalyst COM)
  • Thesaurus-based text retrieval
  • Integration with Web technologies

TextAnalyst evaluation
We invite you to download a FREE evaluation copy of TextAnalyst from www.megaputer.com and try it hands-on, either by following the provided step-by-step lessons or by exploring your own data.

Textractor™ Technology and Applications

Textractor capabilities
  • Key senses extraction
  • Hierarchical clustering
  • Categorization
  • Summarization
  • Intelligent search
  • Feature extraction

Textractor applications
  • General
      - Automated email categorization and routing (categories can be provided by the user or determined by the system)
      - Competitive intelligence
      - Knowledge extraction from call center notes (example: occupational hazard determination)
      - Knowledge-based executive reporting system (one-glance knowledge visualization)
      - Flexible searching for support documentation (semantic relations between terms: synonyms, hyponyms, meronyms)
  • Insurance
      - Clustering of claims and ontology building (hierarchical organization of textual data)
      - Automated feature extraction and claim tagging

Textractor analysis steps
  • Morphological analysis
  • Syntactic analysis
  • Semantic analysis: WordNet filtering (synonymy, antonymy, hyper/hyponymy and holo/meronymy)
  • Statistical analysis (frequency of terms against background frequencies)
  • Context analysis (polysemy resolving and term collocations)
  • Semantic Network comparison
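The statistical step, comparing term frequencies against background frequencies, can be illustrated with a small sketch. It is a generic frequency-ratio score, not Textractor's actual statistic: the function name `key_terms`, the `min_ratio` threshold and the add-one smoothing are assumptions chosen for simplicity.

```python
import re
from collections import Counter


def key_terms(document: str, background: str, min_ratio: float = 2.0) -> dict[str, float]:
    """Score terms by how much more frequent they are in the document
    than in a background corpus; keep those at or above min_ratio."""
    doc = Counter(re.findall(r"[a-z]+", document.lower()))
    bg = Counter(re.findall(r"[a-z]+", background.lower()))
    doc_total = sum(doc.values())
    bg_total = sum(bg.values())
    vocab_size = len(set(doc) | set(bg))

    scores = {}
    for term, count in doc.items():
        doc_rate = count / doc_total
        # Add-one smoothing keeps terms unseen in the background finite.
        bg_rate = (bg[term] + 1) / (bg_total + vocab_size)
        ratio = doc_rate / bg_rate
        if ratio >= min_ratio:
            scores[term] = ratio
    # Highest ratio first.
    return dict(sorted(scores.items(), key=lambda item: item[1], reverse=True))


claim = "the claim describes a collision and the collision damaged the car"
general = "the cat sat on the mat and the dog ate the food in the house"
terms = key_terms(claim, general)
```

On this toy input "collision" ranks first, while "the" and "and" are dropped because they are about as common in the background text as in the claim.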

WordNet
  • WordNet is a comprehensive semantically organized lexical database for English: www.cogsci.princeton.edu/~wn
  • Textractor provides the ability to expand and edit WordNet for a specific application field.

Semantic term relationships
  • Synonyms
      - Accident – Collision – Wreck
  • Hyper/Hyponyms
      - Bird (hypernym): Eagle, Hawk, Pigeon (hyponyms)
  • Holo/Meronyms
      - Car (holonym): Motor, Windshield, Tire (meronyms)
  • Antonyms
      - Cold <> Hot, Deep <> Shallow
  • Polysemy
      - Commercial Bank, River Bank
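One way to picture how such relationships support search is a small in-memory thesaurus used for query expansion. The sketch below is hypothetical: the `Thesaurus` class and its methods are invented for illustration and are unrelated to the WordNet or Textractor APIs. The sample entries are the ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class Thesaurus:
    """Toy store for three of the relation types listed above."""
    synonyms: dict[str, set[str]] = field(default_factory=dict)
    hyponyms: dict[str, set[str]] = field(default_factory=dict)   # hypernym -> hyponyms
    meronyms: dict[str, set[str]] = field(default_factory=dict)   # holonym -> meronyms

    def add_synonyms(self, *words: str) -> None:
        """Register a group of mutually synonymous words."""
        group = {w.lower() for w in words}
        for word in group:
            self.synonyms.setdefault(word, set()).update(group - {word})

    def expand(self, term: str, include_hyponyms: bool = True) -> set[str]:
        """Return the term plus its synonyms and, optionally, its hyponyms."""
        term = term.lower()
        expanded = {term} | self.synonyms.get(term, set())
        if include_hyponyms:
            expanded |= self.hyponyms.get(term, set())
        return expanded


thesaurus = Thesaurus()
thesaurus.add_synonyms("accident", "collision", "wreck")
thesaurus.hyponyms["bird"] = {"eagle", "hawk", "pigeon"}
thesaurus.meronyms["car"] = {"motor", "windshield", "tire"}
```

A query for "accident" would then also match documents that mention "collision" or "wreck", and a query for "bird" could be widened to its hyponyms when broader recall is wanted.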

Textractor architecture
[Diagram: components]
  • Data sources
  • Filters and DW interfaces
  • Text Mining Engines
      - Core TM engines
          Morphological Analysis
          Syntactic Analysis (Link Parser)
          Semantic Analysis (WordNet, Field-specific WordNet Extensions, WordNet Extension Editor)
      - Application-oriented TM engines
  • Stored Indices

Textractor text mining engines
  • Core TM engines
      - Text indexer
      - Formal search query creator
      - Key senses extractor
  • Application-oriented TM engines
      - Intelligent Searcher (synonyms, hyper/hyponyms, term proximity, frequencies)
      - Document tagging
      - Text Categorizer
      - Text Clusterizer
      - Feature extractor
      - Database enrichment and mining

Any Questions?
Call Megaputer at (812) 330-0110 or write:
120 W Seventh Street, Suite 310
Bloomington, IN 47404 USA
info@megaputer.com

Appendix A
TextAnalyst technology details

Two aspects of text
  • Sequence of characters characterized by patterns that represent information recognized by humans
  • Structured sequence of lexical units organized together according to morphological and syntactic rules (morphemes, auxiliary lexical units, syntactic members, sentences, etc.)

Semantics of text
  • Humans rely on multimodal associations for creating semantic models
  • Standalone text: semantics is formal, but still useful
  • Meaning of a concept: the collection of relations of this concept to other concepts in the text (constructive definition)

Lexical vs. Grammatical
  • Lexical meaning of a word: determined by the stem morpheme (word combinations: chains of morphemes)
  • Grammatical meaning: determined by morphemes (prefixes, endings, etc.) and auxiliary semantic units (articles, prepositions, etc.)
  • Grammatical chains: word sequences with the stem morphemes extracted, which serve as frames for contents

Semantic structure of texts
  • Single text: semantic analysis can be performed, but is not sufficient; a knowledge base is needed against which the text can be analyzed
  • Analysis of a large number of texts from diverse fields => Grammatical structure of the language
  • Analysis of a large number of texts from the field of interest => Knowledge Base

Grammatical + Lexical = Semantic
  • Grammatical dictionaries of morphemes and auxiliary words of a language: obtained by applying a threshold transformation to a NN trained on a large corpus of texts from diverse fields
  • The trained “grammatical NN” acts as a filter. The “lexical” NN is connected to its output.
  • Combining elements from both NNs yields the list of concepts for the Semantic Network (after relational renormalization)
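The idea of deriving a grammatical dictionary by thresholding, then using it as a filter in front of the lexical stage, can be approximated without any neural network. The sketch below swaps in simple document-frequency counting: words that occur in most texts from diverse fields are treated as auxiliary. The function names and the 0.8 threshold are assumptions, and the sketch works on whole words only, not on morphemes.

```python
import re


def grammatical_dictionary(corpus: list[str], threshold: float = 0.8) -> set[str]:
    """Words found in at least `threshold` of the texts count as auxiliary."""
    doc_sets = [set(re.findall(r"[a-z]+", text.lower())) for text in corpus]
    vocabulary = set().union(*doc_sets)
    total = len(doc_sets)
    return {
        word
        for word in vocabulary
        if sum(word in doc for doc in doc_sets) / total >= threshold
    }


def lexical_terms(text: str, auxiliary: set[str]) -> list[str]:
    """Filter step: drop auxiliary words and pass the rest to the lexical stage."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in auxiliary]


diverse_corpus = [
    "The reactor of the plant is hot",
    "The price of the stock is low",
    "The wing of the bird is broken",
    "The court of appeal is closed",
]
auxiliary = grammatical_dictionary(diverse_corpus)
remaining = lexical_terms("The temperature of the reactor is high", auxiliary)
```

Because the four sample texts come from unrelated fields, only "the", "of" and "is" clear the threshold, and the filtered sentence keeps its content words.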