

  • Number of slides: 24

Text Analytics Workshop
Applications
Tom Reamy
Chief Knowledge Architect
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com

Agenda
§ Text Analytics Applications
  – Integration with Search – Faceted Navigation
  – Integration with ECM
    • Metadata
    • Auto-categorization
  – Platform for Information Applications
    • Enterprise – internal and external
    • Commercial
    • Structure for Social

Text Analytics and Search - Elements
§ Facet – orthogonal dimension of metadata
§ Entity / Noun Phrase – metadata value of a facet
§ Entity extraction – feeds facets, signature, ontologies
§ Taxonomy and categorization rules
§ Auto-categorization – aboutness, subject facets
§ People – tagging, evaluating tags, fine-tuning rules and taxonomy

Essentials of Facets
§ Facets are not categories
  – Categories are what a document is about – limited number
  – Entities are contained within a document – any number
§ Facets are orthogonal – mutually exclusive – dimensions
  – An event is not a person is not a document is not a place.
§ Facets – variety – of units, of structure
  – Numerical range (price), Location – big to small
  – Alphabetical, Hierarchical – taxonomic
§ Facets are designed to be used in combination
  • Wine where color = red, price = excessive, location = California,
  • And sentiment = snotty
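The wine example above can be sketched as a filter over independent facets. This is only an illustration: the `Wine` record, the catalog entries, and the reading of "excessive" as a price of 100 or more are assumptions, not part of the workshop material.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Wine:
    # Hypothetical record: each field is one facet value.
    name: str
    color: str
    price: float
    location: str


def filter_by_facets(items, color=None, price_range=None, location=None):
    """Keep the items that satisfy every facet constraint that was supplied."""
    results = []
    for item in items:
        if color is not None and item.color != color:
            continue
        if price_range is not None:
            low, high = price_range
            if not (low <= item.price <= high):
                continue
        if location is not None and item.location != location:
            continue
        results.append(item)
    return results


# Made-up catalog entries for illustration only.
CATALOG = [
    Wine("Hypothetical Cabernet", "red", 120.0, "California"),
    Wine("Hypothetical Merlot", "red", 15.0, "California"),
    Wine("Hypothetical Chardonnay", "white", 130.0, "California"),
    Wine("Hypothetical Rioja", "red", 110.0, "Spain"),
]

# color = red, price = "excessive" (assumed here to mean >= 100), location = California
expensive_red_californian = filter_by_facets(
    CATALOG, color="red", price_range=(100.0, float("inf")), location="California"
)
print([wine.name for wine in expensive_red_californian])
```

Because each facet is an independent dimension, any constraint can be dropped or added without touching the others.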

Advantages of Faceted Navigation
§ More intuitive – easy to guess what is behind each door
  • Simplicity of internal organization
  • 20 questions – we know and use
§ Dynamic selection of categories
  • Allow multiple perspectives
  • Ability to Handle Compound Subjects
§ Systematic Advantages – fewer elements
  – 4 facets of 10 nodes = 10,000 node taxonomy
  – Ability to Handle Compound Subjects
§ Flexible – can be combined with other navigation elements
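The "4 facets of 10 nodes = 10,000 node taxonomy" figure is the number of value combinations, 10 × 10 × 10 × 10. A small sketch of that arithmetic, with hypothetical function names, contrasts it with the 40 nodes that actually have to be maintained:

```python
import math


def equivalent_taxonomy_nodes(facet_sizes):
    """Leaf nodes a single pre-coordinated taxonomy would need: the product of facet sizes."""
    return math.prod(facet_sizes)


def facet_nodes_to_maintain(facet_sizes):
    """Nodes that exist when the facets are kept separate: the sum of facet sizes."""
    return sum(facet_sizes)


four_facets_of_ten = [10, 10, 10, 10]
print(equivalent_taxonomy_nodes(four_facets_of_ten))  # 10000
print(facet_nodes_to_maintain(four_facets_of_ten))    # 40
```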

Developing Facets: Tools and Techniques
Software Tools – Entity Extraction
§ Dictionaries – variety of entities, coverage, specialty
  – Cost of update – service or in-house
  – Inxight – 50+ predefined entity types
  – Nstein – 800,000 people, 700,000 locations, 400,000 organizations
§ Rules
  – Capitalization, text – Mr., Inc.
  – Advanced – proximity and frequency of actions, associations
  – Need people to continually refine the rules
§ Entities and Categorization
  – Total number and pattern of entities = a type of aboutness of the document – Bar Code, Fingerprint
  – SAS – integration of entities (concepts) and categorization
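As a rough illustration of the simple rules mentioned above (capitalization plus textual cues such as "Mr." and "Inc."), the following sketch uses two regular expressions. The patterns, the sample sentence, and the function name are assumptions made for this example; they do not describe how Inxight, Nstein, or SAS work.

```python
import re

# Rule 1 (assumed): a courtesy title followed by capitalized words suggests a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")

# Rule 2 (assumed): capitalized words followed by a company suffix suggest an organization.
ORG_RULE = re.compile(r"\b([A-Z][A-Za-z]+(?:\s+[A-Z][A-Za-z]+)*),?\s+(?:Inc|Corp|Ltd)\.")


def extract_entities(text):
    """Apply both rules and return candidate entities grouped by type."""
    return {
        "person": PERSON_RULE.findall(text),
        "organization": ORG_RULE.findall(text),
    }


sample = "Mr. Ford founded Ford Motor Company, Inc. in Michigan."
print(extract_entities(sample))
```

Rules this simple miss many cases and produce false matches, which is why the slide stresses that people are needed to keep refining them.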

Three Environments
§ E-Commerce
  – Catalogs, small uniform collections of entities
  – Uniform behavior – buy this
§ Enterprise
  – More content, more types of content
  – Enterprise Tools – Search, ECM
  – Publishing Process – tagging, metadata standards
§ Internet
  – Wildly different amount and type of content, no taggers
  – General Purpose – Flickr, Yahoo
  – Vertical Portal – selected content, no taggers

Three Environments: E-Commerce


Enterprise Environment – When and how to add metadata
§ Enterprise Content – different world than eCommerce
  – More Content, more kinds, more unstructured
  – Not a catalog to start – less metadata and structured content
  – Complexity – not just content but variety of users and activities
§ Combination of human and automatic metadata – ECM
  – Software aided – suggestions, entities, ontologies
§ Enterprise – Question of Balance / strategy
  – More facets = more findability (up to a point)
  – Fewer facets = lower cost to tag documents
§ Issues
  – Not enough facets
  – Wrong set of facets – business not information
  – Ill-defined facets – too complex internal structure

Facets and Taxonomies
Enterprise Environment – Taxonomy, 7 facets
§ Taxonomy of Subjects / Disciplines:
  – Science > Marine microbiology > Marine toxins
§ Facets:
  – Organization > Division > Group
  – Clients > Federal > EPA
  – Instruments > Environmental Testing > Ocean Analysis > Vehicle
  – Facilities > Division > Location > Building X
  – Methods > Social > Population Study
  – Materials > Compounds > Chemicals
  – Content Type – Knowledge Asset > Proposals
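One way to hold a subject taxonomy plus hierarchical facets on a single document is to store each value as a path and match on path prefixes. The sketch below is an assumption about representation only; the field names and helper functions are hypothetical, and the paths are taken from the slide.

```python
def parse_path(path):
    """Turn 'A > B > C' into the tuple ('A', 'B', 'C')."""
    return tuple(part.strip() for part in path.split(">"))


def is_under(value, ancestor):
    """True when `value` equals `ancestor` or sits below it in the hierarchy."""
    return value[: len(ancestor)] == ancestor


# Hypothetical document record: one subject path plus one path per facet.
document = {
    "subject": parse_path("Science > Marine microbiology > Marine toxins"),
    "clients": parse_path("Clients > Federal > EPA"),
    "methods": parse_path("Methods > Social > Population Study"),
}

# Selecting a broad node matches everything tagged beneath it.
print(is_under(document["clients"], parse_path("Clients > Federal")))
print(is_under(document["subject"], parse_path("Science > Marine microbiology")))
```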

External Environment – Text Mining, Vertical Portals
§ Internet Content
  – Scale – impacts design and technology – speed of indexing
  – Limited control – Association of publishers to selection of content to none
  – Major subtypes – different rules – metadata and results
§ Complex queries and alerts
  – Terrorism taxonomy + geography + people + organizations
§ Text Mining
  – General or specific content and facets and categories
  – Dedicated tools or component of Portal – internal or external
§ Vertical Portal
  – Relatively homogeneous content and users
  – General range of questions

Internet Design
§ Subject Matter taxonomy – Business Topics
  – Finance > Currency > Exchange Rates
§ Facets
  – Location > Western World > United States
  – People – Alphabetical and/or Topical - Organization > Corporation > Car Manufacturing > Ford
  – Date – Absolute or range (1-1-01 to 1-1-08, last 30 days)
  – Publisher – Alphabetical and/or Topical – Organization
  – Content Type – list – newspapers, financial reports, etc.




Integrated Facet Application Design Issues - General
§ What is the right combination of elements?
  – Faceted navigation, metadata, browse, search, categorized search results, file plan
§ What is the right balance of elements?
  – Dominant dimension or equal facets
  – Browse topics and filter by facet
§ When to combine search, topics, and facets?
  – Search first and then filter by topics / facet
  – Browse/facet front end with a search box
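The "search first and then filter by topics / facet" pattern can be sketched as three steps: run the query, count facet values across the hits, and narrow by a chosen value. The documents, field names, and substring matching below are assumptions for illustration; a real search engine would supply its own ranking and facet counts.

```python
from collections import Counter

# Made-up documents; "location" and "content_type" stand in for facets.
DOCS = [
    {"title": "Exchange rates climb", "location": "United States", "content_type": "newspaper"},
    {"title": "Exchange rates and exports", "location": "United States", "content_type": "financial report"},
    {"title": "Exchange rates in Europe", "location": "Germany", "content_type": "newspaper"},
    {"title": "Car manufacturing outlook", "location": "United States", "content_type": "newspaper"},
]


def search(docs, term):
    """Naive keyword search: case-insensitive substring match on the title."""
    return [doc for doc in docs if term.lower() in doc["title"].lower()]


def facet_counts(results, facet):
    """Count how many hits carry each value of the given facet."""
    return Counter(doc[facet] for doc in results if facet in doc)


def refine(results, facet, value):
    """Narrow a result set to the hits with the chosen facet value."""
    return [doc for doc in results if doc.get(facet) == value]


hits = search(DOCS, "exchange rates")
print(facet_counts(hits, "location"))
print([doc["title"] for doc in refine(hits, "location", "United States")])
```

The browse-first alternative on the slide would apply `refine` to the whole collection before any query is typed.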

Integrated Facet Application Design Issues - General
§ Homogeneity of Audience and Content
§ Model of the Domain – broad
  – How many facets do you need?
  – More facets and let users decide
  – Allow for customization – can’t define a single set
§ User Analysis – tasks, labeling, communities
  • Issue – labels that people use to describe their business and labels that they use to find information
§ Match the structure to domain and task
  – Users can understand different structures

Automatic Facets – Special Issues
§ Scale requires more automated solutions
  – More sophisticated rules
§ Rules to find and populate existing metadata
  – Variety of types of existing metadata – Publisher, title, date
  – Multiple implementation Standards – Last Name, First / First Name, Last
§ Issue of disambiguation:
  – Same person, different name – Henry Ford, Mr. Ford, Henry X. Ford
  – Same word, different entity – Ford and Ford
§ Number of entities and thresholds per results set / document
  – Usability, audience needs
§ Relevance Ranking – number of entities, rank of facets
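A minimal sketch of the "same person, different name" problem, assuming names arrive either as "Last, First" or "First [Middle] Last" and possibly with a courtesy title. The parsing rules and function names are hypothetical, and the match is only a candidate: telling apart "Ford and Ford" (same word, different entity) needs context that this sketch does not use.

```python
TITLES = {"mr", "mrs", "ms", "dr"}


def parse_name(raw):
    """Return (first, last); first is None when only a surname is present."""
    raw = raw.strip()
    if "," in raw:
        # "Last, First" convention
        last, _, rest = raw.partition(",")
        tokens = rest.split()
        return (tokens[0] if tokens else None, last.strip())
    # "First [Middle] Last" convention, with courtesy titles dropped
    tokens = [t for t in raw.split() if t.rstrip(".").lower() not in TITLES]
    if not tokens:
        return (None, "")
    first = tokens[0] if len(tokens) > 1 else None
    return (first, tokens[-1])


def same_person_candidate(name_a, name_b):
    """Surnames must agree; first names must agree when both are known."""
    first_a, last_a = parse_name(name_a)
    first_b, last_b = parse_name(name_b)
    if last_a.lower() != last_b.lower():
        return False
    if first_a is None or first_b is None:
        return True
    return first_a.lower() == first_b.lower()


for variant in ["Henry Ford", "Mr. Ford", "Henry X. Ford", "Ford, Henry"]:
    print(variant, "->", parse_name(variant))
```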

Putting it all together – Infrastructure Solution
§ Facets, Taxonomies, Software, People
§ Combine formal power with ability to support multiple user perspectives
§ Facet System – interdependent, map of domain
§ Entity extraction – feeds facets, signatures, ontologies
§ Taxonomy & Auto-categorization – aboutness, subject
§ People – tagging, evaluating tags, fine-tuning rules and taxonomy
§ The future is the combination of simple facets with rich taxonomies with complex semantics / ontologies

Putting it all together – Infrastructure Solution
§ Integration with ECM
  – Central Team
    • Metadata – Create dictionaries of entities
    • Develop text analytics catalogs
  – Publishing Process
    • Software suggests entities, categorization
    • Author's task is simple – yes or no, not thinking of keywords
§ Enterprise Search
  – Integrate at metadata level – build advanced presentation and refine results
  – Integrate into relevance

Text Analytics Platform – Multiple Applications
§ Platform for Information Applications
  – Content Aggregation
  – Duplicate Documents – save millions!
  – Text Mining – BI, CI – sentiment analysis
  – Social – Hybrid folksonomy / taxonomy / auto-metadata
  – Social – expertise, categorize tweets and blogs, reputation
  – Ontology – travel assistant – SIRI
§ Use your Imagination!
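The slides do not say how duplicate documents are found. One common approach, shown here purely as an assumed illustration, compares sets of overlapping word sequences (shingles) with Jaccard similarity. The shingle size and the sample texts are arbitrary choices.

```python
import re


def shingles(text, size=3):
    """Set of overlapping word sequences of length `size`, ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i : i + size]) for i in range(len(words) - size + 1)}


def jaccard(set_a, set_b):
    """Shared shingles divided by all shingles; 1.0 means identical sets."""
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up texts: doc_b differs from doc_a only in case and punctuation.
doc_a = "Quarterly report: exchange rates rose sharply."
doc_b = "quarterly report exchange rates rose sharply"
doc_c = "Marine toxins were found in the ocean samples."

print(jaccard(shingles(doc_a), shingles(doc_b)))
print(jaccard(shingles(doc_a), shingles(doc_c)))
```

A similarity threshold for treating two documents as duplicates would have to be tuned on real content.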

Text Analytics Platform – Multiple Applications
§ SIRI – Travel Assistant

Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com