- Number of slides: 18
Text Analytics: A Tool for Taxonomy Development
Tom Reamy, Chief Knowledge Architect, KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
§ Introduction
§ Project: Update ACM taxonomy – after 12+ years
§ Information Environment
§ Text Mining / Text Analytics
§ Multiple Methods / Reports
§ Conclusion
Introduction: KAPS Group
§ Knowledge Architecture Professional Services – Network of Consultants
§ Applied Theory – Faceted & emotion taxonomies, natural categories
§ Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics, Social Media development, consulting
– Text Analytics Quick Start – Audit, Evaluation, Pilot
§ Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
§ Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc.
§ Program Chair – Text Analytics World – March 29 - April 1 - SF
§ Presentations, Articles, White Papers – www.kapsgroup.com
§ Current – Book – Text Analytics: How to Conquer Information Overload, Get Real Value from Social Media, and Add Smart Text to Big Data
Introduction: Approach
§ Is Automatic Taxonomy Development Here Yet? Not yet, but it is getting closer
§ Hybrid: Taxonomists, SMEs, database analysts, text analysts
– Text mining software – basic text analysis – power
– Text analytics software – brains
§ New taxonomy terms & structure
– Old = indexing, authors adding tags & keywords
– New = auto-tagging, applications
Information Environment
§ Existing Taxonomy: Computing Classification System
§ Content:
– Database export of Guide to the Computing Literature bibliographic records (.txt; approximately 7 GB in 58 files)
– Statistical distribution of CCS categories across the Digital Library and Guide to Computing Literature (Excel; 4 files)
– ACM Digital Library full text files (PDFs and XML metadata, including CCS categories; approximately 170 GB in 240,000 files)
– Ralston Encyclopedia of Computer Science (PDFs and HTML of each article with XML metadata, including CCS categories; approximately 350 MB in 1,850 files)
Text Analytics in Taxonomy Development: Case Study – Multiple Methods
§ Text Mining – terms in documents – frequency, date, source, etc.
§ Text Preparation – create multiple filters
§ Quality – important terms, co-occurring terms
§ Time savings – only feasible way to scan documents
§ Clustering – suggested categories, chunking for editors
– Clustering within clusters – explore
§ Entity Extraction – people, organizations, programming languages, hardware/devices, etc.
§ Joint Work Sessions – interactive exploration
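The text mining step above (term frequency and co-occurring terms, after filtering) can be illustrated with a minimal Python sketch. It is not the software used in the project: the noise-word list, the function names, and the three sample documents are all hypothetical, and co-occurrence is assumed to mean "both terms appear in the same document".

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical noise-word filter; a real project would tune this per corpus.
NOISE_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not noise words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in NOISE_WORDS]


def term_and_pair_counts(documents):
    """Count, per term and per term pair, the number of documents containing it."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Use each term once per document, sorted so pairs have a stable order.
        terms = sorted(set(tokenize(doc)))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts


# Made-up sample documents, for illustration only.
docs = [
    "Clustering of documents for taxonomy development",
    "Entity extraction and clustering in text mining",
    "Text mining for taxonomy development",
]
term_counts, pair_counts = term_and_pair_counts(docs)
```

Sorting `pair_counts.most_common()` gives editors a ranked list of candidate term pairs to review; pairs are counted per document, so long documents do not dominate.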
Case Study – Taxonomy Development
Multiple Sets of Reports
§ Keyword Frequency
– First Pass – 3,026
– Total – 508,941 (Get from Big Database)
– Sub-Totals
• Year Pre-1998, By Year, By 5 year blocks
• Map to other variables – Journals, Authors – basis for communities
§ Keywords in Abstract/Title
§ Cluster analysis of keyword-abstract-title
§ Search Terms in keyword-abstract-title
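A keyword frequency report with sub-totals by year block could be sketched as follows. This is an illustration under stated assumptions, not the project's actual reporting code: the record layout (a year plus a list of author keywords), the function names, and the sample records are hypothetical; only the pre-1998 cutoff and the 5-year blocks come from the slide.

```python
from collections import Counter, defaultdict


def year_block(year, cutoff=1998, width=5):
    """Label a year as 'pre-1998' or as a 5-year block such as '1998-2002'."""
    if year < cutoff:
        return f"pre-{cutoff}"
    start = cutoff + ((year - cutoff) // width) * width
    return f"{start}-{start + width - 1}"


def keyword_subtotals(records):
    """Return overall keyword counts and keyword counts per year block."""
    totals = Counter()
    by_block = defaultdict(Counter)
    for year, keywords in records:
        for kw in keywords:
            key = kw.strip().lower()  # normalize case and stray whitespace
            totals[key] += 1
            by_block[year_block(year)][key] += 1
    return totals, by_block


# Made-up bibliographic records: (publication year, author keywords).
records = [
    (1995, ["Neural Networks", "Databases"]),
    (1999, ["databases", "data mining"]),
    (2004, ["Data Mining", "cloud computing"]),
    (2011, ["cloud computing"]),
]
totals, by_block = keyword_subtotals(records)
```

The same grouping idea extends to the other variables on the slide: keying `by_block` on journal or author instead of year block gives the per-journal and per-author views.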
Entity Extraction – Company, Internet, Organization, Title
Multiple Methods - Reports
§ Spreadsheets – static reports
§ Database query reports
– Create multiple slices, views, filters
– Working reports – eliminate more noise words
§ Multiple mapping – extractions, author tags & keywords
– Map – frequency in abstracts, titles, articles
§ Search logs – terms and phrases
§ Date ranges – trend reports – per terms, new words
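The "new words" trend report by date range can be sketched as below. This is a hedged illustration only: the function name, the record layout, and the sample data are hypothetical, and a term is treated as "new" in a period if it did not occur in any earlier period passed to the function.

```python
from collections import Counter


def new_terms_by_period(records, periods):
    """For each (start, end) period, list terms not seen in any earlier period."""
    seen = set()
    report = {}
    for start, end in sorted(periods):
        # Count every term used in records whose year falls inside the period.
        counts = Counter(
            term
            for year, terms in records
            if start <= year <= end
            for term in terms
        )
        report[(start, end)] = sorted(t for t in counts if t not in seen)
        seen.update(counts)
    return report


# Made-up records: (publication year, terms found in the record).
records = [
    (1996, ["hypertext", "databases"]),
    (2001, ["databases", "web services"]),
    (2009, ["web services", "cloud computing"]),
]
periods = [(1990, 1999), (2000, 2005), (2006, 2012)]
report = new_terms_by_period(records, periods)
```

Terms that surface as new in recent periods are candidates for new taxonomy nodes, while terms that stop appearing may mark categories to retire or merge.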
Conclusions
§ Auto-taxonomy not here - Yet
§ Scale requires semi-automated solution
§ Human effort – initial design, text preparation
– Now would add more auto-categorization
§ Human effort – analysis & refinement – of queries, text mining, and taxonomy
§ Simple taxonomies are better – part of information ecosystem
– Lower levels of terms – into auto-tagging rules
§ Early 2015: New Book:
– Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text Into Big Data
– Title might be shorter, but it will cover all you need to know
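The idea of moving lower-level terms out of the taxonomy and into auto-tagging rules might look like the following minimal sketch. It is an assumption about one simple way to do this, not the rule language of any particular text analytics product: the categories, terms, `RULES` table, and `auto_tag` function are hypothetical.

```python
import re

# Hypothetical rules: each top-level category keeps its lower-level terms
# as matching evidence instead of as taxonomy nodes.
RULES = {
    "Machine learning": ["neural network", "support vector machine"],
    "Databases": ["sql", "query optimization"],
}


def auto_tag(text, rules=RULES, min_hits=1):
    """Return the categories whose terms match the text at least min_hits times."""
    tags = []
    for category, terms in rules.items():
        # Count how many distinct rule terms occur as whole words or phrases.
        hits = sum(
            1
            for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE)
        )
        if hits >= min_hits:
            tags.append(category)
    return tags


tags = auto_tag("We train a neural network over SQL query logs.")
```

Keeping the taxonomy itself small and pushing detailed vocabulary into rules like these means new terms can be added without restructuring the hierarchy.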
Questions?
Tom Reamy
tomr@kapsgroup.com
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com


