  • Number of slides: 24

Overview of Text Mining Expertise @ SCD

Introduction
• Text mining team @ SCD
  • Started around 2000
  • Currently 1 postdoc, 4 Ph.D. students
  • Tailored, generic text mining analysis
  • Diverse application areas
  • Several collaborations and projects
  • Supported by more general SCD expertise in, among others:
    • Data mining
    • Numerical linear algebra
    • Optimization
Text Mining @ SCD

Strategic mission
• To consolidate, deepen and extend SCD's text mining expertise
• By combining statistical approaches and domain-specific information
• To support knowledge discovery through literature analysis in various domains:
  • Bio-informatics
  • Knowledge management
  • Mapping of science and technology
  • Bibliometrics

Problem setting
• Given a set of documents, compute a representation, called index
  e.g. <1 0 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>
• to retrieve, summarize, classify or cluster them
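The slide does not say how the index is computed. As a minimal sketch, assuming the simplest reading (one binary term-presence vector per document), it could look like the following; `tokenize`, `build_binary_index` and the sample documents are hypothetical and not taken from the presentation.

```python
import re

def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_binary_index(documents):
    """Return (vocabulary, vectors) with one 0/1 vector per document."""
    # The vocabulary is every distinct token in the collection, sorted.
    vocabulary = sorted({tok for doc in documents for tok in tokenize(doc)})
    vectors = []
    for doc in documents:
        present = set(tokenize(doc))
        # 1 if the term occurs in this document, 0 otherwise.
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors

# Made-up sample documents, for illustration only.
docs = ["gene expression in yeast", "yeast cell cycle", "patent clustering"]
vocabulary, index = build_binary_index(docs)
```

Retrieval, classification and clustering can then operate on these vectors instead of on the raw text. Weighted schemes (for example term frequencies) would replace the 0/1 entries.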

Problem setting - 2
• Text mining goals: from information retrieval, through document analysis & extraction of tokens, to information extraction
• Text mining methodology: from shallow statistics, through shallow parsing, to full NLP parsing
• Overall approach: from problem-specific and domain-specific to generic

Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

Document-centered mining
• Given a set of documents, compute a representation, called index
  e.g. <1 0 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>
• to retrieve, summarize, classify or cluster them

Gene-centered mining
• Given a set of genes (and their literature), compute a representation, called gene index
  e.g. <1 0 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>
• to retrieve, summarize, classify or cluster them
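One plausible way to obtain a gene index from document vectors is to average the vectors of the documents linked to each gene. The slide does not state the aggregation actually used, so averaging is an assumption here, and `gene_index` and the sample identifiers are hypothetical.

```python
def gene_index(doc_vectors, gene_to_docs):
    """Average the term vectors of the documents linked to each gene.

    Averaging is an assumed aggregation; genes without documents are skipped.
    """
    index = {}
    for gene, doc_ids in gene_to_docs.items():
        if not doc_ids:
            continue
        n_terms = len(doc_vectors[doc_ids[0]])
        total = [0.0] * n_terms
        for doc_id in doc_ids:
            for j, value in enumerate(doc_vectors[doc_id]):
                total[j] += value
        index[gene] = [v / len(doc_ids) for v in total]
    return index

# Made-up document vectors and gene-to-literature links.
doc_vectors = {"d1": [1, 0, 0, 1], "d2": [1, 1, 0, 0], "d3": [0, 0, 1, 1]}
gene_to_docs = {"geneA": ["d1", "d2"], "geneB": ["d3"], "geneC": []}
genes = gene_index(doc_vectors, gene_to_docs)
```

The same pattern would apply to the patient index, with patient records in place of a gene's literature.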

Patient-centered mining
• Given a set of patients (and their records), compute a representation, called patient index
  e.g. <1 0 0 1>, <1 1 0 0 0 1>, <0 0 0 1 1 0>
• to retrieve and classify them
• and/or to associate this information with genes

Functional genomics: gene profiling (Bert Coessens)
• Profile documents, genes, … using vocabularies (bag-of-words approach)
  [Figure: a gene represented over the vocabulary terms T1, T2, T3]
• Tailored vocabularies reflect the 'knowledge' of a certain domain:
  + noise reduction (i.e. irrelevant words are left out)
  + direct link with other knowledge bases (e.g. Gene Ontology)
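Profiling against a tailored vocabulary can be sketched as counting only the terms that belong to that vocabulary, which is where the noise reduction comes from. This is an illustration under assumptions: `profile`, the three-term vocabulary and the sample sentence are hypothetical, and only single-word terms are handled.

```python
import re
from collections import Counter

def profile(text, vocabulary):
    """Bag-of-words profile of a text, restricted to a fixed vocabulary."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # Words outside the vocabulary are simply ignored.
    return [counts[term] for term in vocabulary]

# Hypothetical domain vocabulary and text.
vocabulary = ["apoptosis", "kinase", "transcription"]
text = "The kinase regulates apoptosis; the same kinase is studied here."
gene_profile = profile(text, vocabulary)
```

Because every profile is indexed by the same vocabulary, each position can be linked directly to an entry of an external knowledge base when the vocabulary is derived from one.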

Functional Genomics - TXTGate (Bert Coessens; Steven Van Vooren)
[Figure: TXTGate; labels: distance matrix & clustering, other vocabulary]
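The figure only names a distance matrix and clustering. As a rough sketch, and not a description of TXTGate's actual algorithms, the following assumes cosine distance between profiles and a single-linkage cut at a fixed threshold; all function names and the sample profiles are hypothetical.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; zero vectors are treated as maximally distant."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def distance_matrix(vectors):
    """Full pairwise cosine-distance matrix."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]

def single_linkage_clusters(dist, threshold):
    """Merge items connected by a chain of distances <= threshold."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Made-up profiles: the first two point in the same direction.
profiles = [[1, 2, 0], [2, 4, 0], [0, 0, 3]]
dist = distance_matrix(profiles)
clusters = single_linkage_clusters(dist, threshold=0.5)
```

The "other vocabulary" label suggests that the same entities can be re-profiled with a different vocabulary and clustered again; under this sketch that only changes the input vectors.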

Functional genomics – Networks from literature (Bert Coessens; Frizo Janssens)
• gene networks
• term networks
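The slide does not describe how the networks are derived. One common construction, assumed here purely for illustration, links two genes whenever they are mentioned in the same document and weights the edge by the number of such documents. `cooccurrence_network` and the gene identifiers are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(doc_genes):
    """Count, for each gene pair, the documents that mention both genes."""
    edges = Counter()
    for genes in doc_genes:
        # Deduplicate and sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(genes)), 2):
            edges[(a, b)] += 1
    return edges

# Made-up gene mentions per document.
doc_genes = [["geneA", "geneB"], ["geneB", "geneA", "geneC"], ["geneC"]]
network = cooccurrence_network(doc_genes)
```

A term network follows from the same function by passing the terms of each document instead of its gene mentions.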

Human genetics (Steven Van Vooren)
• Collaboration with Human Genetics Centre @ University Hospital KU Leuven
  • Mining on clinical profile and chromosomal footprint of patients (CGH microarrays)
  • Knowledge discovery for genomic annotation
  • Aiming at tools and standards for reporting, data entry and visualisation that support experts in exploring hypotheses linking phenotypes to genotypes and in the inference of novel gene candidates
[Figure labels: Data Analysis; Text Analysis; NLP; Ontologies]

Human genetics (Steven Van Vooren)
• Knowledge discovery for genomic annotation
  • From µA-CGH profiles
  • From biomedical text
• Similarity measures for biomedical text
  • what: patient records, literature, genes, loci, clones
  • why: retrieval, clustering, inference
  • Clustering similar patients, genes, loci, documents
  • Finding genes associated by patient records
• Extracting entities from text
  • gene name symbols, loci, diseases, phenotypes, clinical entities, karyotypes
• Text summarization
  • Profiling of patients, genes, loci, clones, and clusters of these

Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

McKnow Project (Dries Van Dromme; Frizo Janssens)
Automated and User-oriented Methods and algorithms for knowledge management
• Collaboration with Center for Industrial Management, KUL
• Clustering and classification are focal points, as is scalability, because of the huge corpora of data available nowadays. We incorporate user profiles, and as such regard both users and documents as points in a high-dimensional vector space. Furthermore, as environments are typically dynamic, care is taken that the methods used are easily updatable.
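Treating users and documents as points in the same vector space means a user profile can be compared with documents directly. The sketch below assumes cosine similarity for ranking and an exponential moving average as a cheap incremental profile update; neither choice is confirmed by the slides, and all names and values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(user_profile, doc_vectors):
    """Document ids ordered from most to least similar to the user profile."""
    scores = {doc_id: cosine_similarity(user_profile, vec)
              for doc_id, vec in doc_vectors.items()}
    return sorted(scores, key=scores.get, reverse=True)

def update_profile(user_profile, doc_vector, weight=0.2):
    """Shift the profile toward a newly read document (assumed update rule)."""
    return [(1 - weight) * p + weight * d
            for p, d in zip(user_profile, doc_vector)]

# Made-up user profile and documents over a three-term vocabulary.
user = [1.0, 0.0, 0.0]
docs = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 1, 0]}
ranking = rank_documents(user, docs)
updated = update_profile(user, docs["b"])
```

An update of this kind touches only one profile vector, which is the sort of property the slide's "easily updatable" requirement points at.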

Case studies knowledge management (Dries Van Dromme)
Dimensionality of clustered text-mining cases (documents x terms):
• sista papers: electronically available publications (ps, pdf), full text
  • 1024 x 49,237
• De Standaard: full-text newspaper articles, but a lot of them very short
  • 1776 x 39,363, but much more data available
• kuleuven papers: electronically available papers pertaining to researchers from different departments (pdf, word, ...)
  • 576 x 68,257: fewer documents, broader spectrum
• patent abstracts: international patent abstracts and titles
  • 16,488 x 21,019: a lot more documents, denser spectrum
• PMA papers: full-text publications of the K.U.Leuven dept. of Mechanics
  • 380 x 18,206
• Locuslink "known genes with proteins": gene documents from MEDLINE abstracts
  • 12,263 x 58,924

Overview
• Bio-informatics
• Knowledge management
• Bibliometrics & scientometrics

Scope
• Bibliometrics: the application of mathematical and statistical methods to books and other media of communication
• Scientometrics: the application of those quantitative methods which deal with the analysis of science viewed as an information process
• Patent analysis and mining: the analysis of patent information is considered to be one of the best established, directly available and historically reliable methods of quantifying the output of a science and technology system
• Collaboration with Steunpunt O&O Statistieken: "to consolidate and to further develop Flanders' position as a European innovation-intensive region"

Projects (Dries Van Dromme; Frizo Janssens)
• 1. Domain analysis
  • Mapping of the nanotechnology field from USPTO/EPO patents
    • Text-based clustering; identification of sub-domains
    • comparison with IPC (International Patent Classification)
    • comparison with FTC (Fraunhofer Technology Classification)
• 2. Science-technology mapping
  • link scientific publications (WoS) and new technologies (patents)
    • text-based clustering & analysis of citation network structure
    • Case study: Ljung
• 3. Trend detection
  • assess trends & emerging fields from "change over time" in structure and characterization of clusters & citation network
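The comparison of a text-based clustering with an existing classification such as IPC or FTC can be quantified in several ways; the slides do not say which measure was used. As one hedged example, cluster purity gives the fraction of patents that fall in the majority class of their cluster. `purity` and the labels below are hypothetical.

```python
from collections import Counter, defaultdict

def purity(cluster_labels, class_labels):
    """Fraction of items belonging to the majority class of their cluster."""
    if not cluster_labels or len(cluster_labels) != len(class_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    by_cluster = defaultdict(list)
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster[cluster].append(cls)
    # For each cluster, count the items carrying its most frequent class.
    majority_total = sum(Counter(classes).most_common(1)[0][1]
                         for classes in by_cluster.values())
    return majority_total / len(cluster_labels)

# Made-up cluster assignments and reference classes for six patents.
clusters = [0, 0, 0, 1, 1, 2]
reference = ["A", "A", "B", "B", "B", "C"]
score = purity(clusters, reference)
```

Purity rewards many small clusters, so in practice it is usually read alongside the number of clusters or a pair-counting measure.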

Software
• Preprocessing
  • Lucene & TextPack
• Search & indexing engine and webservices
  • TXTGate and McKnow

Publications: targeted submissions by Dec
• Bio-informatics (1-2)
  • (BMC) Bioinformatics, special issues, .. (BC)
  • More biological journals (BC, SVV)
• Knowledge management (1)
• Bibliometrics & scientometrics (1)
• Venues named: Scientometrics; SIAM DM; Case study Bioinformatics; Trends in ..; IEEE transactions, engineering, webmining journals
• High, moderate, fair impact

Collaborations
• Formalized
  • GBOU-McKnow
    • partner CIB, led by Joost Duflou (Joris Vertommen, Dries Cleymans)
    • User Committee (ICMS, Verhaert, LMS, TriSoft, WTCM)
  • IWT with Joris V (Steven: complete/correct)
  • Steunpunt O&O Statistieken, INCENTIM
    • Patent clustering and detection of emerging trends
• Informal
  • M-F Moens (SBO?)
  • IBM – Bart VL
  • Gasthuisberg and Peter M: TXTGate as a 'course'
  • J&J