  • Number of slides: 38

Washington D.C., October 25, 2005
Semi-Automatic Indexing of Full Text Biomedical Articles
Clifford W. Gay
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

Acknowledgments
  • Alan R. Aronson, Ph.D.
  • Mehmet Kayaalp, M.D., Ph.D.

Outline
  • Introduction
      • The System: Medical Text Indexer (MTI)
      • The Data: Online biomedical journals
      • The Task: Emulate Medline indexing using full text
  • Results
      • Observations on PubMed Central articles
      • Model selection results
      • Recent work

Introduction
  • The System: Medical Text Indexer (MTI)

Why Semi-Automatic Indexing?
  • U.S. National Library of Medicine indexes 5000 journal titles
      • Supports over 60 million PubMed searches each month
      • Has 130 indexers
      • Indexed 570,000 articles in 2004
          • Will need to index 1,000 very soon
      • Automated support is helping to meet this demand: MTI was used on 26% of articles in 2004
  • More about MTI
      • Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004;11(Pt 1):268-72. PMID: 15360816

Medical Text Indexer (MTI)
[Diagram; labels in slide order: Title + Abstract; Phrasex; Phrases; Trigram Phrase Matching; PubMed Related Citations; MetaMap; UMLS Concepts; Rel. Cits.; Restrict to MeSH; Extract MeSH Headings; Postprocessing; Ordered list of MeSH Terms]

DCMS with MTI Suggestions

Introduction
  • The Data: Online biomedical journals

Why Full Text?
  • Medical Text Indexer uses article title and abstract
  • However
      • Human indexers are taught not to use the abstract
      • Author's complete intent may not be in the abstract
      • Check tags may only appear in a table or methods section
  • If MTI indexes from full text articles it may
      • Find central concepts missing from the abstract
      • Identify terms when an article has no abstract
      • More accurately select check tags
      • Be in better compliance with indexing policy

Test Collection Selection
  • Available online from PubMed Central
  • Consistent XML format
      • Identifies title, abstract, sections, tables, figures, references, etc.
  • 500 articles from 17 diverse biomedical journals
  • Did not use:
      • References
      • Graphics
      • Math

Test Collection
  • 5 Clinical journals (165):
      • Breast Cancer Research (11)
      • Journal of Clinical Microbiology (80)
  • 3 Organization based journals (28):
      • Journal of American Medical Informatics Assoc. (10)
      • Proceedings of the National Academy of Sciences (11)
  • 9 Journals in other categories: Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22)

Introduction
  • The Task: Emulate Medline indexing using full text

Indexing Task

Example Article

Medline Indexing
  • beta-Lactamases/*genetics/*metabolism
  • Enterobacteriaceae/drug effects/*enzymology/genetics
  • Plasmids/*genetics
  • Genes, Bacterial/genetics
  • Genotype
  • Kinetics
  • Microbial Sensitivity Tests
  • Molecular Sequence Data
  • Research Support, Non-U.S. Gov't

MTI Indexing
  • beta-Lactamases
  • Plasmids
  • Enterobacteriaceae
  • beta-Lactam Resistance
  • Conjugation, Genetic
  • Cephalosporin Resistance
  • Cefotaxime
  • Nucleotide Sequences
  • Molecular Sequence Data
  • Cephalosporins
  • Chromosomes, Bacterial
  • DNA Transposable Elements
  • Escherichia coli
  • Genes, Bacterial
  • Cloning, Molecular
  • Klebsiella pneumoniae
  • Amino Acid Sequence
  • Microbial Sensitivity Tests
  • Cephalothin
  • Proteus mirabilis
  • Erwinia
  • Salmonella typhimurium
  • Enterobacteriaceae Infections
  • Lactams

Legend (source of each MTI term): MMI; REL; MMI & REL
Recall = 0.67
Precision = 0.24
F2 measure = 0.492

Evaluation
  • F2 Measure
      • Weighted harmonic mean of Recall and Precision
      • Weights Recall twice as important as Precision
      • Values: 0.0 to 1.0
  • Computed for each article and averaged
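
The F2 measure described on this slide can be sketched with the standard F-beta formula, F = (1 + β²)·P·R / (β²·P + R), with β = 2. The slides do not spell out the formula, so its use here is an assumption, although it reproduces the Example Article figure. The function names are illustrative, and reading the rounded values 0.67 and 0.24 as 6 of 9 Medline headings matched among 25 suggestions is also an assumption.

```python
from statistics import mean
from typing import Iterable, Tuple


def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Weighted harmonic mean of precision and recall (beta=2 favors recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # no correct terms at all
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def average_f2(per_article: Iterable[Tuple[float, float]]) -> float:
    """Compute F2 for each (precision, recall) pair, then average over articles."""
    return mean(f_beta(p, r) for p, r in per_article)


# Example Article, assuming 6 matching headings out of 9 (Medline) and 25 (MTI).
example_f2 = f_beta(precision=6 / 25, recall=6 / 9)
print(round(example_f2, 3))  # 0.492
```

Because the score is computed per article and then averaged, the average F2 of a collection is generally not the F2 of its average recall and average precision.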

Results
  • Observations on PubMed Central articles

Section Header Classes
  • Semantically equivalent section headers
  • MATERIALS AND METHODS class:
      • Materials and Method(s)
      • Scoring Methods
      • Experimental Procedures
      • Other Methods Tested
  • CAPTIONS class:
      • the titles and captions from tables and figures
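
Grouping semantically equivalent headers into classes could look like the lookup below. This is a minimal sketch, not the system's actual code: the table holds only the MATERIALS AND METHODS variants listed on the slide, the normalization rules are an assumption, and the fallback classes OTHER and NO HEADER are taken from class names that appear on later slides.

```python
import re
from typing import Optional

# Hypothetical lookup table; only these header variants come from the slide.
HEADER_CLASSES = {
    "materials and methods": "MATERIALS AND METHODS",
    "materials and method": "MATERIALS AND METHODS",
    "scoring methods": "MATERIALS AND METHODS",
    "experimental procedures": "MATERIALS AND METHODS",
    "other methods tested": "MATERIALS AND METHODS",
}


def normalize(header: str) -> str:
    """Collapse whitespace, drop trailing '.' or ':', and lowercase."""
    return re.sub(r"\s+", " ", header).strip().rstrip(".:").lower()


def classify_header(header: Optional[str]) -> str:
    """Map a raw section header to its section class."""
    if header is None or not header.strip():
        return "NO HEADER"  # section text without any header
    return HEADER_CLASSES.get(normalize(header), "OTHER")


print(classify_header("Experimental Procedures"))  # MATERIALS AND METHODS
print(classify_header("Bacterial strains."))       # OTHER
```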

Section Class Performance

Section Class     Average F2
CAPTIONS          0.3175
ABSTRACT          0.2960
INTRODUCTION      0.2869
RESULTS           0.2790
DISCUSSION        0.2734
NO HEADER         0.2574
…                 …
CONCLUSIONS       0.1961
ABBREVIATIONS     0.1304

Results
  • Model selection results

Experiments
  • Varied MTI components used
      • MetaMap Indexing (MMI)
      • Related Citations (REL)
  • Varied section classes processed
      • Used model selection
      • Used binary weighting for sections
  • A model is
      • A selection of section classes and
      • The text in those sections
      • That represents the article
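
A model in this sense can be sketched as a set of section classes with binary weights: a section's text is either included in the article representation or left out. The slides do not say which model selection procedure was used, so the greedy forward search below is purely an assumption for illustration, and `score` is a hypothetical callback standing in for "run the indexer on the collection and return average F2".

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Article = List[Tuple[str, str]]  # (section class, section text) pairs


def represent(article: Article, model: FrozenSet[str]) -> str:
    """Binary weighting: keep the text of selected section classes only."""
    return "\n".join(text for cls, text in article if cls in model)


def greedy_model_selection(
    classes: Iterable[str],
    score: Callable[[FrozenSet[str]], float],
) -> FrozenSet[str]:
    """Repeatedly add the single section class that most improves the score."""
    model: FrozenSet[str] = frozenset()
    best = score(model)
    while True:
        best_candidate = None
        for cls in sorted(set(classes) - model):
            candidate = model | {cls}
            value = score(candidate)
            if value > best:
                best, best_candidate = value, candidate
        if best_candidate is None:
            return model  # no single addition helps any more
        model = best_candidate


article = [("TITLE+ABSTRACT", "a"), ("DISCUSSION", "b"), ("CAPTIONS", "c")]
print(represent(article, frozenset({"TITLE+ABSTRACT", "CAPTIONS"})))
```

A greedy search evaluates far fewer models than trying every subset of section classes, at the cost of possibly missing the best combination.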

Production Baseline
[Diagram; labels in slide order: MMI; Title+Abstract; REL]
F2 = 0.457

Naive Mode
[Diagram; labels in slide order: Title+Abstract; Materials and Methods; MMI; Results and Discussion; REL; No Header; All Section Classes]
F2 = 0.453 (-0.9%)

MetaMap Indexing Mode
[Diagram; labels in slide order: Title+Abstract; Captions; Introduction; MMI; Results; Discussion; REL; Other; No Header]
F2 = 0.373 (-18.4%)

Augmented Mode
[Diagram; labels in slide order: Captions; Introduction; MMI; Results; Discussion; Other; REL; No Header; Title+Abstract]
F2 = 0.475 (+3.9%)

Refined Augmented Mode
[Diagram; labels in slide order: Captions; Results; MMI; Background; Title+Abstract; REL]
F2 = 0.485 (+6.1%)

Full MTI Mode
[Diagram; labels in slide order: Title+Abstract; Captions; Introduction; MMI; Results; Discussion; Other; No Header; MMI model; REL]
F2 = 0.488 (+6.8%)

Refined Full MTI
[Diagram; labels in slide order: Title+Abstract; Captions; MMI; Results and Discussion; REL; Conclusions; No Header]
F2 = 0.491 (+7.4%)

MTI Performance Summary

Indexing Model                           Recall   Precision   Avg. F2
Production Baseline (Ti, Ab)             0.53     0.32        0.457
Naive Mode (full text)                   0.57     0.27        0.453
Augmented Mode (MMI + REL (Ti, Ab))      0.59     0.29        0.475
Augmented Mode (refined)                 0.60     0.30        0.485
Full MTI (MMI + REL common sections)     0.60     0.30        0.488
Full MTI (refined)                       0.60     0.31        0.491
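
The percentages quoted on the individual model slides appear to be relative changes in average F2 against the production baseline of 0.457. That reading is an assumption, but the short check below reproduces them from the table values. The dictionary and function names are illustrative.

```python
BASELINE_F2 = 0.457  # Production Baseline (title + abstract)

# Average F2 values copied from the summary table.
AVG_F2 = {
    "Naive Mode (full text)": 0.453,
    "Augmented Mode (MMI + REL (Ti, Ab))": 0.475,
    "Augmented Mode (refined)": 0.485,
    "Full MTI (MMI + REL common sections)": 0.488,
    "Full MTI (refined)": 0.491,
}


def relative_change_pct(value: float, baseline: float = BASELINE_F2) -> float:
    """Percentage change of `value` relative to `baseline`."""
    return 100.0 * (value - baseline) / baseline


for name, f2 in AVG_F2.items():
    print(f"{name}: {f2:.3f} ({relative_change_pct(f2):+.1f}%)")
```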

Results
  • Recent work

Improvement Potential
  • With current model
      • No cut off at 25 terms yields maximum recall of 0.79
  • If all good terms prioritized correctly
      • F2 = 0.64
      • Improvement over baseline: from 7% to 40%

Increase REL Citations
  • MTI currently uses 10 Related Citations
  • Optimal number for full text articles is 15
  • Best model confirmed for this setting
  • Additional improvement in F2 = 0.01

Summarization
  • Selecting important text before MTI processing
  • Using Yeh, Ke, Yang, Meng approach
  • Combines
      • Latent Semantic Analysis and
      • Salton's Text Relationship Map
  • Start with current model
  • Document representation includes
      • Bag of words
      • MetaMap identified concepts

NLM Indexing Initiative
Contact: cliff@nlm.nih.gov
Web: ii.nlm.nih.gov/fulltext.shtml
Clifford W. Gay
Lister Hill National Center for Biomedical Communications
Bethesda, Maryland - USA

NONE Sections
  • Most appear in articles that have no abstract
      • 20/23
  • Some are errors
      • 4 have "Introduction" header in publisher version
      • 2 appear within other sections with headers
  • Many contain the primary text of the article
      • Comments, Editorials, Letters (11/23)

Other Sections
  • Other section class has 525 sections (16%)
  • Non-standard article organization
      • Common in Review articles
  • Example
      • β-Lactamases of Kluyvera ascorbata, Probable Progenitors of Some Plasmid-Encoded CTX-M Types
          • Bacterial strains.
          • Antimicrobial agents and susceptibility testing.
          • Kinetic and IEF analyses.
          • Genetic characterization of blaKLUA.
          • Genetic environment of blaKLUA-1.
          • Arguments for mobilization of chromosomal blaKLUA gene.

Ranking Function
  • Made ranking function for Related Citations more like MetaMap Indexing
  • Resulted in a more inclusive model
      • Materials and Methods
      • Introduction
  • F2 measure = 0.4865

Tuning Path Weight
  • Ratio of weights between the two indexing paths
      • MetaMap Indexing: 7
      • Related Citations: 2
  • No improvement possible

Partial Weight for Singleton Headers
  • OTHER section class
      • Header is unique
      • Contain content terms
  • Gave section class weight between 0 and 1
      • Some recall improvement
      • No collection wide improvement in F2