b45a879c69181c6e741753bdd78c3687.ppt
- Number of slides: 38
Semi-Automatic Indexing of Full Text Biomedical Articles
Clifford W. Gay
Lister Hill National Center for Biomedical Communications, Bethesda, Maryland, USA
Washington, D.C., October 25, 2005

Acknowledgments
- Alan R. Aronson, Ph.D.
- Mehmet Kayaalp, M.D., Ph.D.

Outline
- Introduction
  - The System: Medical Text Indexer (MTI)
  - The Data: Online biomedical journals
  - The Task: Emulate Medline indexing using full text
- Results
  - Observations on PubMed Central articles
  - Model selection results
  - Recent work

Why Semi-Automatic Indexing?
- The U.S. National Library of Medicine indexes 5,000 journal titles
  - Supports over 60 million PubMed searches each month
  - Has 130 indexers
  - Indexed 570,000 articles in 2004
    - Will need to index 1,000 very soon
  - Automated support is helping to meet this demand
    - MTI was used on 26% of articles in 2004
- More about MTI
  - Aronson AR, Mork JG, Gay CW, Humphrey SM, Rogers WJ. The NLM Indexing Initiative's Medical Text Indexer. Medinfo. 2004;11(Pt 1):268-72. PMID: 15360816

Medical Text Indexer (MTI)
[Diagram labels: Title + Abstract; Phrasex; Phrases; Trigram Phrase Matching; PubMed Related Citations; MetaMap; UMLS Concepts; Rel. Cits.; Restrict to MeSH; Extract MeSH Headings; Postprocessing; Ordered list of MeSH Terms]

DCMS with MTI Suggestions

Why Full Text?
- Medical Text Indexer uses the article title and abstract
- However
  - Human indexers are taught not to use the abstract
  - The author's complete intent may not be in the abstract
  - Check tags may appear only in a table or methods section
- If MTI indexes from full text articles it may
  - Find central concepts missing from the abstract
  - Identify terms when an article has no abstract
  - Select check tags more accurately
  - Comply better with indexing policy

Test Collection Selection
- Available online from PubMed Central
- Consistent XML format
  - Identifies title, abstract, sections, tables, figures, references, etc.
- 500 articles from 17 diverse biomedical journals
- Did not use:
  - References
  - Graphics
  - Math

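A consistent XML format makes it straightforward to pull out the title, abstract, and sections while leaving references aside. The sketch below is illustrative only: the element names (`article-title`, `abstract`, `body`, `sec`, `title`, `p`, `ref-list`) and the sample document are assumptions made for the example, not a description of the actual PubMed Central schema or of the preprocessing used in this work.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified article; real articles are far richer.
SAMPLE = """<article>
  <front>
    <article-title>Sample title</article-title>
    <abstract><p>Sample abstract.</p></abstract>
  </front>
  <body>
    <sec><title>Introduction</title><p>Intro text.</p></sec>
    <sec><title>Materials and Methods</title><p>Methods text.</p></sec>
  </body>
  <back><ref-list><ref>Ignored reference.</ref></ref-list></back>
</article>"""


def _flat(elem):
    """Return all text under an element with whitespace normalized."""
    return " ".join("".join(elem.itertext()).split())


def extract_sections(xml_text):
    """Return (header, text) pairs for title, abstract, and body sections.

    Only top-level body sections are visited, so the reference list
    (assumed to live outside the body) is never read.
    """
    root = ET.fromstring(xml_text)
    sections = [("TITLE", root.findtext(".//article-title", default=""))]
    abstract = root.find(".//abstract")
    if abstract is not None:
        sections.append(("ABSTRACT", _flat(abstract)))
    for sec in root.findall("./body/sec"):
        header = sec.findtext("title", default="")
        text = " ".join(_flat(p) for p in sec.findall("p"))
        sections.append((header, text))
    return sections
```

Calling `extract_sections(SAMPLE)` yields one pair per part of the article, which is the shape the later section-class experiments would need.
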
Test Collection
- 5 clinical journals (165):
  - Breast Cancer Research (11)
  - Journal of Clinical Microbiology (80)
- 3 organization-based journals (28):
  - Journal of the American Medical Informatics Association (10)
  - Proceedings of the National Academy of Sciences (11)
- 9 journals in other categories: Pharmacology (65); Biochemistry (65); Plants (46); Molecular Biology (45); Learning (30); Hospitals (22)

Indexing Task

Example Article
- Medline Indexing
  - beta-Lactamases/*genetics/*metabolism; Enterobacteriaceae/drug effects/*enzymology/genetics; Plasmids/*genetics; Genes, Bacterial/genetics; Genotype; Kinetics; Microbial Sensitivity Tests; Molecular Sequence Data; Research Support, Non-U.S. Gov't
- MTI Indexing
  - beta-Lactamases; Plasmids; Enterobacteriaceae; beta-Lactam Resistance; Conjugation, Genetic; Cephalosporin Resistance; Cefotaxime; Nucleotide Sequences; Molecular Sequence Data; Cephalosporins; Chromosomes, Bacterial; DNA Transposable Elements; Escherichia coli; Genes, Bacterial; Cloning, Molecular; Klebsiella pneumoniae; Amino Acid Sequence; Microbial Sensitivity Tests; Cephalothin; Proteus mirabilis; Erwinia; Salmonella typhimurium; Enterobacteriaceae Infections; Lactams
- Key to the source of each MTI term: MMI; REL; MMI & REL
- Recall = 0.67, Precision = 0.24, F2 measure = 0.492

Evaluation
- F2 Measure
  - Weighted harmonic mean of Recall and Precision
  - Weights Recall twice as important as Precision
  - Values: 0.0 to 1.0
- Computed for each article and averaged

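For reference, the standard F-beta measure with beta = 2 is F2 = 5 * P * R / (4 * P + R), where P is precision and R is recall. The sketch below computes it per article and averages over a collection. The function names are illustrative, and the counts used in the example (6 matching terms out of 9 Medline headings and 25 MTI suggestions) are an assumption chosen to be consistent with the rounded recall and precision reported for the example article.

```python
def f_beta(precision, recall, beta=2.0):
    """Weighted harmonic mean of precision and recall (beta=2 favors recall)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score_article(suggested, reference, beta=2.0):
    """Return (precision, recall, F-beta) for one article's suggested terms."""
    suggested, reference = set(suggested), set(reference)
    hits = len(suggested & reference)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall, f_beta(precision, recall, beta)


def mean_f2(articles):
    """Average per-article F2 over (suggested, reference) pairs."""
    scores = [score_article(s, r)[2] for s, r in articles]
    return sum(scores) / len(scores) if scores else 0.0


# Assumed counts: 6 hits, 25 suggestions, 9 reference headings.
example_f2 = f_beta(6 / 25, 6 / 9)  # about 0.492
```

Because the measure is computed per article and then averaged, the collection-level F2 generally differs from the F2 of the averaged recall and precision.
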
Section Header Classes
- Semantically equivalent section headers
- MATERIALS AND METHODS class:
  - Materials and Method(s)
  - Scoring Methods
  - Experimental Procedures
  - Other Methods Tested
- CAPTIONS class:
  - The titles and captions from tables and figures

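One way to group semantically equivalent headers is a small set of keyword rules. The sketch below is a hypothetical illustration: the patterns, their order, and the `classify_header` function are assumptions, since the slides give example headers but not the mapping procedure that was actually used.

```python
import re

# Hypothetical keyword rules, checked in order; first match wins.
CLASS_PATTERNS = [
    ("MATERIALS AND METHODS", re.compile(r"\b(methods?|procedures?)\b", re.I)),
    ("RESULTS AND DISCUSSION", re.compile(r"\bresults\b.*\bdiscussion\b", re.I)),
    ("INTRODUCTION", re.compile(r"\bintroduction\b", re.I)),
    ("RESULTS", re.compile(r"\bresults\b", re.I)),
    ("DISCUSSION", re.compile(r"\bdiscussion\b", re.I)),
    ("CONCLUSIONS", re.compile(r"\bconclusions?\b", re.I)),
]


def classify_header(header):
    """Map a raw section header to a section class name."""
    if not header or not header.strip():
        return "NO HEADER"  # section text with no header at all
    for class_name, pattern in CLASS_PATTERNS:
        if pattern.search(header):
            return class_name
    return "OTHER"  # header matched none of the known classes
```

With these rules, all four example headers listed above fall into MATERIALS AND METHODS, while an article-specific header such as "Kinetic and IEF analyses." lands in OTHER.
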
Section Class Performance
Section Class    Average F2
CAPTIONS         0.3175
ABSTRACT         0.2960
INTRODUCTION     0.2869
RESULTS          0.2790
DISCUSSION       0.2734
NO HEADER        0.2574
…                …
CONCLUSIONS      0.1961
ABBREVIATIONS    0.1304

Experiments
- Varied MTI components used
  - MetaMap Indexing (MMI)
  - Related Citations (REL)
- Varied section classes processed
  - Used model selection
  - Used binary weighting for sections
- A model is
  - A selection of section classes and
  - The text in those sections
  - That represents the article

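The idea of a model as a binary selection of section classes can be sketched as follows. This is a minimal illustration under stated assumptions: the exhaustive subset search, the `evaluate` callback, and the toy per-class gains are all hypothetical, and the slides do not say which search procedure was used for model selection.

```python
from itertools import combinations


def build_text(sections, model):
    """Concatenate the text of sections whose class is in the model.

    Binary weighting: a section class is either fully included or ignored.
    """
    return " ".join(text for section_class, text in sections if section_class in model)


def select_model(section_classes, evaluate):
    """Return the subset of section classes with the best evaluation score.

    Exhaustive search; only practical for a small number of classes.
    """
    best_model, best_score = frozenset(), float("-inf")
    for size in range(1, len(section_classes) + 1):
        for subset in combinations(sorted(section_classes), size):
            model = frozenset(subset)
            score = evaluate(model)
            if score > best_score:
                best_model, best_score = model, score
    return best_model, best_score


# Toy stand-in for "run the indexer and compute average F2" (made-up numbers).
TOY_GAIN = {"TITLE+ABSTRACT": 0.40, "CAPTIONS": 0.05, "ABBREVIATIONS": -0.03}


def toy_evaluate(model):
    return sum(TOY_GAIN[c] for c in model)


toy_model, toy_score = select_model(TOY_GAIN.keys(), toy_evaluate)
```

In practice `evaluate` would run the indexer on the text produced by `build_text` for every article and return the average F2, so each candidate model costs a full pass over the test collection.
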
Production Baseline
[Diagram labels: MMI; Title+Abstract; REL]
- F2 = 0.457

Naive Mode
[Diagram labels: Title+Abstract; Materials and Methods; MMI; Results and Discussion; REL; No Header; All Section Classes]
- F2 = 0.453 (-0.9%)

MetaMap Indexing Mode
[Diagram labels: Title+Abstract; Captions; Introduction; MMI; Results; Discussion; REL; Other; No Header]
- F2 = 0.373 (-18.4%)

Augmented Mode
[Diagram labels: Captions; Introduction; MMI; Results; Discussion; Other; REL; No Header; Title+Abstract]
- F2 = 0.475 (+3.9%)

Refined Augmented Mode
[Diagram labels: Captions; Results; MMI; Background; Title+Abstract; REL]
- F2 = 0.485 (+6.1%)

Full MTI Mode
[Diagram labels: Title+Abstract; Captions; Introduction; MMI; Results; Discussion; Other; No Header; MMI model; REL]
- F2 = 0.488 (+6.8%)

Refined Full MTI
[Diagram labels: Title+Abstract; Captions; MMI; Results and Discussion; REL; Conclusions; No Header]
- F2 = 0.491 (+7.4%)

MTI Performance Summary
Indexing Model                           Recall  Precision  Avg. F2
Production Baseline (Ti, Ab)             0.53    0.32       0.457
Naive Mode (full text)                   0.57    0.27       0.453
Augmented Mode (MMI + REL (Ti, Ab))      0.59    0.29       0.475
Augmented Mode (refined)                 0.60    0.30       0.485
Full MTI (MMI + REL common sections)     0.60    0.30       0.488
Full MTI (refined)                       0.60    0.31       0.491

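The percentages quoted for each mode are consistent with relative change in average F2 against the production baseline of 0.457. The short check below reproduces them from the F2 values reported in this deck; the names `BASELINE_F2`, `F2_BY_MODEL`, and `relative_change` are illustrative.

```python
BASELINE_F2 = 0.457

# Average F2 values as reported for each mode.
F2_BY_MODEL = {
    "Naive Mode": 0.453,
    "MetaMap Indexing Mode": 0.373,
    "Augmented Mode": 0.475,
    "Refined Augmented Mode": 0.485,
    "Full MTI Mode": 0.488,
    "Refined Full MTI": 0.491,
}


def relative_change(f2, baseline=BASELINE_F2):
    """Percentage change of an F2 value relative to the baseline."""
    return 100.0 * (f2 - baseline) / baseline


CHANGES = {name: round(relative_change(f2), 1) for name, f2 in F2_BY_MODEL.items()}
```

Note that the average F2 column cannot be recomputed from the recall and precision columns, because F2 is computed per article and then averaged.
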
Improvement Potential
- With current model
  - No cutoff at 25 terms yields maximum recall of 0.79
- If all good terms were prioritized correctly, F2 = 0.64
  - Improvement over baseline would rise from 7% to 40%

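One reading of "all good terms prioritized correctly" is an oracle ranking: every correct term among the candidates is moved to the top before the list is cut at 25 terms. The sketch below computes F2 under that reading. It is an interpretation, not the authors' stated procedure; `oracle_f2`, the candidate list, and the reference set are hypothetical.

```python
def f2(precision, recall):
    """F-beta with beta = 2."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    return 5 * precision * recall / (4 * precision + recall)


def oracle_f2(candidates, reference, cutoff=25):
    """F2 after moving every correct candidate to the top and cutting the list.

    `candidates` is the full ranked list before any cutoff.
    """
    reference = set(reference)
    # sorted() is stable: correct terms (key False) come first, order otherwise kept.
    ranked = sorted(candidates, key=lambda term: term not in reference)
    kept = ranked[:cutoff]
    hits = sum(1 for term in kept if term in reference)
    precision = hits / len(kept) if kept else 0.0
    recall = hits / len(reference) if reference else 0.0
    return f2(precision, recall)


# Toy case: the three correct candidates sit below the cutoff in the original ranking.
toy_candidates = [f"t{i}" for i in range(30)]
toy_reference = {"t26", "t27", "t28", "missing"}
toy_oracle = oracle_f2(toy_candidates, toy_reference)
```

Under this reading precision stays bounded by the fixed list length, which is why the ceiling on F2 remains well below the maximum recall.
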
Increase REL Citations
- MTI currently uses 10 Related Citations
- Optimal number for full text articles is 15
- Best model confirmed for this setting
- Additional improvement in F2 = 0.01

Summarization
- Selecting important text before MTI processing
- Using Yeh, Ke, Yang, Meng approach
- Combines
  - Latent Semantic Analysis and
  - Salton's Text Relationship Map
- Start with current model
- Document representation includes
  - Bag of words
  - MetaMap identified concepts

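As a rough illustration of the text relationship map idea only, the sketch below links sentences whose bag-of-words cosine similarity reaches a threshold and keeps the sentences with the most links. It deliberately omits the Latent Semantic Analysis step and MetaMap concepts; the threshold, the whitespace tokenizer, and the function names are assumptions, and this is not the cited approach as implemented in this work.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def bushiest_sentences(sentences, threshold=0.2, top_n=2):
    """Keep the top_n sentences with the most similarity links.

    Two sentences are linked when their cosine similarity is at least
    `threshold` (an arbitrary value chosen for this sketch).
    """
    vectors = [Counter(s.lower().split()) for s in sentences]
    degree = [0] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                degree[i] += 1
                degree[j] += 1
    # Rank by link count, breaking ties by original position.
    ranked = sorted(range(len(sentences)), key=lambda i: (-degree[i], i))
    return [sentences[i] for i in sorted(ranked[:top_n])]


TOY_SENTENCES = [
    "plasmid genes encode beta-lactamases",
    "beta-lactamases confer resistance",
    "plasmid genes spread resistance",
    "the weather was mild",
]
```

The selected sentences, returned in document order, would then stand in for the full section text passed to the indexer.
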
NLM Indexing Initiative
- Contact: cliff@nlm.nih.gov
- Web: ii.nlm.nih.gov/fulltext.shtml
Clifford W. Gay, Lister Hill National Center for Biomedical Communications, Bethesda, Maryland, USA

NONE Sections
- Most appear in articles that have no abstract
  - 20/23
- Some are errors
  - 4 have "Introduction" header in publisher version
  - 2 appear within other sections with headers
- Many contain the primary text of the article
  - Comments, Editorials, Letters (11/23)

Other Sections
- Other section class has 525 sections (16%)
- Non-standard article organization
  - Common in Review articles
- Example
  - β-Lactamases of Kluyvera ascorbata, Probable Progenitors of Some Plasmid-Encoded CTX-M Types
    - Bacterial strains.
    - Antimicrobial agents and susceptibility testing.
    - Kinetic and IEF analyses.
    - Genetic characterization of blaKLUA.
    - Genetic environment of blaKLUA-1.
    - Arguments for mobilization of chromosomal blaKLUA gene.

Ranking Function
- Made ranking function for Related Citations more like MetaMap Indexing
- Resulted in a more inclusive model
  - Materials and Methods
  - Introduction
- F2 measure = 0.4865

Tuning Path Weight
- Ratio of weights between the two indexing paths
  - MetaMap Indexing: 7
  - Related Citations: 2
- No improvement possible

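To make the role of the path weights concrete, the sketch below merges per-term scores from the two paths with a simple linear combination using the 7:2 ratio. The linear form, the `combine_paths` function, and the toy scores are assumptions for illustration; the slides give only the weight ratio, not how MTI combines the paths internally.

```python
def combine_paths(mmi_scores, rel_scores, w_mmi=7.0, w_rel=2.0, cutoff=25):
    """Rank terms by a weighted sum of their scores from the two paths.

    Terms missing from one path contribute 0 from that path.
    """
    terms = set(mmi_scores) | set(rel_scores)
    combined = {
        term: w_mmi * mmi_scores.get(term, 0.0) + w_rel * rel_scores.get(term, 0.0)
        for term in terms
    }
    # Highest combined score first; ties broken alphabetically for determinism.
    ranked = sorted(combined, key=lambda term: (-combined[term], term))
    return ranked[:cutoff]


# Made-up scores, only to show the mechanics.
toy_mmi = {"Plasmids": 0.5, "Kinetics": 0.1}
toy_rel = {"Kinetics": 0.9, "Genotype": 0.6}
toy_ranking = combine_paths(toy_mmi, toy_rel)
```

Tuning the path weight then amounts to re-running the evaluation for different values of `w_mmi` and `w_rel` and comparing average F2.
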
Partial Weight for Singleton Headers
- OTHER section class
  - Header is unique
  - Contains content terms
- Gave section class weight between 0 and 1
  - Some recall improvement
  - No collection-wide improvement in F2

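The difference between binary and partial section weights can be sketched as below, where each occurrence of a term contributes the weight of the section class it appears in. The additive scoring, the `weighted_term_scores` function, and the 0.5 weight are hypothetical; the slides state only that the OTHER class received a weight between 0 and 1.

```python
from collections import defaultdict


def weighted_term_scores(section_terms, class_weights):
    """Sum section-class weights over every occurrence of each term.

    Classes absent from `class_weights` get weight 0 and are skipped,
    which reproduces binary weighting when all weights are 0 or 1.
    """
    scores = defaultdict(float)
    for section_class, terms in section_terms:
        weight = class_weights.get(section_class, 0.0)
        if weight == 0.0:
            continue
        for term in terms:
            scores[term] += weight
    return dict(scores)


# Toy input: terms found in each section of one article.
toy_sections = [
    ("TITLE+ABSTRACT", ["Plasmids", "Kinetics"]),
    ("OTHER", ["Kinetics", "Genotype"]),
    ("ABBREVIATIONS", ["Lactams"]),
]
toy_weights = {"TITLE+ABSTRACT": 1.0, "OTHER": 0.5}  # 0.5 is an arbitrary partial weight
toy_scores = weighted_term_scores(toy_sections, toy_weights)
```

A partial weight lets terms from singleton-header sections enter the candidate list without ranking as highly as terms from fully weighted sections, which fits the reported recall gain without a gain in F2.
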