SemanticMining WP 20 meeting
Freiburg, March 29-30, 2004

Agenda

March 29
12:30 - 13:30  Lunch
13:30          Welcome, discussion of agenda
13:50 - 14:35  Linköping presentation
14:35 - 15:20  Brighton presentation
15:20 - 16:05  Göteborg presentation
16:05 - 16:45  Coffee break
16:30 - 17:05  Stockholm presentation
17:05 - 17:40  Geneva presentation
17:40 - 18:25  Paris presentation
18:25 - 19:00  Freiburg presentation
20:00          Dinner

March 30
9:00 - 10:30   Discussion of the description of WP 20
10:30 - 10:45  Coffee break
11:00 - 12:45  Workplan for WP 20: discussion and elaboration of deliverables
13:00 - 14:00  Lunch

Multi-lingual Medical Dictionary: Description of Work (I)

The lack of a large-scale multi-lingual medical dictionary hampers the integration of European research activities in the medical field and, more seriously, the development of multi-lingual information retrieval services. One language technology suited to this problem is corpus-based machine translation. The aim of this project is to develop techniques and systems for generating lexical data from parallel corpora, and to develop and apply methods for evaluating machine translation systems. Parallel corpora exist, e.g., as translations from English into other European languages of the official WHO classifications and some other terminology systems. Several of the NoE partners have extensive experience in multilingual lexical resources and computational lexicography, while others are interested in applying such tools, e.g., for semi-automated translation, semi-automated coding and indexing, and advanced information retrieval systems.

Multi-lingual Medical Dictionary: Description of Work (II)

Tasks
- 20.1 Facilitating short study visits of members of each other's groups
- 20.2 Sharing and exchange of methods, materials and collaboration on work in progress
- 20.3 Proposal for a common data structure for a multi-lingual medical dictionary
- 20.4 Generation of a multi-lingual medical lexicon in English, German, French, Portuguese, Italian, Spanish and Swedish, in a range of 4,000 to 40,000 entries per language

Deliverables
- D20.1 Report Multi-lingual Medical Dictionary (m11)
- D20.2 Report Multi-lingual Medical Dictionary (m17)

Topics for Discussion
- Lexeme features (morphology, syntax, semantics)
- Application context (IR, NLG, …)
- Linguistic framework (grammar theory)
- Languages covered
- Domain (sublanguages, general language, expert vs. layperson language)
- Size of the lexicon
- Implementation framework (sources, exchange templates)
- Interfaces to terminological resources (UMLS, WordNet)
- Methods for lexical acquisition (manual, semi-automatic)

MorphoSaurus: Subword Lexicon & Thesaurus
Freiburg University Hospital, Department of Medical Informatics
Freiburg University, Computational Linguistics Lab

Motivation: Intra- and Crosslingual Indexing for Information Retrieval

Requirements:
- Elimination of inflectional and derivational variation:
  {nucleus, nuclei}, {diagnosis, diagnoses, diagnostic}, {foot, feet}, {Lymphozyten, lymphozytär}
- Decomposition of compound terms:
  procto|sigmoid|o|scop|ie, para|sympath|ectomy, Rechts|herz|insuffizienz, psic|o|s|somát|ic|o
- Resolution of synonyms and spelling variants:
  {oesophagus, esophagus}, {leuko, leuco}, {cutis, skin}, {hemorrhage, bleeding}, {ascorbic, Vitamin C}, {ancylostoma, hookworm}
- Mapping of interlingual synonyms:
  {blood, blut, sangue}, {liver, hepat..., fígado}, {kidney, nephr.., nefr.., nier.., ren, rim}

What is a subword?

An atomic linguistic sense unit:
- Morphemes: nephr, anti, thyr, scler, hepat, cardi
- Morpheme aggregates: diaphys, ascorb, anabol, diagnost
- Words: amyloid, bone, fever, liver
- Exceptionally, noun groups: vitamin c, …

Goal: taming the growth rates of lexical resources at a sublinear level

Subword Delimitation Criteria
- Semantic (compositionality): Hyper | cholesterol | emia
- Lexical (enabling synonym matching): schleimhaut = mucosa (schleim | haut)
- Data-driven (avoiding ambiguities and false segmentation), e.g. relationship, schwangerschaft (relation|ship, schwanger|schaft)
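The effect of these criteria can be illustrated with a minimal sketch of lexicon-driven segmentation. This is not the MorphoSaurus segmenter; it is a simple longest-match splitter with backtracking, and the toy lexicon and the function name `segment` are assumptions made for illustration. It shows how keeping a word such as "relationship" as a single entry prevents the split relation|ship.

```python
from typing import List, Optional, Set

def segment(word: str, lexicon: Set[str]) -> Optional[List[str]]:
    """Split `word` into lexicon entries, trying the longest entry first.

    Returns None when the word cannot be fully covered by lexicon entries.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):  # longest prefix first
        prefix = word[:end]
        if prefix in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:  # backtrack if the remainder cannot be covered
                return [prefix] + rest
    return None

# Toy lexicon (hypothetical): whole-word entries block unwanted splits.
LEXICON = {
    "hyper", "cholesterol", "emia",
    "relation", "ship", "relationship",
    "schleim", "haut", "schleimhaut",
}

print(segment("hypercholesterolemia", LEXICON))
print(segment("relationship", LEXICON))
print(segment("relationship", LEXICON - {"relationship"}))
```

With the whole-word entry present, "relationship" stays intact; once it is removed, the same procedure falls back to the two shorter entries.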

The MorphoSaurus system
- Extracts semantically relevant subwords from medical texts in different languages
- Transforms IR-relevant content to concept-like semantic identifiers (MID = MorphoSaurus identifiers)

Example

Original:
  High TSH values suggest the diagnosis of primary hypothyroidism ...
  Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose ...

Orthographic normalization (orthographic rules):
  high tsh values suggest the diagnosis of primary hypothyroidism ...
  erhoehte tsh-werte erlauben die diagnose einer primaeren hypothyreose ...

Morphosyntactic parser (lexicon):
  high tsh value s suggest the diagnos is of primar y hypo thyroid ism
  er hoeh te tsh wert e erlaub en die diagnos e einer primaer en hypo thyre ose

Semantic normalization (thesaurus), MID representation:
  upiiij tsh valueiiqrij suggestiipzzr diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
  upiiij tsh valueiiqrij permitiji diagnostiiiryz primariiiyiy smalliiiqqi thyreiiprzw
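The three stages of the example (orthographic normalization, segmentation against a lexicon, mapping to MIDs) can be sketched roughly as follows. This is an illustrative approximation, not the MorphoSaurus implementation: the toy lexicons `EN` and `DE`, the `#`-prefixed MID labels, the umlaut table and the rule that unknown tokens pass through unchanged are all assumptions chosen so that the output mirrors the slide.

```python
import re
from typing import Dict, List, Optional

# Assumed orthographic rules: lowercase and expand German umlauts.
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue"}

# Toy lexicons (hypothetical): subword -> MID label, or None for
# affixes and function words that carry no indexable content.
EN: Dict[str, Optional[str]] = {
    "high": "#up", "value": "#value", "s": None, "suggest": "#suggest",
    "the": None, "diagnos": "#diagnost", "is": None, "of": None,
    "primar": "#primar", "y": None, "hypo": "#small",
    "thyroid": "#thyre", "ism": None,
}
DE: Dict[str, Optional[str]] = {
    "er": None, "hoeh": "#up", "te": None, "wert": "#value", "e": None,
    "erlaub": "#permit", "en": None, "die": None, "diagnos": "#diagnost",
    "einer": None, "primaer": "#primar", "hypo": "#small",
    "thyre": "#thyre", "ose": None,
}

def normalize(text: str) -> List[str]:
    """Lowercase, expand umlauts, and split into alphanumeric tokens."""
    text = text.lower()
    for src, dst in UMLAUTS.items():
        text = text.replace(src, dst)
    return re.findall(r"[a-z0-9]+", text)

def segment(word: str, lexicon: Dict[str, Optional[str]]) -> Optional[List[str]]:
    """Longest-match segmentation with backtracking; None if no full cover."""
    if not word:
        return []
    for end in range(len(word), 0, -1):
        if word[:end] in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [word[:end]] + rest
    return None

def to_mids(text: str, lexicon: Dict[str, Optional[str]]) -> List[str]:
    """Map a sentence to its sequence of MID labels."""
    mids: List[str] = []
    for token in normalize(text):
        parts = segment(token, lexicon)
        if parts is None:
            mids.append(token)  # assumption: unknown tokens (e.g. "tsh") are kept
            continue
        mids.extend(lexicon[p] for p in parts if lexicon[p] is not None)
    return mids

print(to_mids("High TSH values suggest the diagnosis of primary hypothyroidism", EN))
print(to_mids("Erhöhte TSH-Werte erlauben die Diagnose einer primären Hypothyreose", DE))
```

In this sketch the two sentences yield identical label sequences except for the verb (`#suggest` versus `#permit`), which is the property the slide illustrates: content from different languages becomes comparable once it is expressed in a shared identifier space.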

MorphoSaurus Thesaurus Features

Only two semantic relations:
- Syntagmatic expansion: nephrotomiiqwjja = nephriikwjza + tomyiiqjqqa
  (to avoid known mis-segmentations, e.g. nephr + oto + mie)
- Ambiguous readings: seitiiyqyqa = lateraliijwira OR pagerijjrja
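One way to picture the two relations is as lookup tables applied to a MID sequence: an expansion replaces one identifier by a sequence of identifiers, and an ambiguity replaces it by a set of alternatives. The sketch below is an assumption about how such tables could be applied, not a description of the actual system; the `#`-prefixed labels stand in for the identifiers on the slide, and enumerating every combination of readings is only one possible treatment of ambiguity.

```python
from itertools import product
from typing import Dict, List

# Hypothetical tables modelled on the two examples on the slide.
EXPANSION: Dict[str, List[str]] = {"#nephrotomy": ["#nephr", "#tomy"]}
AMBIGUOUS: Dict[str, List[str]] = {"#seit": ["#lateral", "#page"]}

def readings(mids: List[str]) -> List[List[str]]:
    """Return every MID sequence obtained by applying both relations."""
    slots: List[List[str]] = []
    for mid in mids:
        if mid in EXPANSION:
            # Syntagmatic expansion: one MID becomes several consecutive MIDs.
            slots.extend([part] for part in EXPANSION[mid])
        elif mid in AMBIGUOUS:
            # Ambiguous reading: one MID becomes a choice between alternatives.
            slots.append(AMBIGUOUS[mid])
        else:
            slots.append([mid])
    return [list(combo) for combo in product(*slots)]

for reading in readings(["#seit", "#nephrotomy"]):
    print(reading)
```

Storing the compound as a single entry with an explicit expansion avoids relying on the segmenter to split it correctly, which is the motivation the slide gives with the nephr + oto + mie mis-segmentation.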

Morphoedit Lexicon Editor

State of the Project
- Domain: clinical language and, in part, lay expressions
- Validated entries: 21,397 English, 22,053 German, 15,029 Portuguese
- Automatically generated entries: 8,992 Spanish subwords derived from Portuguese subwords

CLIR Experiments (OHSUMED)
- Manual translation of 106 English queries to German and Portuguese by medical experts
- Baseline (QTR): machine translation / bilingual dictionaries
  - Google Translator to re-translate German/Portuguese queries to English
  - Additional search in a bilingual lexeme dictionary derived from the UMLS Metathesaurus
  - Stemming with the Porter stemming algorithm; stop word elimination
- MorphoSaurus (MSI): normalization of queries and documents
- Boolean search engine: frequency and adjacency measure
- Results German: QTR 68%, MSI 93%
- Results Portuguese: QTR 54%, MSI 62%
(RIAO'04)

Multilingual MeSH Mapping
- Morpho-semantic normalization of 35,000 English, manually MeSH-annotated Medline abstracts
- Statistical learning of indexing patterns
- Using indexing patterns for mapping of normalized English/German/Portuguese texts

Results:
              Agreement with gold standard   Agreement with human indexers
English       33%                            (68%)
German        30%                            (62%)
Portuguese    27%                            (56%)
(RIAO'04)