Tools for exploring the biomedical information landscape


  • Number of slides: 31

Tools for exploring the biomedical information landscape
Les Grivell, EMBO Electronic Information Programme
EAHIL 2004, Santander

Electronic information programme
• Online research information environment for the life sciences
• A next generation information service for the life sciences
• [email protected]
• Life Sciences Mobility Portal

But first, let me take you back – not to Altamira, but to the … early days of scientific publishing (pre-impact factor)

When libraries were comfortable places that had everything you needed …

… and it was possible to keep track of the literature … (more or less)

Where are we now? – Publishing is big business
• STM publishing is a multi-billion EUR activity (in the UK alone, GBP 22 billion in 2000)
• An estimated 164,000 scientific periodicals worldwide; around 16% of these are online

– Core science; core journals
• PubMed lists some 4,600 journals in biomedical disciplines
• As of 19 Sept 2004, 4,429 of these are online
• The PubMed database provides access to circa 15 million abstracts (but if you can't be found, you won't be read …)
• The Science Citation Index lists 5,876 journals with impact factors ranging from 54.45 to 0.00 (you've been found, but are you worth reading? …)

Another information explosion: genomics
[Figure: sequence entries in the EMBL DNA database; y-axis: base pairs (billions), 0 to 35; x-axis: year, 1985 to 2005; annotation: Morowitz]

Raw sequences are not the only form of digital information

The nice thing about biological information resources is that there are so many …
• Hundreds of different databases, many in flatfile format
• A variety of user interfaces
• General lack of interoperability

Wouldn't it be nice to … find all published literature references for a large set of gene symbols and explore their relationships?
[Diagram labels: Micro-array chip; Co-regulated genes; Find literature; Database lookup; Discover relationships]

This is not really such a novel idea …

Fritz Saxl (1890–1948) and Aby Warburg (1866–1929)
'Ich will nicht, dass in der Bibliothek ewig gesucht wird! Dieses Suchen kostet Nerven und die dürfen nicht verschwendet werden an solche Dummheiten ...'
'I don't want there to be endless searching in the library! It is at the expense of nerves and these should not be wasted on such stupidities …'

Saxl & Warburg: Mnemosyne Atlas

Some text search engines
• Bibliographic databases (e.g. Biosis)
• Full text / web-pages

PubMed
• Text-based! No direct linkage to other datasets
• Search only title, authors, abstract
• Boolean keyword search (AND / OR)
• Search language is English
• All documents stored and indexed in one location
• No ranking on relevance to query!
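The kind of Boolean keyword search described on this slide can be sketched roughly as follows. This is a toy illustration only, not PubMed's actual implementation: the records, the function names `build_index` and `boolean_search`, and the whitespace tokenisation are all assumptions made for the example. Note that the result is a plain list of matching record IDs with no relevance score attached.

```python
from collections import defaultdict

# Hypothetical toy records (ID -> title text); real bibliographic
# records are far richer than this.
RECORDS = {
    1: "Frataxin depletion in yeast",
    2: "Gene expression in ataxia",
    3: "Yeast gene networks",
}

def build_index(records):
    """Build an inverted index: lower-cased word -> set of record IDs."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in text.lower().split():
            index[word].add(rec_id)
    return index

def boolean_search(index, terms, mode="AND"):
    """Return IDs matching all terms (AND) or any term (OR), unranked."""
    sets = [index.get(term.lower(), set()) for term in terms]
    if not sets:
        return []
    hits = set.intersection(*sets) if mode == "AND" else set.union(*sets)
    # Sorted by ID only; nothing here reflects relevance to the query.
    return sorted(hits)
```

Every record either matches or does not, so a query that is too broad returns a long undifferentiated list and one that is too narrow returns nothing.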

Main features
• Ability to interconnect literature articles with different types of molecular data, including images
• Ability to search through and retrieve journal articles and other full text documents, even when in different physical locations
• Ability to support multi-lingual documents and queries
• Services free to the academic community
Features implemented via conceptual fingerprinting
A discovery tool

Conceptual fingerprints
Full text document → index and link index terms to (multi-lingual) thesauri → fingerprint database
• 1 conceptual fingerprint (CFP) = 400 bytes
• Abstraction: 250,000 pages/PC/day
• Matching: 500,000 CFPs in 40 ms
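One way to picture the idea of a conceptual fingerprint is as a small weighted vector of thesaurus concepts, so that documents and queries in different languages can be compared by concept rather than by literal word. The sketch below is an assumption-laden illustration, not the fingerprinting method used by the service: the mini-thesaurus, the concept IDs, the cosine-style similarity, and the function names are all invented for the example. As a size check on the slide's figures, 500,000 fingerprints at 400 bytes each come to 200,000,000 bytes, about 200 MB.

```python
import math
from collections import Counter

# Hypothetical mini-thesaurus: surface terms in several languages
# mapped to language-independent concept IDs.
THESAURUS = {
    "protein": "C1", "protéine": "C1", "eiweiß": "C1",
    "gene": "C2", "gène": "C2", "gen": "C2",
    "disease": "C3", "maladie": "C3", "krankheit": "C3",
}

def fingerprint(text):
    """Map text to a unit-length vector of concept weights."""
    words = text.lower().split()
    counts = Counter(THESAURUS[w] for w in words if w in THESAURUS)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {cid: c / norm for cid, c in counts.items()} if norm else {}

def similarity(fp_a, fp_b):
    """Dot product of two unit-length fingerprints (cosine similarity)."""
    return sum(w * fp_b.get(cid, 0.0) for cid, w in fp_a.items())

def rank(query, documents):
    """Return names of documents sharing concepts with the query, best first."""
    q = fingerprint(query)
    scored = [(similarity(q, fingerprint(text)), name)
              for name, text in documents.items()]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]
```

Under these assumptions a German query such as "gen krankheit" can retrieve a French document containing "gène maladie", because both reduce to the same concept IDs, and results come back ordered by similarity rather than as an unranked set.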

Prototypes
• Initial prototypes in September 2002 and July 2003
• Current prototype online since 1 March 2004
• Next launch due mid-October 2004

E-BioSci
• Content selection: abstracts + full text
• Choose search focus
• Full text query in English, French or German; the query is fingerprinted for search

… and now a word about
• 8 partners (DE, ES, FR, UK) (Platform)
• 13 partners (ES, FR, IT, NL, UK) (Research project)

Oriel's aims

Wouldn't it be nice to be able to navigate from an image to literature and molecular databases?
www.bioimage.org (Dr David Shotton, Univ. Oxford)

Gene symbol identification in text
Text containing symbols

Improved literature – molecular dataset linkage
Twinkle, twinkle, little star, How I wonder what you are. Up above the world so high, Like a diamond in the sky. Twinkle, twinkle, little star, How I wonder what you are.
Gene symbols shown: PEO1, GUCY2C, TYRO3, CD44

Problems in gene symbol recognition
• Many gene symbols are indistinguishable from everyday words or abbreviations
• Synonyms
• Homonym synonyms (ELK1 = SAP1; CAR1 = SAP1; BD-2 = SAP1; RIP1_SAPOF = SAP1)
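The problems listed above show up quickly in even a naive dictionary-based tagger. The sketch below is only an illustration under stated assumptions: the lexicon is a toy, the SAP1 candidates are copied from the slide, the "twinkle" entry is an assumed example of an everyday word that is also a gene name, and `tag_symbols` is a hypothetical helper rather than any real tool's API.

```python
import re

# Toy lexicon: lower-cased alias -> candidate gene symbols.
LEXICON = {
    # Homonym from the slide: one alias, several different genes.
    "sap1": ["ELK1", "CAR1", "BD-2", "RIP1_SAPOF"],
    # Assumed example of an everyday word that doubles as a gene name.
    "twinkle": ["PEO1"],
}

def tag_symbols(text):
    """Return (token, candidate symbols, is_ambiguous) for each lexicon hit."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9_-]+", text):
        token = match.group(0)
        candidates = LEXICON.get(token.lower())
        if candidates:
            hits.append((token, candidates, len(candidates) > 1))
    return hits
```

Run on a nursery rhyme, this tagger reports a gene at every "twinkle", and run on a sentence mentioning SAP1 it cannot say which of four genes is meant. Plain lookup gives both false positives and unresolved homonyms, which is why context-aware methods are needed.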

Word-“processing”
[Figure: jumbled word fragments, including gene, protein, FRDA, frataxin, depletion, disease, required, activates]

Natural language processing

Protein interaction networks
[Diagram: nodes ataxia, Yfh1, Ssc1, Isu1, Oct1; edge labels requires, regulates, interacts, activates]

Hoffmann & Valencia (Madrid)

Some web-addresses
• http://www.e-biosci.org
• http://www.oriel.org
• http://www.bioimage.org
• http://www.pdg.cnb.uam.es/UniPub/iHOP/