
416813699c7783bd0239bfe92f056c5f.ppt
- Number of slides: 24
Concept Modeling in Bio-informatics
Sanida Omerovic*, Saso Tomazic*, Mateo Valero**, Milos Milovanovic**, David Torrents**
* University of Ljubljana, Slovenia; ** UPC, Barcelona, Catalonia
IPSI Firenze-2007
WHAT IS A CONCEPT?
Decision Making Algorithm
Concept Modeling Layer
- What is a concept?
- How is it modeled?
- How is it built?
- How is it exploited?
- How is it updated?
Classification of concept modeling (CM) and decision making systems (DMS)
- This classification is based on the following assumption: any decision making system, whether the process is performed entirely by humans, supported by machines, or fully automated, is a layered process with one layer (explicit or implicit) that can be called the Concept Modeling Layer.
Purpose (DMS)
- General
- Specialized (Bio-informatics)
Bio-informatics
- Genomic researchers mostly deal with similarity between genomic sequences.
- Genomic sequences are treated as long sequences of letters, A (adenine), G (guanine), C (cytosine), and T (thymine), which represent the nitrogenous bases in DNA.
A DNA sequence is presented as an array of letters that map to the nucleotides in DNA (each consisting of one of four nitrogenous bases, A/G/C/T, a five-carbon sugar, and a molecule of phosphoric acid).
[Figure: DNA chemical compound with bases (A), (T), (G), (C), shown next to the DNA sequence]
DNA sequence analysis
GATTCATCGA CCATCAAAT GATT
[Figure labels: useful data, noisy data, start sequence, end sequence]
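
The letter-array view of DNA described on these slides can be sketched in a few lines of Python. This is only an illustration, not part of the original presentation: the function name `base_counts` is hypothetical, and the input is the first fragment shown on the slide.

```python
from collections import Counter

# The four letters used on the slides for the nitrogenous bases.
DNA_ALPHABET = set("AGCT")

def base_counts(sequence: str) -> dict:
    """Count each base in a DNA sequence given as a string of letters."""
    seq = sequence.upper()
    unknown = set(seq) - DNA_ALPHABET
    if unknown:
        # Anything outside A/G/C/T is rejected rather than silently skipped.
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts.get(base, 0) for base in "AGCT"}

counts = base_counts("GATTCATCGA")
print(counts)
```

Treating the sequence as a plain string keeps the example close to the slides' framing of DNA as "letters"; real pipelines typically use dedicated sequence types.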
Bio-informatics in DMS
- Sequence concept (still impossible: there is no protein conceptual model)
- Sequence analysis (software: BLAST, Smith-Waterman, FASTA, etc.)
- Sequence retrieval (easy: available for free on the Web at ENSEMBL.ORG, NCBI, UCSC, etc.)
- Sequencing (hard: laboratory work at the level of chemical reactions to determine whether C, T, G, or A is present at a given position in the DNA chain)
Sequence analysis
- The next two figures show a fraction of the results obtained from a BLAST comparison of the human protein SLC7A7 against the Swiss-Prot protein database.
- We selected two illustrative examples, ranging from a perfect (word-for-word) match to a similar match.
BLAST Sample session, perfect match
>gi|12643348|sp|Q9UHI5|LAT2_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643348&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643348%5BPUID%5D>
Large neutral amino acids transporter small subunit 2 (L-type amino acid transporter 2) (hLAT2)
Length=535
Score = 665 bits (1717), Expect = 0.0, Method: Composition-based stats.
Identities = 332/332 (100%), Positives = 332/332 (100%), Gaps = 0/332 (0%)

Query  1    MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  60
            MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK
Sbjct  204  MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  263

Query  61   NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  120
            NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA
Sbjct  264  NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  323

Query  121  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  180
            LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD
Sbjct  324  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  383

Query  181  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  240
            MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW
Sbjct  384  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  443

Query  241  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  300
            SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG
Sbjct  444  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  503

Query  301  TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP  332
            TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP
Sbjct  504  TEEANEDMEEQQQPMYQPTPTKDKDVAGQPQP  535
BLAST Sample session, similar match
>gi|12643378|sp|Q9UM01|YLA1_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643378&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643378%5BPUID%5D>
Y+L amino acid transporter 1 (y(+)L-type amino acid transporter 1) (y+LAT-1) (Y+LAT1) (Monocyte amino acid permease 2) (MOP-2)
Length=511
Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)

Query  2    GIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYKN  61
            GIV++ +G     E  N+FE      +G +ALA     F+Y GW+ LNYVTEE+ +P +N
Sbjct  202  GIVRLGQGASTHFE--NSFEG-SSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERN  258

Query  62   LPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVAL  121
            LP +I IS+P+VT +Y+  NVAY T +  +++LAS+AVAVTF +++ G+  WI+P+SVAL
Sbjct  259  LPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVAL  318

Query  122  STFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSDM  181
            S FGG+N S+  +SRLFF G+REGHLP  + MIHV+R TP+P+LLF  I  L+ L   D+
Sbjct  319  SCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDI  378

Query  182  YTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLWS  241
            + LINY  F  + F G+++ GQ+ LRWK+PD PRP+K+++ FPI++ L   FL+   L+S
Sbjct  379  FQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYS  438

Query  242  EPVVCGIGLAIMLTGVPVYFL--GVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGS  299
            + +   IG+AI L+G+P YFL   V    +P      +   T   Q +C+ V  E++
Sbjct  439  DTINSLIGIAIALSGLPFYFLIIRVPEHKRPLYLRRIVGSATRYLQVLCMSVAAEMDLED  498

Query  300  GTEEANEDMEEQQQP  314
            G E     M +Q+ P
Sbjct  499  GGE-----MPKQRDP  508
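
As a rough illustration of what the Identities and Gaps figures in such a report mean, the sketch below counts them for a pair of already-aligned rows. The function name `alignment_stats` is hypothetical, and the input is only the last Query/Sbjct row of the similar match above, so the result covers that 15-column fragment and not the whole hit. Positives are not computed, since that would need a substitution matrix.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities and gaps for two aligned rows ('-' marks a gap)."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    length = len(query)
    # A column is an identity when both rows carry the same residue.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column is a gap when either row has a '-'.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "identity_pct": round(100 * identities / length),
    }

# Last alignment row of the similar match (Query 300-314, Sbjct 499-508).
stats = alignment_stats("GTEEANEDMEEQQQP", "GGE-----MPKQRDP")
print(stats)
```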
BLAST output
Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats. Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)
- BLAST expresses the level of similarity between the query sequence and a database sequence in terms of Score, Expect (expectation value), Method, Identities, Positives, and Gaps.
- This is where our DMA layer ends. From this point on, inference must be done by researchers on the basis of software output (e.g. BLAST) and knowledge gathered elsewhere (books, computers, brains…).
- A forthcoming challenge in comparative genomic analysis is comparing large amounts of genomic data (letters). For example, comparing one mammalian genomic sequence against all existing mammalian sequences would require a database with about 60 GB of storage (Sargasso Sea project).
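
Because the decision-making layer described here ends at the BLAST report, a downstream step has to read the summary fields before any inference can happen. The following is a minimal sketch of that step, assuming the summary is available as one cleaned-up text line in the format shown on the slide; `parse_blast_summary` is a hypothetical helper, not a BLAST API.

```python
import re

def parse_blast_summary(text: str) -> dict:
    """Extract the numeric fields from a BLAST hit summary line."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    expect = re.search(r"Expect\s*=\s*(\d+(?:\.\d+)?(?:e[+-]?\d+)?)", text)
    if score is None or expect is None:
        raise ValueError("not a BLAST summary line")
    result = {
        "bits": float(score.group(1)),
        "raw_score": int(score.group(2)),
        "expect": float(expect.group(1)),
    }
    # Identities, Positives and Gaps are all reported as count/length.
    for name in ("Identities", "Positives", "Gaps"):
        match = re.search(rf"{name}\s*=\s*(\d+)/(\d+)", text)
        if match is None:
            raise ValueError(f"missing field: {name}")
        result[name.lower()] = (int(match.group(1)), int(match.group(2)))
    return result

summary = (
    "Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats. "
    "Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)"
)
hit = parse_blast_summary(summary)
print(hit)
```

Regular expressions are used only because the slide shows the plain-text report; BLAST can also emit structured output, which would be the more robust input for real work.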
Application for text analysis
- Frequency (number of occurrences)
- Distance
- Exclude stop words (and, if, or, etc.)
- Stemming (traveling => travel; traveled => travel)
- Synonyms (sick = ill)
- Visual Basic
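
The preprocessing steps listed on this slide can be sketched as follows. The original application was written in Visual Basic; this Python version is an assumption-laden illustration only. The stop-word list, the synonym table, and the suffix-stripping rule in `naive_stem` are all hypothetical stand-ins, and the stemmer is far cruder than a real one.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny lookup tables for illustration.
STOP_WORDS = {"and", "if", "or", "is"}
SYNONYMS = {"ill": "sick"}  # map each synonym to one canonical form

def naive_stem(word: str) -> str:
    """Strip a trailing 'ing' or 'ed' when at least three letters remain."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(text: str) -> Counter:
    """Count terms after stop-word removal, stemming and synonym merging."""
    terms = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in STOP_WORDS:
            continue
        stem = naive_stem(token)
        terms.append(SYNONYMS.get(stem, stem))
    return Counter(terms)

freq = term_frequencies("He traveled and is traveling; if ill or sick, travel less.")
print(freq.most_common(2))
```

Merging synonyms after stemming means "ill" and "sick" contribute to a single count, which matches the slide's "sick = ill" rule.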
Home-made Brandy Production
Grape-gathering is the first phase in the production of brandy, though it may also be made from plums, figs, pears or cornel berries. The gathered grapes are crushed and then poured into wooden barrels. They are mixed several times a day, the more often the better. The obtained mass is called wine-marc. The process of alcoholic fermentation usually lasts fifteen to thirty days. When it is finished, or, as people usually say, when the marc is still, distillation begins, i.e. the making of brandy, which is done in special copper cauldrons. Hand-made copper cauldrons can still be found in Tuscan households…
[Table: word, frequency, distance]
word: brandy, brandy …, grape, alcohol, distillation, strength, making
frequency, distance: 10 4 3 3 2 0 1 3 5 5
Concept criteria
- Frequency > 5
- Distance < 2
Concepts
- brandy, grape
- brandy, alcohol
Transcription
- brandy - made of - grapes
- brandy - kind of - alcohol
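
The two thresholds above amount to a simple filter over per-word statistics. The sketch below shows that filter under stated assumptions: `select_concepts` is a hypothetical name, and the (frequency, distance) values are invented for illustration and are not taken from the slides.

```python
def select_concepts(stats: dict, min_frequency: int = 5, max_distance: int = 2) -> list:
    """Keep words with frequency > min_frequency and distance < max_distance."""
    return [
        word
        for word, (frequency, distance) in stats.items()
        if frequency > min_frequency and distance < max_distance
    ]

# Invented example values: word -> (frequency, distance).
example_stats = {
    "brandy": (10, 0),
    "grape": (7, 1),
    "cauldron": (6, 4),  # frequent, but too far away
    "barrel": (2, 1),    # close, but too rare
}
concepts = select_concepts(example_stats)
print(concepts)
```

Both comparisons are strict, mirroring the "> 5" and "< 2" on the slide. How distance itself is measured is not defined in the presentation, so it is taken here as a given number.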
Concept Modeling layer
- Implicit (concepts are not explicitly mentioned): protein conceptual model
- Explicit (concepts are explicitly mentioned and/or defined): Frequency > 5, Distance < 2
Concept definition (CM)
- Node in concept network (semantic web)
- Node in concept web
Concept definitions
- A structure that carries meaning.
- Needs other concepts and relations among them to be defined. Without relations, a concept cannot exist.
- Relations between concepts can also be regarded as concepts.
- All concepts can be related to each other, forming either:
  1. a concept web (where relations are also concepts), or
  2. a concept network (where relations are not concepts)
[Figure: Concept Network and Concept Web]
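
One way to read the network/web distinction is as two graph encodings of the same statements. The sketch below is an interpretation, not the authors' model: in the network, a relation is only a label on an edge, while in the web each relation is promoted to a node of its own. The helper `network_to_web` is hypothetical; the example triples come from the brandy transcription earlier in the deck.

```python
# Concept network: relations are plain edge labels, not concepts.
network_nodes = {"brandy", "grape", "alcohol"}
network_edges = [
    ("brandy", "made of", "grape"),
    ("brandy", "kind of", "alcohol"),
]

def network_to_web(nodes: set, edges: list) -> tuple:
    """Promote every relation label to a concept node of its own."""
    web_nodes = set(nodes)
    web_links = []
    for source, relation, target in edges:
        web_nodes.add(relation)                # the relation becomes a concept
        web_links.append((source, relation))   # unlabeled link into the relation
        web_links.append((relation, target))   # unlabeled link out of the relation
    return web_nodes, web_links

web_nodes, web_links = network_to_web(network_nodes, network_edges)
print(sorted(web_nodes))
print(web_links)
```

In the web form, "made of" can itself take part in further relations, which is what "relations are concepts also" would allow.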
Concept Modeling Learning Module
Thank you for your attention! Questions?