88c16ebf9b05a2ed89dc9b8e5afd4af6.ppt
- Number of slides: 61
Role of Computer and Information Science in Biology
Presented by Dr G. P. S. Raghava, Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India & Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava@imtech.res.in
Web: http://www.imtech.res.in/raghava/
Major Applications & Challenges
• Introduction to Biology
• Genome Annotation: Gene Prediction
• Analysis and Comparison of Sequences
• Protein Structure Prediction
• DNA Chip (Microarray) Technology
• Proteomics: Analysis of 2D Gels
• Fingerprinting Technique
• Drug Development
• Computer-Aided Vaccine Design
Hierarchy in Biology: Atoms, Molecules, Macromolecules, Organelles, Cells, Tissues, Organ Systems, Individual Organisms, Populations, Communities, Ecosystems, Biosphere
Animal cell
Human Chromosomes
Genes are linearly arranged along chromosomes
Chromosomes and DNA
DNA can be simplified to a string of four letters (G, A, T, C), e.g. GATTACA
(RT)
Sequence to Structure: It’s a matter of dimensions!
• 1D nucleic acid sequence: AGT-TTC-CCA-GGG…
• 1D protein sequence: Met-Ala-Gly-Lys-His… (M – A – G – K – H…)
• 3D spatial arrangement of atoms
Genome Annotation: the process of adding biological information and predictions to a sequenced genome framework
Importance of Sequence Comparison
• Protein Structure Prediction
  – Similar sequences have similar structure & function
  – Phylogenetic tree
  – Homology-based protein structure prediction
• Genome Annotation
  – Homology-based gene prediction
  – Function assignment & evolutionary studies
• Searching drug targets
  – Searching sequences present or absent across genomes
Protein Sequence Alignment and Database Searching
• Alignment of Two Sequences (Pair-wise Alignment)
  – The Scoring Schemes or Weight Matrices
  – Techniques of Alignments
  – DOTPLOT
• Multiple Sequence Alignment (Alignment of > 2 Sequences)
  – Extending Dynamic Programming to more sequences
  – Progressive Alignment (Tree or Hierarchical Methods)
  – Iterative Techniques: Stochastic Algorithms (SA, GA, HMM) and Non-Stochastic Algorithms
• Database Scanning
  – FASTA, BLAST, PSIBLAST, ISS
• Alignment of Whole Genomes
  – MUMmer (Maximal Unique Match)
Alignment of Two Sequences: Dealing with Gaps in Pair-wise Alignment
Sequence Comparison without Gaps
• Sliding-window method to get the maximum score
  ALGAWDE
  ALATWDE
  Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7 ≈ 71.4
• Sequences of variable length should use dynamic programming
Sequence Comparison with Gaps
• Insertions and deletions are common
• Sliding-window method fails
• Generate all possible alignments
• A 100-residue alignment requires > 10^75 alignments
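The ungapped comparison on this slide can be sketched in a few lines of Python. This is only an illustration under simple assumptions: identity scoring (1 for a match, 0 otherwise), and hypothetical function names that do not come from the slides.

```python
def identity_score(a: str, b: str) -> int:
    """Count positions where the two equal-length sequences carry the same residue."""
    return sum(1 for x, y in zip(a, b) if x == y)


def percent_identity(a: str, b: str) -> float:
    """Percent identity (PID) of two sequences of equal length, no gaps allowed."""
    if len(a) != len(b):
        raise ValueError("ungapped PID needs sequences of equal length")
    return 100.0 * identity_score(a, b) / len(a)


def best_ungapped_offset(query: str, target: str) -> tuple:
    """Slide the query along the target and return (best score, offset)."""
    best = (0, 0)
    for offset in range(-(len(query) - 1), len(target)):
        score = 0
        for i, residue in enumerate(query):
            j = i + offset
            if 0 <= j < len(target) and target[j] == residue:
                score += 1
        if score > best[0]:
            best = (score, offset)
    return best


print(identity_score("ALGAWDE", "ALATWDE"))    # score from the slide example
print(round(percent_identity("ALGAWDE", "ALATWDE"), 1))
```

Because no gaps are allowed, a single insertion or deletion shifts every later residue off its partner, which is why the slide moves on to dynamic programming.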
Alternate Dot Matrix Plot: diagonal runs of * show aligned/identical regions
Dynamic Programming
• Dynamic programming allows optimal alignment between two sequences
• Allows insertions and deletions, i.e. alignment with gaps
• Needleman and Wunsch algorithm (1970) for global alignment
• Smith & Waterman algorithm (1981) for local alignment
• Important Steps
  – Create DOTPLOT between two sequences
  – Compute SUM matrix
  – Trace optimal path
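The two last steps (compute the sum matrix, trace the optimal path) can be illustrated with a small global-alignment sketch in the spirit of Needleman and Wunsch. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration, not values given in the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Sum matrix: score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + (
            match if a[i - 1] == b[j - 1] else mismatch
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
print(total)
print(top)
print(bottom)
```

Several alignments can share the optimal score; the traceback above simply returns one of them. A local (Smith & Waterman style) variant would clamp cell scores at zero and start the traceback from the highest-scoring cell.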
Alignment of Multiple Sequences
Extending Dynamic Programming to more sequences
  – Dynamic programming can be extended to more than two sequences
  – In practice it requires excessive CPU time and memory (Murata et al., 1985)
  – MSA: limited to only 8-10 sequences (1989)
  – DCA (Divide and Conquer; Stoye et al., 1997): 20-25 sequences
  – OMA (Optimal Multiple Alignment; Reinert et al., 2000)
  – COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
  – Practical approach for multiple alignment
  – Compare all sequences pair-wise
  – Perform cluster analysis
  – Generate a hierarchy for alignment
  – First align the most similar pair of sequences
  – Align that alignment with the next most similar alignment or sequence
Database scanning
Basic principles of database searching
  – Search query sequence against all sequences in the database
  – Calculate scores and select top sequences
  – Dynamic programming is best
Approximation algorithms: FASTA
  – Fast sequence search
  – Based on dotplot
  – Identify identical words (k-tuples)
  – Search significant diagonals
  – Use PAM250 for further refinement
  – Dynamic programming for a narrow region
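The first two FASTA steps (identical k-tuples, then significant diagonals) can be sketched as follows. This is a simplified illustration, not the FASTA program itself: the word size k=2, the function names, and the toy sequences are assumptions, and the PAM250 rescoring and banded dynamic programming stages are omitted.

```python
from collections import defaultdict


def kmer_index(seq: str, k: int) -> dict:
    """Map every word (k-tuple) in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, target: str, k: int = 2) -> dict:
    """Count identical k-tuples per dotplot diagonal (diagonal = target pos - query pos)."""
    index = kmer_index(target, k)
    hits = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            hits[j - i] += 1
    return dict(hits)


def best_diagonal(query: str, target: str, k: int = 2):
    """Return (diagonal, word count) for the diagonal with most word matches, or None."""
    hits = diagonal_hits(query, target, k)
    if not hits:
        return None
    return max(hits.items(), key=lambda item: item[1])


print(best_diagonal("ALGAWDE", "XXALGAWDEYY"))
```

Only the region around the best diagonals would then be handed to the slower, exact dynamic programming step, which is where the speed-up comes from.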
Principles of FASTA Algorithms
Database Scanning or Fold Recognition
• Concept of PSIBLAST
  – Improve the sensitivity of BLAST
  – Perform the BLAST search (gap handling)
  – Generate the position-specific score matrix (PSSM)
  – Use the PSSM for the next round of search
• Intermediate Sequence Search
  – Search query against protein database
  – Generate multiple alignment or profile
  – Use profile to search against PDB
Comparison of Whole Genomes
• MUMmer (Salzberg group, 1999, 2002)
  – Pair-wise sequence alignment of genomes
  – Assumes that the sequences are closely related
  – Allows detection of repeats, inverted repeats, SNPs
  – Domains inserted/deleted
  – Identifies the exact matches
• How it works
  – Identify the maximal unique matches (MUMs) in the two genomes
  – As the two genomes are similar, large MUMs will be present
  – Sort the MUMs found and extract the longest set of matches that occur in the same order (ordered MUMs)
  – A suffix tree is used to identify MUMs
  – Close the gaps (SNPs, large inserts)
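The "extract the longest set of matches in the same order" step is essentially a longest-increasing-subsequence problem. The sketch below assumes the MUMs are already known and given as hypothetical (position in genome A, position in genome B, length) tuples; it does not build a suffix tree, uses a simple quadratic algorithm, counts matches rather than weighting them by length, and is not MUMmer's actual code.

```python
def ordered_mums(mums: list) -> list:
    """Longest chain of matches whose positions increase in both genomes."""
    if not mums:
        return []
    mums = sorted(mums)                 # sort by position in genome A
    n = len(mums)
    best = [1] * n                      # best[i]: longest chain ending at match i
    prev = [-1] * n                     # back-pointer for chain reconstruction
    for i in range(n):
        for j in range(i):
            in_order = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if in_order and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


# Toy input: the match at (30, 100) is out of order relative to the others.
example = [(0, 0, 20), (30, 100, 15), (50, 40, 25), (90, 80, 18), (120, 130, 30)]
print(ordered_mums(example))
```

The regions between consecutive matches in the chain are the gaps that are then classified as SNPs, inserts, or repeats.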
Protein Structure Prediction
• Experimental Techniques
  – X-ray Crystallography
  – NMR
• Limitations of Current Experimental Techniques
  – Protein Data Bank (PDB) -> 24,000 protein structures
  – SwissProt -> 100,000 proteins
  – Non-Redundant (NR) -> 1,000 proteins
• Importance of Structure Prediction
  – Fill the gap between known sequences and structures
  – Protein engineering to alter the function of a protein
  – Rational drug design
Protein Structures
Techniques of Structure Prediction
• Computer simulation based on energy calculation
  – Based on physico-chemical principles
  – Thermodynamic equilibrium with a minimum free energy
  – Global minimum free energy of protein surface
• Knowledge-Based Approaches
  – Homology-Based Approach
  – Threading Protein Sequence
  – Hierarchical Methods
Energy Minimization Techniques
Energy-minimization-based methods, in their pure form, make no a priori assumptions and attempt to locate global minima.
• Static Minimization Methods
  – Many classical potentials can be constructed
  – Assume that the atoms in the protein are static
  – Problems (large number of variables & minima, and validity of potentials)
• Dynamical Minimization Methods
  – Motions of atoms are also considered
  – Monte Carlo simulation (stochastic in nature, time is not considered)
  – Molecular Dynamics (time, quantum mechanical, classical equations)
• Limitations
  – Large number of degrees of freedom, CPU power not adequate
  – Interaction potentials are not good enough to model
Knowledge-Based Approaches
• Homology Modelling
  – Needs homologues of known protein structure
  – Backbone modelling
  – Side chain modelling
  – Fails in the absence of homology
• Threading-Based Methods
  – New way of fold recognition
  – Sequence is fitted to known structures
  – Motif recognition
  – Loop & side chain modelling
  – Fails in the absence of a known example
Hierarchical Methods
Intermediate structures are predicted, instead of predicting the tertiary structure of a protein directly from the amino acid sequence
• Prediction of backbone structure
  – Secondary structure (helix, sheet, coil)
  – Beta-turn prediction
  – Super-secondary structure
• Tertiary structure prediction
• Limitations
  – Accuracy is only 75-80%
  – Only three-state prediction
[Figure: cDNA microarray workflow. cDNA clones (probes) -> PCR product amplification -> purification -> printing (0.1 nl/spot) -> microarray; hybridise target (mRNA) to microarray; scanning (excitation by laser 1 and laser 2, emission); overlay images and normalise; analysis]
Major Applications
• Identification of differentially expressed genes in diseased tissues (in the presence of a drug)
• Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
• Use the expression profile of a known disease to diagnose and classify unknown genes
Terms/Jargons
Stanford/cDNA chip                          | Affymetrix/oligo chip
one slide/experiment                        | one chip/experiment
one spot                                    | one probe/feature/cell
1 gene => one spot or few spots (replica)   | 1 gene => many probes (20~25 mers)
control: control spots                      | control: match and mismatch cells
control: two fluorescent dyes (Cy3/Cy5)     |
Images: examples
Pseudo-colour overlay of Cy3 and Cy5
Spot colour | Signal strength      | Gene expression
yellow      | Control = perturbed  | unchanged
red         | Control < perturbed  | induced
green       | Control > perturbed  | repressed
Processing of images
• Addressing or gridding
  – Assigning coordinates to each of the spots
• Segmentation
  – Classification of pixels either as foreground or as background
• Intensity determination for each spot
  – Foreground fluorescence intensity pairs (R, G)
  – Background intensities
  – Quality measures
Management of Microarray Data
• Magnitude of Data
  – Experiments
    • 50,000 genes in human
    • 320 cell types
    • 2000 compounds
    • 3 time points
    • 2 concentrations
    • 2 replicates
  – Data Volume
    • 4*10^11 data points
    • 10^15 B = 1 petabyte of data
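The data-point estimate follows from multiplying the experimental factors listed on the slide. The short check below reproduces it; the bytes-per-data-point figure is only what the slide's two totals imply when divided, not a number stated in the deck.

```python
from math import prod

# Experimental factors as listed on the slide.
experiment_factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = prod(experiment_factors.values())
print(f"{data_points:.2e} data points")          # about 4*10^11

# Storage per data point implied by the slide's 10^15 byte total (derived, not stated).
implied_bytes_per_point = 10**15 / data_points
print(f"{implied_bytes_per_point:.0f} bytes per data point")
```

The implied figure of a few kilobytes per data point is plausible if the raw image pixels behind each spot are stored, rather than a single number per spot.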
Management of Microarray Data: Major Issues
• Large volume of microarray data in the last few years
  – Storage and efficient access
  – Comparison and integration of data
• Problem of data access and exchange
  – Data scattered around the Internet
  – Supplementary material of publications
  – Difficult for users to access relevant data
• Problems with existing databases
  – Diverse purposes
  – Developed for a specific purpose
Management of Microarray Data
• Specific Databases
  – Platform (e.g. Stanford MA Database; SMD)
  – Organism (Yeast MA global viewer)
  – Project (Life cycle database of Drosophila)
• Problems with Supplements and MA databases
  – Lack of direct access
  – Quality not checked
  – No standard format
  – Incomplete data
Pre-processed cDNA Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

Genes   slide 1  slide 2  slide 3  slide 4  slide 5  …
1        0.46     0.30     0.80     1.51     0.90    …
2       -0.10     0.49     0.24     0.06     0.46    …
3        0.15     0.74     0.04     0.10     0.20    …
4       -0.45    -1.03    -0.79    -0.56    -0.32    …
5       -0.06     1.06     1.35     1.09    -1.09    …

Gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
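As a small illustration of the log-ratio definition and the colour convention above, the sketch below computes log2(R/G) for a spot and maps it to a display colour. The function names, the example intensities, and the zero-width tolerance are assumptions for illustration only.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression level of one spot: log2(red intensity / green intensity)."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive to take a log ratio")
    return math.log2(red / green)


def display_colour(value: float, tolerance: float = 0.0) -> str:
    """Red for values above 0, green for values below 0, yellow at 0 (within tolerance)."""
    if value > tolerance:
        return "red"
    if value < -tolerance:
        return "green"
    return "yellow"


for red, green in [(2000.0, 1000.0), (1000.0, 1000.0), (500.0, 1000.0)]:
    m = log_ratio(red, green)
    print(red, green, m, display_colour(m))
```

Using the log makes two-fold induction (+1) and two-fold repression (-1) symmetric around zero, which a raw ratio (2 versus 0.5) does not.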
Analysis of Microarray Data
• Analysis of images
• Preprocessing of gene expression data
• Normalization of data
  – Subtraction of background noise
  – Global/local normalization
  – Housekeeping genes (or same gene)
  – Expression as a ratio (test/reference) in log
• Differential gene expression
  – Repeats and calculation of significance (t-test)
  – Significance of fold change using statistical methods
• Clustering
  – Supervised/Unsupervised (Hierarchical, K-means, SOM)
• Prediction or supervised machine learning (SVM)
Normalization Techniques
• Global normalization
  – Divide channel values by the channel mean
• Control spots
  – Common spots in both channels
  – Housekeeping genes
  – Ratio of intensity of the same gene in the two channels is used for correction
• Iterative linear regression
• Parametric nonlinear normalization
  – log(Cy3/Cy5) vs log(Cy5)
  – Fitted log ratio - observed log ratio
• General nonlinear normalization
  – LOESS curve between log(R/G) vs log(sqrt(R*G))
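Two of the items above are easy to sketch: global normalization by the channel mean, and the pair of quantities a LOESS normalization would be fitted on, log(R/G) against log(sqrt(R*G)). The function names and toy intensities are assumptions; base-2 logs are used to match the log2 ratio defined earlier, and the LOESS fit itself is not implemented here.

```python
import math


def global_normalize(channel: list) -> list:
    """Divide every intensity in one channel by that channel's mean."""
    mean = sum(channel) / len(channel)
    return [value / mean for value in channel]


def ma_values(red: float, green: float) -> tuple:
    """Return (M, A): M = log2(R/G), A = log2(sqrt(R*G))."""
    m = math.log2(red / green)
    a = math.log2(math.sqrt(red * green))
    return m, a


# Toy example: the red channel is uniformly twice as bright as the green one.
red_raw = [100.0, 200.0, 300.0]
green_raw = [50.0, 100.0, 150.0]

red_norm = global_normalize(red_raw)
green_norm = global_normalize(green_raw)
for r, g in zip(red_norm, green_norm):
    print(ma_values(r, g))      # M is 0 once the global dye bias is removed
```

Global normalization removes only a constant bias between the channels; an intensity-dependent bias shows up as a curved trend of M against A, which is what the LOESS curve is meant to capture.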
Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations
Issues in Clustering
• Pre-processing (image analysis and normalization)
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K
Unsupervised Learning
• Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question of "how many gene clusters are there?"]
• K-means clustering: assuming there are K clusters. [What if this assumption is incorrect?]
• Model-based clustering: the number of clusters is determined dynamically. [Could be one of the most promising methods.]
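To make the K-means assumption concrete, here is a minimal sketch that clusters toy expression profiles given K starting centres. Everything here is illustrative: the function names, the squared Euclidean distance, the fixed initial centres, and the four made-up profiles are assumptions, not data or code from the slides.

```python
def squared_distance(p: tuple, q: tuple) -> float:
    """Squared Euclidean distance between two expression profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point: tuple, centers: list) -> int:
    """Index of the centre closest to the point."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans(points: list, initial_centers: list, max_iterations: int = 100):
    """Plain K-means; K is fixed by the number of initial centres supplied."""
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest(p, centers)].append(p)
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:      # converged: assignments no longer change centres
            break
        centers = new_centers
    labels = [nearest(p, centers) for p in points]
    return centers, labels


# Toy profiles (log ratios in two slides): two induced genes, two repressed genes.
profiles = [(1.0, 1.1), (0.9, 1.0), (-1.0, -0.9), (-1.1, -1.0)]
centers, labels = kmeans(profiles, [profiles[0], profiles[2]])
print(labels)
```

The caveat in the slide applies directly: the result depends on the K (and the starting centres) passed in, so a wrong K forces the genes into the wrong number of groups.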
Supervised Analysis
• Fisher’s linear discriminant analysis
• Quadratic discriminant analysis
• Logistic regression (a linear discriminant analysis)
• Neural networks
• Support vector machine
Traditional Proteomics
• 1D gel electrophoresis (SDS-PAGE)
• 2D gel electrophoresis
• Protein Chips
  – Chips coated with proteins/antibodies
  – Large-scale version of ELISA
• Mass Spectrometry
  – MALDI: mass fingerprinting
  – Electrospray and tandem mass spectrometry
• Sequencing of peptides (N->C)
• Matching in genome/proteome databases
Overview of 2D Gel
• SDS-PAGE + isoelectric focusing (IEF)
  – Gene expression studies
  – Medical applications
  – Sample experiments
• Capturing and Analyzing Data
  – Image acquisition
  – Image sizing & orientation
  – Spot identification
  – Matching and analysis
Comparison/Matching of Gel Images
• Compare 2 gel images
  – Set x and y axes
  – Overlap matching spots
  – Compare intensity of spots
• Scan against database
  – Compare query gel with all gels
  – Calculate similarity score
  – Sort based on score
Proteomics: Fingerprints of Disease
Normal Cells vs Disease Cells
Phenotypic Changes
• Differential protein expression
• Protein nitration patterns
• Altered phosphorylation
• Altered glycosylation profiles
Utility
• Target discovery
• Disease pathways
• Disease biomarkers
Fingerprinting Technique
• What is fingerprinting
  – It is a technique to create a specific pattern for a given organism/person
  – To compare the patterns of query and target objects
  – To create a phylogenetic tree/classification based on pattern
• Types of Fingerprinting
  – DNA fingerprinting
  – Mass/peptide fingerprinting
  – Properties based (toxicity, classification)
  – Domain/conserved pattern fingerprinting
• Common Applications
  – Paternity and maternity
  – Criminal identification and forensics
  – Personal identification
  – Classification/identification of organisms
  – Classification of cells
Fingerprinting Techniques: Principles & Applications
• What is fingerprinting
• Types of Fingerprinting
• Common Applications
• Role of Computer in DNA Fingerprinting
  – Searching restriction enzymes
  – Searching VNTRs
  – Computation of size of DNA fragments
  – Optimization of gels
  – Comparison of patterns
  – Creation of phylogenetic tree
Drug Design
History of Drug/Vaccine Development
– Plants or Natural Products
  • Plants and natural products were the source of medicinal substances
  • Example: foxglove was used to treat congestive heart failure
  • Foxglove contains digitalis and cardiotonic glycosides
  • Identification of the active component
– Accidental Observations
  • Penicillin is one good example
  • Alexander Fleming observed the effect of mold
  • The mold (Penicillium) produces the substance penicillin
  • The discovery of penicillin led to large-scale screening
  • Soil microorganisms were grown and tested
  • Streptomycin, neomycin, gentamicin, tetracyclines etc.
Drug Design
• Chemical Modification of Known Drugs
  – Drug improvement by chemical modification
  – Penicillin G -> Methicillin; morphine -> nalorphine
• Receptor-Based Drug Design
  – The receptor is the target (usually a protein)
  – The drug molecule binds to cause biological effects
  – It is also called the lock and key system
  – Structure determination of the receptor is important
• Ligand-Based Drug Design
  – Search for a lead compound or active ligand
  – The structure of the ligand guides the drug design process
Drug Design based on Bioinformatics Tools
• Detect the Molecular Bases for Disease
  – Detection of drug binding site
  – Tailor drug to bind at that site
  – Protein modeling techniques
  – Traditional method (brute force testing)
• Rational drug design techniques
  – Screen likely compounds built
  – Modeling of a large number of compounds (automated)
  – Application of artificial intelligence
  – Limitation of known structures
Important Points in Drug Design based on Bioinformatics Tools
• Application of Genome
  – 3 billion base pairs
  – 30,000 unique genes
  – Any gene may be a potential drug target
  – ~500 unique targets
  – There may be 10 to 100 variants at each target gene
  – 1.4 million SNPs
  – 10^200 potential small molecules
Concept of Drug and Vaccine
• Concept of Drug
  – Kill invading foreign pathogens
  – Inhibit the growth of pathogens
• Concept of Vaccine
  – Generate memory cells
  – Train the immune system to face various existing disease agents
VACCINES
A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES NOT YET AVAILABLE FOR DISEASES LIKE: MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS
Computer Aided Vaccine Design
• Whole Organism of Pathogen
  – Consists of more than 4000 genes and proteins
  – Genomes have millions of base pairs
• Target antigen to recognise pathogen
  – Search vaccine target (essential and non-self)
  – Consists of an amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T……)
• Search antigenic region (peptide of length 9 amino acids)
Major steps of endogenous antigen processing
Computer Aided Vaccine Design
• Problem of Pattern Recognition: label each peptide as Epitope or Non-epitope
  – ATGGTRDAR
  – LMRGTCAAY
  – RTTGTRAWR
  – EMGGTCAAY
  – ATGGTRKAR
  – GTCVGYATT
• Commonly used techniques
  – Statistical (Motif and Matrix)
  – AI Techniques
Why computational tools are required for prediction
200 aa protein -> chopped into overlapping peptides of 9 amino acids -> 192 peptides -> bioinformatics tools -> 10-20 predicted peptides -> in vitro or in vivo experiments for detecting which snippets of protein will spark an immune response
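The 192 figure is the number of overlapping 9-residue windows in a 200-residue protein (200 - 9 + 1). A minimal sketch of the chopping step, with a hypothetical function name and a placeholder sequence rather than a real antigen:

```python
def overlapping_peptides(protein: str, length: int = 9) -> list:
    """All overlapping peptides of the given length, sliding one residue at a time."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Placeholder 200-residue sequence, built only to check the count.
protein = "AVLGYRGCTK" * 20
peptides = overlapping_peptides(protein)
print(len(protein), len(peptides))
```

Each of these peptides would then be scored by a prediction method (motif, matrix, or an AI technique), so that only the 10-20 top candidates need to be tested experimentally.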
Thanks