Role of Computer and Information Science in Biology

Presented by Dr G. P. S. Raghava, Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India & Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava@imtech.res.in
Web: http://www.imtech.res.in/raghava/

Major Applications & Challenges
• Introduction to Biology
• Genome Annotation: Gene Prediction
• Analysis and Comparison of Sequences
• Protein Structure Prediction
• DNA Chip (Microarray) technology
• Proteomics: Analysis of 2D gel
• Fingerprinting Technique
• Drug development
• Computer-Aided Vaccine Design

Hierarchy in Biology
Atoms → Molecules → Macromolecules → Organelles → Cells → Tissues → Organ Systems → Individual Organisms → Populations → Communities → Ecosystems → Biosphere

Animal cell

Human Chromosomes

Genes are linearly arranged along chromosomes

Chromosomes and DNA

DNA can be simplified to a string of four letters GATTACA

(RT)

Sequence to Structure: It’s a matter of dimensions!
• 1D Nucleic acid sequence: AGT-TTC-CCA-GGG…
• 1D Protein sequence: Met-Ala-Gly-Lys-His… (M – A – G – K – H…)
• 3D Spatial arrangement of atoms

Genome Annotation
The process of adding biological information and predictions to a sequenced genome framework

Importance of Sequence Comparison
• Protein Structure Prediction
  – Similar sequences have similar structure & function
  – Phylogenetic tree
  – Homology-based protein structure prediction
• Genome Annotation
  – Homology-based gene prediction
  – Function assignment & evolutionary studies
• Searching drug targets
  – Searching for sequences present or absent across genomes

Protein Sequence Alignment and Database Searching
• Alignment of Two Sequences (Pair-wise Alignment)
  – The Scoring Schemes or Weight Matrices
  – Techniques of Alignments
  – DOTPLOT
• Multiple Sequence Alignment (Alignment of > 2 Sequences)
  – Extending Dynamic Programming to more sequences
  – Progressive Alignment (Tree or Hierarchical Methods)
  – Iterative Techniques
    • Stochastic Algorithms (SA, GA, HMM)
    • Non Stochastic Algorithms
• Database Scanning
  – FASTA, BLAST, PSIBLAST, ISS
• Alignment of Whole Genomes
  – MUMmer (Maximal Unique Match)

Alignment of Two Sequences: Dealing with Gaps in Pair-wise Alignment
Sequence Comparison without Gaps
• Slide one sequence along the other (sliding-window method) to get the maximum score
• ALGAWDE
  ALATWDE
  Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7 ≈ 71.4
• Sequences of variable length should use dynamic programming
Sequence Comparison with Gaps
• Insertions and deletions are common
• The sliding-window method fails
• Alternative: generate all possible alignments
• Aligning 100 residues this way requires examining > 10^75 alignments
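The ungapped comparison on this slide can be sketched in a few lines of Python. This is a minimal illustration assuming identity scoring (1 for a match, 0 otherwise); the function names `identity_score`, `percent_identity` and `best_ungapped_score` are hypothetical and not taken from the slides.

```python
def identity_score(a: str, b: str) -> int:
    """Count positions where two equal-length sequences carry the same residue."""
    if len(a) != len(b):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x == y)


def percent_identity(a: str, b: str) -> float:
    """Percent identity (PID) = matches * 100 / alignment length."""
    return 100.0 * identity_score(a, b) / len(a)


def best_ungapped_score(a: str, b: str) -> int:
    """Slide b along a (no gaps) and keep the best identity score over all offsets."""
    best = 0
    for offset in range(-(len(b) - 1), len(a)):
        score = sum(
            1
            for i, y in enumerate(b)
            if 0 <= i + offset < len(a) and a[i + offset] == y
        )
        best = max(best, score)
    return best


score = identity_score("ALGAWDE", "ALATWDE")
pid = percent_identity("ALGAWDE", "ALATWDE")
print(score, round(pid, 1))  # 5 71.4
```

The sliding loop only shifts one sequence against the other, which is why it cannot recover alignments that need an insertion or deletion in the middle.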

Alternate Dot Matrix Plot
Diagonal runs of * show aligned/identical regions
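A dot matrix of this kind can be produced with a short Python sketch. The function name `dot_matrix` and the use of "." for non-matching cells are assumptions made for illustration.

```python
def dot_matrix(a: str, b: str) -> list[str]:
    """One text row per residue of a; '*' where it equals the residue of b in that column."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


# Rows follow ALGAWDE, columns follow ALATWDE.
for row in dot_matrix("ALGAWDE", "ALATWDE"):
    print(row)
```

Matching regions appear as runs of * along a diagonal; a run that jumps to a neighbouring diagonal suggests an insertion or deletion.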

Dynamic Programming
• Dynamic programming allows optimal alignment between two sequences
• Allows insertions and deletions, i.e. alignment with gaps
• Needleman and Wunsch algorithm (1970) for global alignment
• Smith & Waterman algorithm (1981) for local alignment
• Important steps
  – Create DOTPLOT between two sequences
  – Compute SUM matrix
  – Trace optimal path
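The three steps (score matrix, sum matrix, traceback) can be sketched as a small global-alignment routine in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, not values given on the slide, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # SUM matrix: score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
print(best)    # 4 under the assumed scoring
print(top)
print(bottom)
```

Under these assumed scores the gapped alignment (6 matches, 2 gaps) beats the ungapped one (5 matches, 2 mismatches), which is the point of allowing insertions and deletions.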

Alignment of Multiple Sequences
Extending Dynamic Programming to more sequences
  – Dynamic programming can be extended to more than two sequences
  – In practice it requires too much CPU time and memory (Murata et al., 1985)
  – MSA: limited to only 8-10 sequences (1989)
  – DCA (Divide and Conquer; Stoye et al., 1997): 20-25 sequences
  – OMA (Optimal Multiple Alignment; Reinert et al., 2000)
  – COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
  – Practical approach for multiple alignment
  – Compare all sequences pair-wise
  – Perform cluster analysis
  – Generate a hierarchy for alignment
  – First align the most similar pair of sequences
  – Then align that alignment with the next most similar alignment or sequence

Database scanning
Basic principles of Database searching
  – Search query sequence against all sequences in database
  – Calculate score and select top sequences
  – Dynamic programming is best
Approximation Algorithms: FASTA
  • Fast sequence search
  • Based on dotplot
  • Identify identical words (k-tuples)
  • Search for significant diagonals
  • Use PAM 250 for further refinement
  • Dynamic programming for narrow region
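The k-tuple and diagonal steps of FASTA can be illustrated with a short sketch. It covers only the word lookup and diagonal counting, not PAM 250 rescoring or the banded dynamic programming; the function name `ktuple_diagonals` and the default k=2 are assumptions.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query: str, target: str, k: int = 2) -> Counter:
    """Count identical k-tuples shared by query and target on each diagonal (i - j)."""
    # Index every word of length k in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    # Scan the target; each shared word adds one hit to its diagonal.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    return diagonals


hits = ktuple_diagonals("ALGAWDE", "ALATWDE")
print(hits.most_common(1))  # [(0, 3)]: words AL, WD and DE all lie on the main diagonal
```

Diagonals with many hits are the candidates that a FASTA-style search would pass on to the more expensive refinement steps.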

Principles of FASTA Algorithms

Database Scanning or Fold Recognition
• Concept of PSIBLAST
  – Improve the sensitivity of BLAST
  – Perform the BLAST search (gap handling)
  – Generate the position-specific score matrix (PSSM)
  – Use PSSM for next round of search
• Intermediate Sequence Search
  – Search query against protein database
  – Generate multiple alignment or profile
  – Use profile to search against PDB

Comparison of Whole Genomes
• MUMmer (Salzberg group, 1999, 2002)
  – Pair-wise sequence alignment of genomes
  – Assumes that the sequences are closely related
  – Allows detection of repeats, inverse repeats, SNPs
  – Domains inserted/deleted
  – Identifies the exact matches
• How it works
  – Identify the maximal unique matches (MUMs) in two genomes
  – Because the two genomes are similar, long MUMs will be present
  – Sort the MUMs found and extract the longest set of matches that occur in the same order in both genomes (ordered MUMs)
  – A suffix tree is used to identify MUMs
  – Close the gaps between MUMs (SNPs, large inserts)
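The "ordered MUM" step amounts to finding the longest subset of matches whose positions increase in both genomes. The sketch below shows only that step, using a longest-increasing-subsequence routine on hypothetical match coordinates; it does not build a suffix tree or find the MUMs themselves, and the function name is an assumption.

```python
from bisect import bisect_left


def longest_ordered_chain(matches):
    """matches: (position_in_genome_a, position_in_genome_b) pairs, one per MUM.
    Returns the longest chain that is increasing in both coordinates."""
    matches = sorted(matches)            # order by position in genome A
    tails, tail_vals = [], []            # best chain end (index, B position) per length
    prev = [None] * len(matches)         # back-pointers for reconstruction
    for idx, (_, b) in enumerate(matches):
        k = bisect_left(tail_vals, b)    # length of the chain this match extends
        if k == len(tails):
            tails.append(idx)
            tail_vals.append(b)
        else:
            tails[k] = idx
            tail_vals[k] = b
        prev[idx] = tails[k - 1] if k > 0 else None
    chain = []
    idx = tails[-1] if tails else None
    while idx is not None:
        chain.append(matches[idx])
        idx = prev[idx]
    return chain[::-1]


# Hypothetical MUM coordinates; (5, 20) is out of order and gets dropped.
mums = [(1, 1), (5, 20), (9, 4), (14, 8), (20, 15)]
print(longest_ordered_chain(mums))  # [(1, 1), (9, 4), (14, 8), (20, 15)]
```

Matches left out of the chain, such as (5, 20) here, are the candidates for rearrangements or repeats, and the stretches between consecutive chain members are the gaps to be closed.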

Protein Structure Prediction
• Experimental Techniques
  – X-ray Crystallography
  – NMR
• Limitations of Current Experimental Techniques
  – Protein DataBank (PDB) -> 24,000 protein structures
  – SwissProt -> 100,000 proteins
  – Non-Redundant (NR) -> 1,000 proteins
• Importance of Structure Prediction
  – Fill the gap between known sequences and structures
  – Protein engineering to alter the function of a protein
  – Rational drug design

Protein Structures

Techniques of Structure Prediction
• Computer simulation based on energy calculation
  – Based on physico-chemical principles
  – Thermodynamic equilibrium with a minimum free energy
  – Global minimum free energy of protein surface
• Knowledge Based approaches
  – Homology Based Approach
  – Threading Protein Sequence
  – Hierarchical Methods

Energy Minimization Techniques
Energy minimization based methods, in their pure form, make no a priori assumptions and attempt to locate the global minimum.
• Static Minimization Methods
  – Many classical potentials can be constructed
  – Assume that the atoms in a protein are static
  – Problems (large number of variables & minima, and validity of potentials)
• Dynamical Minimization Methods
  – Motions of atoms are also considered
  – Monte Carlo simulation (stochastic in nature, time is not considered)
  – Molecular Dynamics (time, quantum mechanical, classical equations)
• Limitations
  – Large number of degrees of freedom, CPU power not adequate
  – Interaction potentials are not good enough to model

Knowledge Based Approaches
• Homology Modelling
  – Needs homologues of known protein structure
  – Backbone modelling
  – Side chain modelling
  – Fails in absence of homology
• Threading Based Methods
  – New way of fold recognition
  – Sequence is fitted to known structures
  – Motif recognition
  – Loop & side chain modelling
  – Fails in absence of known examples

Hierarchical Methods
Intermediate structures are predicted, instead of predicting the tertiary structure of a protein directly from its amino acid sequence
• Prediction of backbone structure
  – Secondary structure (helix, sheet, coil)
  – Beta turn prediction
  – Super-secondary structure
• Tertiary structure prediction
• Limitations
  – Accuracy is only 75-80 %
  – Only three-state prediction

[Figure: cDNA microarray workflow. cDNA clones (probes) → PCR product amplification → purification → printing (0.1 nl/spot) → microarray; hybridise mRNA target to microarray → scanning (excitation by laser 1 and laser 2, emission) → overlay images and normalise → analysis]

Major Applications
• Identification of differentially expressed genes in diseased tissues (in presence of drug)
• Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
• Use of expression profiles of known diseases to diagnose and classify unknown genes

Terms/Jargons
Stanford/cDNA chip
• one slide/experiment
• one spot
• 1 gene => one spot or few spots (replica)
• control: control spots
• control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip
• one chip/experiment
• one probe/feature/cell
• 1 gene => many probes (20~25 mers)
• control: match and mismatch cells

Images: examples
Pseudo-colour overlay of Cy3 and Cy5

Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed

Processing of images
• Addressing or gridding
  – Assigning coordinates to each of the spots
• Segmentation
  – Classification of pixels either as foreground or as background
• Intensity determination for each spot
  – Foreground fluorescence intensity pairs (R, G)
  – Background intensities
  – Quality measures

Management of Microarray Data
• Magnitude of Data
  – Experiments
    • 50,000 genes in human
    • 320 cell types
    • 2,000 compounds
    • 3 time points
    • 2 concentrations
    • 2 replicates
  – Data Volume
    • 4*10^11 data points
    • 10^15 B = 1 petabyte of data
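The data-point estimate is simply the product of the experimental factors listed on the slide, which a few lines of Python can confirm. The bytes-per-data-point figure at the end is only what the slide's two totals imply when divided; it is not stated in the source.

```python
import math

# Experimental factors as listed on the slide.
factors = {
    "genes": 50_000,
    "cell_types": 320,
    "compounds": 2_000,
    "time_points": 3,
    "concentrations": 2,
    "replicates": 2,
}

data_points = math.prod(factors.values())
print(f"{data_points:.2e}")  # 3.84e+11, which the slide rounds to 4*10^11

# Implied storage per data point if the total is 10^15 bytes (an inference, not a source figure).
print(round(1e15 / data_points))
```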

Management of Microarray Data: Major Issues
• Large volume of microarray data in the last few years
  – Storage and efficient access
  – Comparison and integration of data
• Problem of data access and exchange
  – Data scattered around the Internet
  – Supplementary material of publications
  – Difficult for users to access relevant data
• Problems with existing databases
  – Diverse purposes
  – Developed for a specific purpose

Management of Microarray Data
• Specific Database
  – Platform (e.g. Stanford MA Database; SMD)
  – Organism (Yeast MA global viewer)
  – Project (Life cycle database of Drosophila)
• Problem with Supplement and MA databases
  – Lack of direct access
  – Quality not checked
  – No standard format
  – Incomplete data

Pre-processed cDNA Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.

Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1         0.46      0.30      0.80      1.51      0.90    ...
2        -0.10      0.49      0.24      0.06      0.46    ...
3         0.15      0.74      0.04      0.10      0.20    ...
4        -0.45     -1.03     -0.79     -0.56     -0.32    ...
5        -0.06      1.06      1.35      1.09     -1.09    ...

Gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
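The log-ratio and its colour convention can be written out as a small sketch. The intensities used below are made-up values for illustration, and the function names are hypothetical.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression level of one spot: log2(red intensity / green intensity)."""
    return math.log2(red / green)


def display_colour(value: float) -> str:
    """Conventional display: red for > 0, green for < 0, yellow for 0."""
    if value > 0:
        return "red"
    if value < 0:
        return "green"
    return "yellow"


# Hypothetical (red, green) intensity pairs.
for red, green in [(2000.0, 1000.0), (1000.0, 1000.0), (500.0, 1000.0)]:
    m = log_ratio(red, green)
    print(m, display_colour(m))
```

Taking the logarithm makes a two-fold induction (+1) and a two-fold repression (-1) symmetric around zero, which a raw ratio does not.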

Analysis of Microarray Data
• Analysis of images
• Preprocessing of gene expression data
• Normalization of data
  – Subtraction of background noise
  – Global/local normalization
  – Housekeeping genes (or same gene)
  – Expression as a ratio (test/reference) in log
• Differential gene expression
  – Repeats, and calculation of significance (t-test)
  – Significance of fold change by statistical methods
• Clustering
  – Supervised/Unsupervised (Hierarchical, K-means, SOM)
• Prediction or Supervised Machine Learning (SVM)

Normalization Techniques
• Global normalization
  – Divide each channel value by the channel mean
• Control spots
  – Common spots in both channels
  – Housekeeping genes
  – Ratio of intensity of the same gene in the two channels is used for correction
• Iterative linear regression
• Parametric nonlinear normalization
  – log(Cy3/Cy5) vs log(Cy5)
  – Fitted log ratio – observed log ratio
• General nonlinear normalization
  – LOESS
  – Curve between log(R/G) vs log(sqrt(R·G))
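Global normalization, the simplest technique in this list, can be sketched as dividing every value in a channel by that channel's mean. The intensities and the function name `global_normalize` are illustrative assumptions; the other techniques (control spots, regression, LOESS) are not shown.

```python
def global_normalize(channel: list[float]) -> list[float]:
    """Divide each intensity in a channel by the channel mean, so the mean becomes 1."""
    mean = sum(channel) / len(channel)
    return [value / mean for value in channel]


# Hypothetical raw intensities: the red channel is uniformly twice as bright as green.
red = [200.0, 400.0, 600.0]
green = [100.0, 200.0, 300.0]

print(global_normalize(red))    # [0.5, 1.0, 1.5]
print(global_normalize(green))  # [0.5, 1.0, 1.5]
```

After scaling, the overall brightness difference between the two dyes disappears, so the remaining red/green differences reflect individual spots rather than the channel as a whole.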

Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
• Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
• Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Issues in Clustering
• Pre-processing (image analysis and normalization)
• Which genes (variables) are used
• Which samples are used
• Which distance measure is used
• Which algorithm is applied
• How to decide the number of clusters K

Unsupervised Learning
• Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question "how many gene clusters are there?"]
• K-means clustering: assumes there are K clusters. [What if this assumption is incorrect?]
• Model-based clustering: the number of clusters is determined dynamically. [Could be one of the most promising methods.]
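K-means, the second method above, can be sketched in plain Python. For reproducibility this sketch seeds the centres with the first K profiles, which is an assumption made here (random seeding is more usual); the expression profiles are made-up values.

```python
def kmeans(points, k, iterations=20):
    """Plain K-means with squared Euclidean distance; returns (labels, centres)."""
    centres = [list(p) for p in points[:k]]  # assumed seeding: first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile goes to its nearest centre.
        labels = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centres[c])))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical log-ratio profiles of four genes over two slides.
profiles = [(0.1, 0.2), (2.0, 2.1), (0.0, 0.3), (2.2, 1.9)]
labels, centres = kmeans(profiles, k=2)
print(labels)  # [0, 1, 0, 1]
```

K has to be supplied by the caller, which is exactly the weakness noted in the bracketed remark: the algorithm returns K clusters whether or not the data contain that many.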

Supervised Analysis
• Fisher’s linear discriminant analysis
• Quadratic discriminant analysis
• Logistic regression (a linear discriminant analysis)
• Neural networks
• Support vector machine

Traditional Proteomics
• 1D gel electrophoresis (SDS-PAGE)
• 2D gel electrophoresis
• Protein Chips
  – Chips coated with proteins/antibodies
  – Large-scale version of ELISA
• Mass Spectrometry
  – MALDI: mass fingerprinting
  – Electrospray and tandem mass spectrometry
• Sequencing of peptides (N->C)
• Matching in genome/proteome databases

Overview of 2D Gel
• SDS-PAGE + Isoelectric focusing (IEF)
  – Gene expression studies
  – Medical applications
  – Sample experiments
• Capturing and Analyzing Data
  – Image acquisition
  – Image sizing & orientation
  – Spot identification
  – Matching and analysis

Comparison/Matching of Gel Images
• Compare 2 gel images
  – Set x and y axes
  – Overlap matching spots
  – Compare intensity of spots
• Scan against database
  – Compare query gel with all gels
  – Calculate similarity score
  – Sort based on score

Proteomics: Fingerprints of Disease
Normal Cells vs Disease Cells
Phenotypic Changes
• Differential protein expression
• Protein nitration patterns
• Altered phosphorylation
• Altered glycosylation profiles
Utility
• Target discovery
• Disease pathways
• Disease biomarkers

Fingerprinting Technique
• What is fingerprinting
  – A technique to create a specific pattern for a given organism/person
  – To compare patterns of query and target objects
  – To create a phylogenetic tree/classification based on patterns
• Types of Fingerprinting
  – DNA fingerprinting
  – Mass/peptide fingerprinting
  – Properties based (toxicity, classification)
  – Domain/conserved pattern fingerprinting
• Common Applications
  – Paternity and maternity
  – Criminal identification and forensics
  – Personal identification
  – Classification/identification of organisms
  – Classification of cells

Fingerprinting Techniques: Principles & Applications
• What is fingerprinting
• Types of Fingerprinting
• Common Applications
• Role of Computer in DNA Fingerprinting
  – Searching restriction enzymes
  – Searching VNTRs
  – Computation of size of DNA fragments
  – Optimization of gels
  – Comparison of patterns
  – Creation of phylogenetic tree

Drug Design
History of Drug/Vaccine development
  – Plants or Natural Products
    • Plants and natural products were the source of medicinal substances
    • Example: foxglove, used to treat congestive heart failure
    • Foxglove contains digitalis and cardiotonic glycosides
    • Identification of the active component
  – Accidental Observations
    • Penicillin is one good example
    • Alexander Fleming observed the effect of a mold
    • The mold (Penicillium) produces the substance penicillin
    • Discovery of penicillin led to large-scale screening
    • Soil microorganisms were grown and tested
    • Streptomycin, neomycin, gentamicin, tetracyclines etc.

Drug Design
• Chemical Modification of Known Drugs
  – Drug improvement by chemical modification
  – Penicillin G -> Methicillin; morphine -> nalorphine
• Receptor-based drug design
  – Receptor is the target (usually a protein)
  – Drug molecule binds to cause biological effects
  – Also called the lock and key system
  – Structure determination of the receptor is important
• Ligand-based drug design
  – Search for a lead compound or active ligand
  – Structure of the ligand guides the drug design process

Drug Design based on Bioinformatics Tools
• Detect the Molecular Bases for Disease
  – Detection of drug binding site
  – Tailor drug to bind at that site
  – Protein modeling techniques
  – Traditional method (brute force testing)
• Rational drug design techniques
  – Screen likely compounds built
  – Modeling large number of compounds (automated)
  – Application of artificial intelligence
  – Limitation of known structures

Important Points in Drug Design based on Bioinformatics Tools
• Application of Genome
  – 3 billion base pairs
  – 30,000 unique genes
  – Any gene may be a potential drug target
  – ~500 unique targets
  – There may be 10 to 100 variants at each target gene
  – 1.4 million SNPs
  – 10^200 potential small molecules

Concept of Drug and Vaccine
• Concept of Drug
  – Kill invading foreign pathogens
  – Inhibit the growth of pathogens
• Concept of Vaccine
  – Generate memory cells
  – Train the immune system to face various existing disease agents

VACCINES
A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE: MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO THE NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS

Computer Aided Vaccine Design
• Whole Organism of Pathogen
  – Consists of more than 4000 genes and proteins
  – Genomes have millions of base pairs
• Target antigen to recognise pathogen
  – Search for vaccine target (essential and non-self)
  – Consists of an amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T ……)
• Search for antigenic region (peptide of length 9 amino acids)

Major steps of endogenous antigen processing

Computer Aided Vaccine Design
• Problem of Pattern Recognition (classes: Epitope, Non-epitope)
  – ATGGTRDAR
  – LMRGTCAAY
  – RTTGTRAWR
  – EMGGTCAAY
  – ATGGTRKAR
  – GTCVGYATT
• Commonly used techniques
  – Statistical (Motif and Matrix)
  – AI Techniques

Why computational tools are required for prediction
200 aa protein → chopped into overlapping peptides of 9 amino acids → 192 peptides → Bioinformatics Tools → 10-20 predicted peptides → in vitro or in vivo experiments for detecting which snippets of protein will spark an immune response.
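The count of 192 follows from sliding a 9-residue window one position at a time along a 200-residue protein: 200 - 9 + 1 = 192. A minimal sketch, using a made-up 200-residue sequence and a hypothetical function name:

```python
def overlapping_peptides(protein: str, length: int = 9) -> list[str]:
    """All overlapping peptides of the given length, shifting by one residue each time."""
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]


# Hypothetical protein: the 20 standard amino acids repeated 10 times (200 residues).
protein = "ACDEFGHIKLMNPQRSTVWY" * 10
peptides = overlapping_peptides(protein)

print(len(protein), len(peptides))  # 200 192
print(peptides[0], peptides[-1])
```

Prediction tools then rank these candidate peptides so that only the 10-20 most promising ones need to be tested experimentally.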

Thanks