
What is Systems Biology? § Systems biology is an academic field that seeks to integrate different levels of information to understand how biological systems function. By studying the relationships and interactions between various parts of a biological system (e.g., gene and protein networks involved in cell signaling, metabolic pathways, organelles, cells, physiological systems, organisms, etc.), it is hoped that eventually an understandable model of the whole system can be developed. (from Wikipedia) § Use high-throughput methods to quantify changes in RNA and protein in response to perturbation of the cell § Build regulatory networks linking genes, RNAs, and proteins § Develop mathematical models to represent the system § Predict how different perturbations will affect the system § Test predictions for validity § Refine models and repeat

Why use Systems Biology? § Whole organism view • Identify total physiological capacity • More complete understanding • New drugs/vaccine candidates § Produce resource of data and materials • More efficient • Collaborative

Systems Biology approaches § Genome (DNA sequencing) § Transcriptome (RNA microarrays) § Proteome (Mass spectrometry) § Metabolome (Mass spectrometry) § Phenome (Cell biology) § ‘ome (anything else)

DNA microarrays § Different uses • Comparative genomic hybridization (CGH) • Resequencing/SNP analysis • Expression profiling • Chromatin immunoprecipitation § Data analysis • Normalization • T-testing • Analysis of variance (ANOVA) • Clustering

What are DNA microarrays? § Spots of DNA arranged on a solid support (usually glass or silicon) § Different sources of DNA • Cloned DNA (genomic or cDNA) – spotted on glass • PCR products – spotted on glass • Oligonucleotides (25- to 70-mers) Ø Spotted or printed onto glass Ø Synthesized directly on silicon § Different densities • Spotted or printed – 5,000-30,000/slide • Synthesized oligos – 200,000-500,000/slide

How do microarrays work? § Label mRNA or gDNA with fluorescent probe § Hybridize to microarray and wash off excess probe § Read in a fluorescent scanner § Quantify signal for each spot § Signal ~ hybridization ~ abundance of sequence in probe

One-color (Affymetrix or Nimblegen)

Two color (Spotted or printed)

A typical two-color microarray § Plot red vs. green intensity (Leishmania procyclics vs. metacyclics) • Equal green/red signal = yellow: not differentially expressed • Greater green signal: procyclic-specific • Greater red signal: metacyclic-specific

Problems with microarrays § Cross-hybridization between probes • false positives (wrong gene) • false negatives (hides differential expression) • oligos are better § Poor experimental design • Not enough replicates § Inappropriate analysis • Normalization of signal within and between arrays • Need robust statistical analysis

Within-array normalization § Lowess normalization § MA-plots
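The quantities behind an MA-plot can be sketched as follows. For each spot, M is the log2 ratio of red to green signal and A is the mean log2 intensity. This is a minimal illustration with hypothetical intensities; it swaps lowess for a much simpler global median centering of M, so it does not correct intensity-dependent bias the way a lowess fit would.

```python
import math
from statistics import median

def ma_values(red, green):
    """Return (M, A) lists for paired red/green spot intensities."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * (math.log2(r) + math.log2(g)) for r, g in zip(red, green)]
    return m, a

def median_center(m):
    """Shift log-ratios so their median is zero (stand-in for lowess)."""
    shift = median(m)
    return [x - shift for x in m]

# Hypothetical intensities: a uniform 2x red dye bias, plus one spot
# that is genuinely higher in the red channel.
red = [200.0, 400.0, 800.0, 1600.0, 6400.0]
green = [100.0, 200.0, 400.0, 800.0, 800.0]

m, a = ma_values(red, green)
m_norm = median_center(m)
```

After centering, the four spots that only carried the dye bias sit at M = 0, and the last spot keeps a positive log-ratio.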

Between-array normalization § Invariant gene(s) § RNA spike-in § Median scaling § Quantile scaling § Median and quantile normalization are predicated upon the arrays in question having the same distribution. That is, only if you can safely assume that the bulk of genes have the same expression across the arrays can you use those methods.

Quantile normalization § Robust Multichip Average (RMA) § http://rmaexpress.bmbolstad.com
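A minimal sketch of the quantile normalization step alone, assuming each array is a plain list of intensities for the same probes in the same order. It is not the RMAExpress implementation, and RMA as a whole involves more than this step; the data are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same distribution: the mean of the sorted arrays."""
    n = len(arrays[0])
    # Sort each array, then average across arrays at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        # Positions of this array's values from smallest to largest
        # (ties are broken by position, a simplification).
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Three hypothetical chips measuring the same four probes.
chips = [[5.0, 2.0, 3.0, 4.0], [4.0, 1.0, 4.5, 2.0], [3.0, 4.0, 6.0, 8.0]]
normalized = quantile_normalize(chips)
```

Each normalized chip holds the same set of values, so the chips differ only in which probe received which rank.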

Finding Significant Genes § Assume we will compare two conditions with multiple replicates for each class § Our goal is to find genes that are significantly different between these classes § These are the genes that we will use for later data mining

Fold-change § Difference of 2-fold? § Suffers from being arbitrary and from not taking into account systematic variation in the data

T-testing § $t = \dfrac{\text{signal}}{\text{noise}} = \dfrac{\text{difference between means}}{\text{variability of groups}} = \dfrac{\langle X_q \rangle - \langle X_c \rangle}{SE(X_q - X_c)}$ § Tests whether the means of the query and reference groups are the same § Essentially measures signal-to-noise § Calculate p-value (permutations or distributions)
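The signal-to-noise statistic and a permutation p-value can be sketched for a single gene as follows. The slide does not say which standard error is meant, so this sketch assumes the unequal-variance (Welch) form; the expression values are hypothetical, and the permutation test enumerates every relabeling, which is only practical for small replicate counts.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_statistic(query, control):
    """Difference between group means divided by its standard error (Welch form)."""
    se = sqrt(variance(query) / len(query) + variance(control) / len(control))
    return (mean(query) - mean(control)) / se

def permutation_p_value(query, control):
    """Two-sided p-value from all relabelings of the pooled samples."""
    pooled = query + control
    observed = abs(t_statistic(query, control))
    hits = total = 0
    for idx in combinations(range(len(pooled)), len(query)):
        chosen = set(idx)
        q = [pooled[i] for i in idx]
        c = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts itself.
        if abs(t_statistic(q, c)) >= observed - 1e-12:
            hits += 1
    return hits / total

# Hypothetical log2 expression values for one gene, four replicates per class.
query = [8.1, 8.4, 7.9, 8.3]
control = [6.0, 6.2, 5.9, 6.3]

t = t_statistic(query, control)
p = permutation_p_value(query, control)
```

With four replicates per class there are only 70 possible labelings, so the smallest two-sided p-value attainable is 2/70. This is one reason replicate number matters.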

Improvement over fold-change § Figure: two cases, one labeled "A significant difference" and one labeled "Probably not" § But: if you use pooled RNAs, you can't tell the difference between the top and bottom cases.

Analysis of variance (ANOVA) § Which genes are most significant for separating classes of samples? § Calculate p-value (permutations or distributions) § Reduces to a t-test for 2 groups

ANOVA § Assign experiments to >2 groups § Calculate F-ratio for each gene • F = mean square (groups)/mean square (error) • Between-group variability/within-group variability • The larger the value of F, the greater the difference between group means relative to the sampling error variability § Calculate p-value associated with F-ratio
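The F-ratio defined on this slide can be computed for one gene as in the sketch below. It is a plain one-way ANOVA with hypothetical expression values, and it stops at F; the p-value would still come from an F-distribution or from permutation.

```python
from statistics import mean

def f_ratio(groups):
    """One-way ANOVA: F = mean square (groups) / mean square (error)."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    k, n = len(groups), len(values)
    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Within-group (error) sum of squares: values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical expression values for one gene in three groups of experiments.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f = f_ratio(groups)
```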

Probability value determination § Calculated from: • Theoretical t-distribution • Permutation § Correction for multiple testing • Family-Wise Error Rate (FWER) • Bonferroni – too stringent • Adjusted Bonferroni • Benjamini and Hochberg False Discovery Rate (FDR)
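Two of the corrections listed above can be sketched as adjusted p-values: Bonferroni (controls the FWER) and the Benjamini and Hochberg step-up procedure (controls the FDR). The p-values are hypothetical, and the function names are chosen here for illustration only.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene p-values.
p_values = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
```

At a 0.05 cutoff, Bonferroni keeps two of these four genes while Benjamini and Hochberg keeps all four, which illustrates why Bonferroni is described as too stringent for thousands of genes.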

Finding patterns of expression § Individual genes don’t tell the whole story § Identify groups of genes with similar differential expression patterns § Cluster analysis § Statistical reliability is still an issue

Clustering algorithms § Inputs • Raw data matrix or similarity matrix • Number of clusters or some other parameters § Classification of clustering algorithms • Hierarchical vs. partitional • Heuristic-based vs. model-based • Soft vs. hard

Hierarchical clustering • Cluster genes with most similar expression patterns • Cluster samples with most similar gene expression
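A minimal sketch of agglomerative hierarchical clustering of genes. The slide does not name a distance or linkage, so this assumes 1 minus the Pearson correlation as the distance and average linkage; the gene names and profiles are hypothetical. Clustering samples instead of genes would use the same code on the transposed matrix.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def hierarchical_cluster(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    names = list(profiles)
    dist = {(a, b): 1 - pearson(profiles[a], profiles[b]) for a in names for b in names}
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average linkage: mean distance over all cross-cluster pairs.
                d = mean([dist[a, b] for a in clusters[i] for b in clusters[j]])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters

# Hypothetical profiles: two rising genes and two falling genes.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.2],
}
clusters = hierarchical_cluster(profiles, 2)
```

Correlation-based distance groups genes by the shape of their profile rather than by absolute level, which is why geneA and geneB end up together despite a two-fold difference in magnitude.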

Example of clustering

Microarray analysis software § SAM (Significance Analysis of Microarrays) http://www-stat.stanford.edu/~tibs/SAM/ § TM4 (MIDAS, MADAM, MEV, Spotfinder) http://www.tigr.org/software/microarray.shtml § Bioconductor http://www.bioconductor.org/ § GeneSpring GX http://www.chem.agilent.com/scripts/pds.asp?lpage=27881 § Rosetta Resolver http://www.rosettabio.com/products/resolver/default.htm § Many others http://ihome.cuhk.edu.hk/~b400559/arraysoft.html

Proteomics § 2-D gel electrophoresis • Isoelectric focusing Ø ISO-DALT Ø NEPHGE Ø IPG-DALT • SDS-PAGE • Computer-aided image analysis § Protein identification • Edman degradation Ø Expensive Ø Slow • Mass spectrometry (MALDI-TOF-MS, LC/ESI-MS) Ø Sensitive Ø High-throughput

Mass spectrometry in proteomics § Molecular Weight determination § Protein identification § Relative quantitation § Post-translational modifications § Biomolecular interactions

What is Mass Spectrometry? § Proteins are separated or filtered according to their mass-to-charge (m/z) ratio and detected. § The resulting mass spectrum is a plot of the (relative) abundance of the produced ions as a function of the m/z ratio. § Usually carried out after liquid chromatography (LC-MS/MS) or matrix-assisted laser desorption/ionization (MALDI-TOF). § The sequence of the peptide is determined by comparison of the acquired mass spectrum with the predicted spectrum from genome/protein sequence databases, using computer algorithms.

Typical proteomics protocol § Cell / organism § Lysis and fractionation § Protein purification • Chromatography • 1D gel • 2D gel § Enzymatic digestion of the protein(s) § Separation of resulting peptides by chromatography • Reverse phase • IEX-RP § Sequence analysis using MS/MS
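The enzymatic digestion step can be mimicked in silico. The protocol slide does not name the enzyme, so this sketch assumes trypsin (a tryptic digest appears on a later slide) and the commonly used rule of cleaving after K or R unless the next residue is P. Missed cleavages are ignored, and the input is the first 20 residues of the MP81 sequence shown later in these slides.

```python
def tryptic_digest(sequence):
    """Cleave C-terminal to K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Whatever follows the last cleavage site is the C-terminal peptide.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides

peptides = tryptic_digest("MRRLTRRSGRLSGKGNGGSC")
```

Search programs perform the same kind of digestion on every database protein to generate candidate peptides for matching.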

Tandem MS § Figure: intensity vs. time, with panels labeled "1. Full Scan MS" and "2. Full Scan MS 2" (* = ion selected)

Collision-induced dissociation spectrum § Amino acid sequence of a peptide identified by MS/MS analysis from the tryptic digest of p46: S-A-V-F-A-A-P-R
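Reading a sequence from a CID spectrum relies on the predictable masses of fragment ions. The sketch below computes singly charged b- and y-ion m/z values for the peptide S-A-V-F-A-A-P-R, assuming standard monoisotopic residue masses and no modifications. It is a worked illustration, not the scoring used by any particular search program.

```python
# Monoisotopic residue masses (Da) for the residues in this peptide only.
RESIDUE_MASS = {
    "S": 87.03203, "A": 71.03711, "V": 99.06841,
    "F": 147.06841, "P": 97.05276, "R": 156.10111,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z lists (b1..b(n-1), y1..y(n-1))."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    n = len(masses)
    # b ions contain the N-terminal residues; y ions contain the C-terminal ones.
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    y = [sum(masses[n - i:]) + WATER + PROTON for i in range(1, n)]
    return b, y

def precursor_mh(peptide):
    """(M+H)+ of the intact peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER + PROTON

b_ions, y_ions = fragment_ions("SAVFAAPR")
mh = precursor_mh("SAVFAAPR")
```

Adjacent ions in either series differ by one residue mass, which is how the sequence can be read off the spectrum.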

Peptide identification from CID spectra
s072999_ap_tb07.0369.2.out
SEQUEST v.22, Copyright 1993-96
# | Rank | (M+H)+ | C*10^4 | Reference | Peptide
1. | 1 | 1037.1 | 3.9923 | mHEL61 | (F)SSGKVRVCER
2. | 2 | 1037.2 | 2.9684 | 6A9.TR | (V)VGGIGTTFER
3. | 3 | 1037.2 | 2.8651 | gi1395223 | (A)RFFEAGNVP
4. | 4 | 1037.1 | 2.7472 | 18L22.TF | (R)VDDSGKMER
5. | 5 | 1037.1 | 2.7390 | trypEf4.p1p | (S)VDDAYM*IGH

Protein identification from multiple peptides
>gb-AAK64278.1 Trypanosoma brucei RNA-editing complex protein MP81 gene, complete cds; nuclear gene for mitochondrial product
MRRLTRRSGR LSGKGNGGSC LQMSPTHVGA VVTWALNRLM PLHTRTIPLR CSLPTPESGT TEPRELCFYE TFELTEEDVH YLLLHEAHVK HGVLLNVPPQ LAPNGTPPEV PEVIMPAAQL ERMGGMKLAY EPTHLPPPLH TTGARQLVLD ESFYTTPTKE KKATTTAVSH VSESTAASGG RGGASATAAG TALPPRLPPD PTMKFHCSAC GKAFRLKFSA DHHVKLNHGS DPKAAVVDGP GEGELLGGAV TITTAKVAKH SSSAASGTAS RAGDSATLDV KQQPDPQKEL SAPGISAVKI PYSKAVLSLP DDELVDELLI DVWDAVAAQR DDVPKSNSAN IFLPFASVVT GTADRRKEME AVARPTARAT PEGAAPGIKR PGAMAGGAVA VGKGRSGGQI LPIRELIKKY PNPFGDSPNA AVQDLENEPL NPFLPEEELA AQLQVACEED TVVTPSACTT DVSTGSVIGK KGSLEKLKEK LRGTRPSMAA SAAKRRFTCP ICVEKQQTLQ QQQSENVGSG FCTDIPSFRL LDALLDHVES VHGEELTEDQ LRELYAKQRQ STLYPQKSST GDGAGSRETP DDSEKKEGSV GNTNMDELKS LPEEVRRVVP PAPVEQDALA VHIRAGSNAL MIGRIADVQH GFLGAMTVTQ YVLEVDGDER INSKGVTTPA SACTPDPAST KAVEAKGEEG EVVEPEKEFI VIRCMGDNFP ASLLKDQVKL GSRVLVQGTL RMNRHVDDVS KRLHAYPFIQ VVPPLGYVKV VG
Mass (average): 81294.2
Identifier: gi|14495336
Database: D:/Xcalibur/database//t_bruceiprot.fasta
Protein Coverage: 223/762 = 29.3% by amino acid count, 23252.5/81294.2 = 28.6% by mass
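The "coverage by amino acid count" figure is the number of residues covered by at least one identified peptide, divided by the protein length. The sketch below shows the calculation on the first 20 residues of the sequence above. The list of matched peptides is hypothetical, since the slide does not list the peptides behind its 223/762 figure.

```python
def sequence_coverage(protein, peptides):
    """Count residues covered by at least one peptide match."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            # Keep searching in case the peptide occurs more than once.
            start = protein.find(pep, start + 1)
    return sum(covered), len(protein)

protein = "MRRLTRRSGRLSGKGNGGSC"        # first 20 residues only, for brevity
matched = ["LTR", "LSGK", "SGK"]        # hypothetical identified peptides

hits, total = sequence_coverage(protein, matched)
coverage_percent = 100.0 * hits / total
```

Overlapping peptides such as LSGK and SGK are counted once per residue, so coverage can never exceed 100%.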

Relative quantitation using ICAT (Isotope Coded Affinity Tagging) § Gygi et al. Nat Biotech 17:994 (1999)

Multidimensional Protein Identification Technology (MudPIT) § High throughput § 100s of proteins § Reiterative § Exclude previous ions § 1000s of proteins § Washburn et al. Nat Biotech 19:242 (2001)

Software § Data acquisition • Xcalibur (proprietary) § CID spectrum filtration • In-house programs § Peptide identification • SEQUEST, Mascot, ProbID, COMET § Compilation • DTASelect, Contrast, PeptideProphet § Protein assignment • SEQUEST, ProteinProphet • LIMS • SBeams

Pathways and networks in systems biology § Linking genes related to cellular processes § Elucidating the effect changes in biochemical pathways may have on cellular biology § Using microarrays to find co-expression and infer systemic relation § Identifying interactions and networks between multiple proteins § Finding and charting the flow of chemical compounds created by biochemical processes

Pathways vs. networks § Gene networks • Clusters of genes (or gene products) with evidence of co-expression • Connections usually represent degrees of co-expression • In-depth knowledge of process is not necessary • Networks are non-predictive § Biochemical pathways • Series of chained chemical reactions • Connections represent describable (and quantifiable) relations between molecules, proteins, lipids, etc. • Enzymatic process is elucidated • Changes via perturbation are predictable downstream
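A gene network of the kind described here can be sketched by connecting genes whose expression profiles are strongly correlated. The threshold of 0.9, the use of absolute Pearson correlation (so that anti-correlated genes are also linked), and the profiles themselves are assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def coexpression_edges(profiles, threshold=0.9):
    """Return (gene, gene, r) for every pair with |r| at or above the threshold."""
    edges = []
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= threshold:
            edges.append((a, b, round(r, 3)))
    return edges

# Hypothetical expression profiles across five conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.5],
    "g3": [5.0, 4.0, 3.0, 2.0, 1.0],
    "g4": [3.0, 1.0, 4.0, 1.0, 5.0],
}
edges = coexpression_edges(profiles)
```

The edges record only that profiles move together. Nothing about mechanism is encoded, which is why such networks are described as non-predictive.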

Pathways vs. networks
Feature | Gene networks | Biochemical pathways
Curation | Relatively easy: automated and manual | Difficult: mostly manual
Nodes | Genes or gene products | Any general molecule
Edges | Levels of co-expression/influence | Representation of possibly quantifiable mechanisms between compounds or a qualitative relation
Fidelity | Low – usually very little detail | High – specific processes
Predictive power | Relatively low | Relatively high

Network software/databases § BioCyc/MetaCyc § KEGG § BioCarta § BioModels § Cytoscape § E-cell

BioCyc/MetaCyc § http://biocyc.org/ & http://metacyc.org § Krieger et al., Nucleic Acids Res. 32:D438 (2004) § Pathway analysis for >900 organisms

BioCyc/MetaCyc § 260 organism-specific databases • Automated annotation using PathoLogic software (Tier 3) • Some manual curation (Tier 2) (H. sapiens, P. falciparum, 11 bacteria) • Extensive manual curation (Tier 1) (EcoCyc and MetaCyc)

Kyoto Encyclopedia of Genes and Genomes § http://www.genome.jp/kegg/ § Pathways from 348 organisms § Links with other databases

Kyoto Encyclopedia of Genes and Genomes

BioCarta database § Corporate-owned, publicly curated pathway database § Series of interactive, “cartoon” pathway maps § Predominantly human and mouse pathways § Contains 160,000 gene entries and 355 pathways § http://www.biocarta.com

http://www.biocarta.com

Glycolysis pathway http://www.biocarta.com

BioModels database § Database for published, quantitative models of biochemical processes § All models/pathways curated manually, compliant with MIRIAM § Models can be output in SBML format for quantitative modeling § 86 curated models, 40 models pending curation § http://www.biomodels.net

http://www.biomodels.net

Glycolysis pathway(s) http://www.biomodels.net

Comparison of pathway databases
Feature | MetaCyc/BioCyc | KEGG PATHWAYS | BioCarta | BioModels
Curation | Manual and automated | Automated | Manual | Manual
Size | ~621+ pathways | ~289 reference pathways | ~355 pathways | ~126 models
Nomenclature | EC, GO | EC, KO | None | GO
Organism coverage | ~500 species | ~475 species | Primarily human and mouse | Various
Visuals | Species-specific custom | Reference and species-specific | Animated, cartoonish | Non-standardized
Primary usage | PGDB, computational biology | PGDB, pathway comparisons | Human pathways, disease | Simulations, modeling

Cytoscape http://www.cytoscape.org/index.php

Cytoscape § Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data. § Plugins are available for network and molecular profiling analyses, new layouts, additional file format support and connection with databases.

Cytoscape § Input • Molecular interaction networks such as protein-protein (yeast 2-hybrid and TAP-tag) and/or protein-DNA interaction pairs (e.g. BIND and TRANSFAC databases) • mRNA expression profiles • Gene functional annotations from the Gene Ontology (GO) and KEGG databases § Visualization • Customize network data display using powerful visual styles. • View a superposition of gene expression ratios and p-values on the network. Expression data can be mapped to node color, label, border thickness, or border color, etc. • Lay out networks in two dimensions. A variety of layout algorithms are available, including cyclic and spring-embedded layouts. § Analysis • Filter the network to select subsets of nodes and/or interactions. • Find active sub-networks/pathway modules. • Find clusters (highly interconnected regions) in any network loaded into Cytoscape. • More plugins available on the plugins page.

Cytoscape

E-cell § E-Cell is an international research project aiming to model and reconstruct biological phenomena in silico, and developing necessary theoretical supports, technologies and software platforms to allow precise whole cell simulation • Modeling methodologies, formalisms and techniques, including technologies to predict, obtain or estimate parameters such as reaction rates and concentrations of molecules in the cell • E-Cell System, a software platform for modeling, simulation and analysis of complex, heterogeneous and multi-scale systems • Numerical simulation algorithms • Mathematical analysis methods
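To give a feel for what numerical simulation of a reaction involves, here is a minimal sketch that integrates a single Michaelis-Menten reaction converting substrate S to product P with a fourth-order Runge-Kutta step. It is not E-Cell code and does not use the E-Cell System API; the rate law, parameter values, and function name are assumptions chosen for illustration.

```python
def simulate(s0, vmax, km, dt, steps):
    """Integrate dS/dt = -vmax*S/(km+S) and dP/dt = +vmax*S/(km+S) with RK4."""
    def rate(s):
        return vmax * s / (km + s)

    s, p = s0, 0.0
    trajectory = [(0.0, s, p)]
    for n in range(1, steps + 1):
        # Classical Runge-Kutta stages for the substrate equation.
        k1 = -rate(s)
        k2 = -rate(s + 0.5 * dt * k1)
        k3 = -rate(s + 0.5 * dt * k2)
        k4 = -rate(s + dt * k3)
        ds = dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        s += ds
        p -= ds  # every molecule of S consumed becomes P
        trajectory.append((n * dt, s, p))
    return trajectory

# Hypothetical parameters in arbitrary concentration and time units.
trajectory = simulate(s0=10.0, vmax=1.0, km=2.0, dt=0.1, steps=200)
final_time, final_s, final_p = trajectory[-1]
```

A whole-cell model couples many such rate equations, so parameters like `vmax` and `km` must be measured, predicted, or estimated for each reaction, which is the parameter problem this slide mentions.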

E-cell http://www.e-cell.org/

E-cell projects § Mitochondrion (Yugi) § E-Neuron (Kikuchi) § E2coli (Hashimoto) § e-Rice (Ishii, Nakayama) § Erythrocyte (Kinoshita, Nakayama) § Cell Signaling (Shimizu) § Bacterial chemotaxis (Matsuzaki) § Circadian rhythm (Miyoshi, Nakayama) § Diabetes (Sano, Naito) § Mathematical Analysis (Kikuchi) § Myocardial Cell

E-CELL simulation environment

Image from Tomita et al., 2001

ATP starvation simulation § Figure panels: ATP level; mRNA level § Images from Tomita et al., 1999