
- Number of slides: 64
Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards • San Diego State University • Fellowship for Interpretation of Genomes • The Burnham Institute for Medical Research
Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?
Why Metagenomics? • What is there? • How many are there? • What are they doing?
How do you sequence the environment? • Extract DNA
CsCl step gradient: 1.1 g ml⁻¹, 1.35 g ml⁻¹, 1.7 g ml⁻¹
CsCl step gradient
How do you sequence the environment? • Extract DNA • Create library
Linker-Amplified Shotgun Libraries (LASLs) • Soil Extraction Kit • This method produces high-coverage libraries of over 1 million clones from as little as 1 ng DNA • David Mead - Breitbart (2002) PNAS
How do you sequence the environment? • Extract DNA • Create library • Sequence fragments
Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?
Why Phages? • Phages are viruses that infect bacteria – 10:1 ratio of phages:bacteria – 10³¹ phages on the planet • Specific interactions (probably) – one virus : one host • Small genome size – Higher coverage • Horizontal gene transfer – 10²⁵-10²⁸ bp DNA per year in the oceans
Uncultured Viruses • 200 liters water / 5-500 g fresh fecal matter • Concentrate and purify viruses • Epifluorescent microscopy • Extract nucleic acids (DNA/RNA) • LASL • Sequence
Bioinformatics • BLAST against NR – blastx, tblastn, tblastx • BLAST against boutique databases – Complete phage genomes, ACLAME, other libraries, 16S • Parsing to present data in a useful format
BLAST and Parsing • http://phage.sdsu.edu/blast • Submit BLAST to local and remote databases – Local (as fast as possible) – NCBI (one search every 3 seconds) • Many concurrent searches – One search versus 1,000 searches • Parse data into tables – Access to taxonomy etc.
Most Viral Genes are Unknown • Known 22% • Unknown 78% • TBLAST (E<0.001), 3,093 sequences • Breitbart (2002) PNAS; Rohwer (2003) Cell
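A known/unknown split like the one on this slide can be computed from BLAST hits. The sketch below is a minimal illustration, not the pipeline used in the talk. It assumes the hits are in BLAST tabular format (`-outfmt 6`), where column 1 is the query ID and column 11 is the E-value. The function name `known_fraction` and the sample reads are hypothetical.

```python
def known_fraction(query_ids, tabular_lines, max_evalue=0.001):
    """Return the fraction of queries with at least one hit below max_evalue.

    tabular_lines: lines of BLAST tabular output (assumed -outfmt 6 layout:
    query ID in column 1, E-value in column 11).
    """
    queries = set(query_ids)
    known = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            known.add(query)
    return len(known & queries) / len(queries)


# Hypothetical data: four reads, of which only read1 has a significant hit.
hits = [
    "read1\tphageA\t45.0\t60\t33\t0\t1\t180\t10\t69\t1e-10\t62.0",
    "read2\tprotB\t30.0\t40\t28\t0\t1\t120\t5\t44\t0.5\t25.0",
]
fraction = known_fraction(["read1", "read2", "read3", "read4"], hits)
print(f"known: {fraction:.0%}, unknown: {1 - fraction:.0%}")
```

Reads with no hit at all (read3, read4) and reads whose best hit is above the threshold (read2) are both counted as unknown.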
GenBank has more than doubled since 2001 … 60 billion base pairs, 60 million sequences
GenBank has more than doubled since 2001 … but the fraction of unknowns remains constant • Edwards (2005) Nature Rev. Microbiol.
All of the new genes in the databases are coming from environmental sequences
Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?
Human-associated viruses • More bacteria than somatic cells by at least an order of magnitude • More phages than bacteria by an order of magnitude • Sample the bacteria in the intestine by sampling their phage
Most Viral DNA Sequences in Adult Human Feces are Unknown • Known 40% • Unknown 60% • TBLAST (E<0.001), 532 sequences • Of the known sequences: Phages 94%, Eukaryotic Viruses 6% • Breitbart (2003) J. Bacteriol.
Adults Versus Babies • No bacteria or viruses in 1st fecal sample • Abundant bacterial and viral communities by 1 week of age • >10⁸ VLP/g feces
Baby Feces Viruses • Most sequences are unknown (≈70%) • Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts • From microarray studies, sequences are stable in the baby over a 3 month period • Same types of phage as present in adult feces – one identical sequence to an unrelated adult!
DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses?
Most Human RNA Viruses are Known • Unknown 8% • Known 92% • TBLAST (E<0.001), ≈36,000 sequences • Pepper Mild Mottle Virus 65% • Other Plant Viruses 9% • Other 26% • Zhang (2006) PLoS Biology
Pepper Mild Mottle Virus (PMMV) • ssRNA virus; ≈6 kb genome • Related to Tobacco Mosaic Virus • Infects members of Capsicum family • Widely distributed – spread through seeds • Fruits are small, malformed, mottled • Rod-shaped virions • Images: viral particles in fecal sample; TOBACCO MOSAIC VIRUS http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/
PMMV is common in Human Feces • Fecal samples -> Extract total RNA -> RT-PCR for PMMV (gel lanes S1-S9) • San Diego: 78% of people are positive • Singapore: 67% of people are positive • 10-50 fold increase in feces compared to food • 10⁶-10⁹ PMMV copies per gram dry weight of feces
Which Foods Contain PMMV? • Chili powder • Chili sauces • NOT FOUND IN FRESH PEPPERS [Figure: food sample labels not recoverable]
Koch’s Postulates • Images: Thesunmachine.net, http://www.sweatnspice.com
Human microbial metagenome is more important than human genome
Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?
How do you sequence the environment? • Extract DNA • Create library • Sequence fragments • Everything so far is from 2004 • This is 40,000 sequences
How do you sequence the environment? • Extract DNA • Pyrosequence
454 Pyrosequencing • DNA extraction from environment (SDSU) • Whole genome amplification (SDSU) • Emulsion-based PCR (454 Inc.) • Luciferase-based sequencing (454 Inc.) • Margulies (2005) Nature
454 Sequence Data (Only from Rohwer Lab) • 21 libraries – 10 microbial, 11 phage • 597,340,328 bp total – 20% of the human genome – 50% of all complete and partial microbial genomes • 5,769,035 sequences – Average 274,716 per library • Average read length 103.5 bp – Average read length has not increased in 7 months
Growth of sequence data 600 million bp 6 million reads
Cost of sequencing • One reaction = $10,000 • One reaction = 250,000 reads • 250 reads = $10 • 1 read = 4¢ • 1 read = 100 bp • 1 bp = 0.04¢ ($400 per 1 x 1,000 bp) • 454 sequencing does not require cloning, arraying etc. • Sanger sequencing ca. $1/rxn, 0.2¢/bp – real cost ca. $5/rxn, 1¢/bp
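The per-read and per-base figures on this slide follow from the reaction cost by simple division. The sketch below reproduces that arithmetic using only the numbers quoted on the slide; the variable names are illustrative. At 0.04¢ per base, $400 buys one million bases.

```python
# Figures quoted on the slide.
reaction_cost_usd = 10_000
reads_per_reaction = 250_000
bp_per_read = 100
sanger_quoted_cents_per_bp = 0.2
sanger_real_cents_per_bp = 1.0

# Derived 454 costs.
cents_per_read = reaction_cost_usd * 100 / reads_per_reaction
cents_per_bp = cents_per_read / bp_per_read
usd_per_million_bp = cents_per_bp * 1_000_000 / 100

# How many times cheaper 454 is per base than Sanger.
quoted_ratio = sanger_quoted_cents_per_bp / cents_per_bp
real_ratio = sanger_real_cents_per_bp / cents_per_bp

print(f"{cents_per_read:.2f} cents per read")
print(f"{cents_per_bp:.2f} cents per bp")
print(f"${usd_per_million_bp:.0f} per million bp")
print(f"Sanger/454 per-base cost ratio: {quoted_ratio:.0f}x quoted, {real_ratio:.0f}x real")
```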
Bioinformatics • 597,340,328 bp total • 5,769,035 sequences • 7 months • Existing tools are not sufficient
Current Pipeline • http://phage.sdsu.edu/~rob/Pyrosequencing/ • Dereplicate • BLAST against – 16S – Complete phage – nr (SEED) – subsystems
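The slide does not say how reads are dereplicated. As a minimal sketch, the code below assumes the simplest reading: drop reads whose sequence exactly duplicates an earlier read. The function name `dereplicate` and the example reads are hypothetical, and the real pipeline may use a looser criterion.

```python
def dereplicate(records):
    """Keep the first record for each distinct sequence (case-insensitive).

    records: iterable of (name, sequence) pairs.
    Assumption: "duplicate" means an exact sequence match.
    """
    seen = set()
    unique = []
    for name, seq in records:
        key = seq.upper()
        if key not in seen:
            seen.add(key)
            unique.append((name, seq))
    return unique


# Hypothetical reads: r2 repeats r1 (in lower case), r3 is distinct.
reads = [("r1", "ACGTACGT"), ("r2", "acgtacgt"), ("r3", "TTGACCAA")]
unique_reads = dereplicate(reads)
print([name for name, _ in unique_reads])
```

Dereplicating before BLAST avoids searching the same sequence repeatedly.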
Sequencing is cheap and easy. Bioinformatics is neither.
Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?
The Soudan Mine, Minnesota • Red Stuff: oxidized • Black Stuff: reduced
Red and Black Samples Are Different • Cloned and 454-sequenced 16S are indistinguishable (figure labels: Black stuff, Red, Cloned)
Annotation of metagenomes by subsystems A subsystem is a group of genes that work together – Metabolism – Pathway – Cellular structures – Anything an annotator thinks is interesting
There are different amounts of metabolism in each environment
There are different amounts of substrates in each environment Red Stuff Black Stuff
But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI • Rodriguez-Brito (2006). In Review
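The resampling steps above can be sketched for a single subsystem as follows. This is one reading of the slide, not the published method. It assumes sampling with replacement, a percentile-based 95% interval, and that a subsystem is flagged when either site's median count falls outside the interval built from the combined samples. All function names and the example data are hypothetical, and the example uses smaller sample sizes than the slide so that it runs quickly.

```python
import random
from statistics import median


def resample_counts(proteins, subsystem, sample_size, repeats, rng):
    """Count `subsystem` in each of `repeats` samples drawn with replacement."""
    return [
        rng.choices(proteins, k=sample_size).count(subsystem)
        for _ in range(repeats)
    ]


def percentile_interval(values, level=0.95):
    """Return the central `level` interval of the resampled counts."""
    ordered = sorted(values)
    tail = (1 - level) / 2
    low = ordered[int(tail * (len(ordered) - 1))]
    high = ordered[int(round((1 - tail) * (len(ordered) - 1)))]
    return low, high


def compare_sites(site1, site2, subsystem, sample_size=10_000,
                  repeats=20_000, seed=0):
    """Compare per-site medians against the interval from the combined sample."""
    rng = random.Random(seed)
    med1 = median(resample_counts(site1, subsystem, sample_size, repeats, rng))
    med2 = median(resample_counts(site2, subsystem, sample_size, repeats, rng))
    pooled = resample_counts(site1 + site2, subsystem, sample_size, repeats, rng)
    low, high = percentile_interval(pooled)
    outside = not (low <= med1 <= high) or not (low <= med2 <= high)
    return {"median1": med1, "median2": med2, "ci": (low, high),
            "significant": outside}


# Hypothetical annotations: one subsystem label per protein.
red = ["siderophores"] * 30 + ["other"] * 970
black = ["siderophores"] * 100 + ["other"] * 900
result = compare_sites(red, black, "siderophores", sample_size=500, repeats=500)
print(result)
```

The defaults match the slide (10,000 proteins, 20,000 repeats). In pure Python that is slow, so a real analysis would likely vectorize the sampling.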
Examples of significantly different subsystems • Red Stuff: Arg, Trp, His; Ubiquinone; FA oxidation; Chemotaxis, Flagella; Methylglyoxal metabolism • Black Stuff: Ile, Leu, Val; Siderophores; Glycerolipids; NiFe hydrogenase; Phenylpropionate degradation
Subsystem differences & metabolism • Iron acquisition (Black Stuff) – Siderophore enterobactin biosynthesis – Ferric enterobactin transport – ABC transporter ferrichrome – ABC transporter heme • Black stuff: ferrous iron (Fe²⁺, ferroan [(Mg,Fe)₆(Si,Al)₄O₁₀(OH)₈]) • Red stuff: ferric iron (goethite [FeO(OH)])
Nitrification differentiates the samples Edwards (2006) In review
Not all biochemistry happens in a single organism • Anaerobic methane oxidation (Boetius et al., Nature, 2000): CH₄ + SO₄²⁻ -> HCO₃⁻ + HS⁻ + H₂O • Archaea: CH₄ + H₂O -> HCO₃⁻ + OH + H₂ -> CO₂ + H₂O • Bacteria: SO₄²⁻ + H₂O -> HS⁻ + OH + 2 O₂
The challenge is explaining the differences between samples • Red Sample: Arg, Trp, His; Ubiquinone; FA oxidation; Chemotaxis, Flagella; Methylglyoxal metabolism • Black Sample: Ile, Leu, Val; Siderophores; Glycerolipids; NiFe hydrogenase; Phenylpropionate degradation
We are moving away from "one organism, one reaction" and towards studying the biochemistry of whole environments • Bacteria don’t live alone
Summary From 454 sequence: – Identify microbial composition – Identify metabolic function – Identify statistically significant differences in metabolism – Who, what, why of microbial ecology
Sampling Sites • Marine – Near-shore water (~100 samples) – Off-shore water (~50 samples) – Near- and off-shore sediments • Metazoan associated – Corals – Fish – Human blood – Human stool • Freshwater – Aquifer – Glacial lake • Terrestrial/Soil – Amazon rainforest – Konza prairie – Joshua Tree desert • Singapore Air • Extreme – Hot springs (84 °C; 78 °C) – Soda lake (pH 13) – Solar saltern (>35% salt)
SDSU: Forest Rohwer, Mya Breitbart, Beltran Rodriguez-Brito • Rohwer Lab: Linda Wegley, Florent Angly, Matt Haynes • Also at SDSU: Anca Segall, Willow Segall, Stanley Maloy • Genome Institute of Singapore: Zhang Tao, Charlie Lee, Chia Lin Wei, Yijun Ruan • MIT: Ed DeLong • FIG: Veronika Vonstein, Ross Overbeek, Annotators • Math Guys@SDSU: Peter Salamon, Joe Mahaffy, James Nulton, Ben Felts, David Bangor, Steve Rayhawk, Jennifer Mueller • NSF: Biotic Surveys and Inventories, Biological Oceanography, Biocomplexity
Viral Community Structure • Contigs assembled from fragments with >= 98% identity over 20 bp are a resampling of a single phage genome • The contig spectrum is the number of contigs that have one sequence, the number that have two sequences, and so on • Use both analytical and Monte Carlo simulations to predict community structure from the contig spectrum • The Math Guys (2006) In preparation
• Determine the actual contig spectrum of the sample • Predict a contig spectrum using a species abundance model • Compute the error between the actual and predicted spectra • Adjust the parameters in the species abundance model to minimize the error • Continue this procedure until we obtain the smallest error (a global minimum of error versus model parameters)
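The fitting loop can be illustrated with a toy Monte Carlo model. This is a sketch under stated assumptions, not the method referred to on the slides. It assumes a power-law rank-abundance community, equal genome lengths of 50,000 bp, 100 bp reads, and that two reads from the same genotype join a contig when they overlap by at least 20 bp. Sequence identity is not modeled. The functions `simulate_spectrum`, `spectrum_error` and `fit_exponent` are hypothetical, and an exhaustive grid search stands in for the parameter adjustment step.

```python
import random
from collections import Counter


def simulate_spectrum(n_reads, n_genotypes, exponent, genome_len=50_000,
                      read_len=100, min_overlap=20, seed=0):
    """Monte Carlo contig spectrum for a power-law rank-abundance community."""
    rng = random.Random(seed)
    weights = [rank ** -exponent for rank in range(1, n_genotypes + 1)]
    starts = {}
    for genotype in rng.choices(range(n_genotypes), weights=weights, k=n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        starts.setdefault(genotype, []).append(start)
    sizes = Counter()
    for positions in starts.values():
        positions.sort()
        size = 1
        for prev, cur in zip(positions, positions[1:]):
            if cur - prev <= read_len - min_overlap:
                size += 1          # read overlaps the previous one: same contig
            else:
                sizes[size] += 1   # gap too large: close the current contig
                size = 1
        sizes[size] += 1
    return [sizes.get(q, 0) for q in range(1, max(sizes) + 1)]


def spectrum_error(observed, predicted):
    """Sum of squared differences, padding the shorter spectrum with zeros."""
    length = max(len(observed), len(predicted))
    obs = list(observed) + [0] * (length - len(observed))
    pred = list(predicted) + [0] * (length - len(predicted))
    return sum((o - p) ** 2 for o, p in zip(obs, pred))


def fit_exponent(observed, n_reads, n_genotypes, candidates):
    """Try every candidate exponent and return the one with the smallest error."""
    errors = {
        a: spectrum_error(observed, simulate_spectrum(n_reads, n_genotypes, a))
        for a in candidates
    }
    return min(errors, key=errors.get), errors


# Self-consistency check: fit a spectrum that was itself simulated with exponent 1.0.
observed = simulate_spectrum(1000, 200, 1.0)
best, errors = fit_exponent(observed, 1000, 200, [0.0, 0.5, 1.0, 1.5])
print(best, errors)
```

Because the check reuses the same random seed, it only shows that the loop is wired correctly. With real data the observed spectrum comes from an assembly, and the simulation noise would need to be averaged over several seeds.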
Viral Communities are Extremely Diverse (panels: Fecal, Seawater, Marine Sediments) • Lots of rare viral genotypes
Shannon-Wiener Index (figure): Sediment Viruses, Seawater Viruses, Fecal Viruses, Bacteria on Corals, Agriculture Soil Bacteria, Soil Nematodes, Cropland Earthworms, Rainforest Spiders, Amazon Fish, Rainforest Birds, Forest Mammals, River Bacteria, Temperate Forest Beetles, Forest Amphibians, Fossil Corals
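The Shannon-Wiener index used for this comparison is H' = -Σ pᵢ ln pᵢ, where pᵢ is the relative abundance of genotype or species i. A minimal sketch, using the natural logarithm and hypothetical abundance counts:

```python
import math


def shannon_index(abundances):
    """Shannon-Wiener index H' = -sum(p_i * ln p_i) over non-zero abundances."""
    total = sum(abundances)
    return -sum((n / total) * math.log(n / total) for n in abundances if n > 0)


# Hypothetical communities with the same number of types but different evenness.
even = shannon_index([25, 25, 25, 25])
skewed = shannon_index([97, 1, 1, 1])
print(f"even: {even:.3f}, skewed: {skewed:.3f}")
```

For a fixed number of types the index is highest when all are equally abundant, where it equals the logarithm of the number of types. A community with many rare genotypes therefore scores high.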