Challenges for metagenomic data analysis and lessons from viral metagenomes



Number of slides: 64

Challenges for metagenomic data analysis and lessons from viral metagenomes [What would you do if sequencing were free?] Rob Edwards • San Diego State University • Fellowship for Interpretation of Genomes • The Burnham Institute for Medical Research

Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?

Why Metagenomics? • What is there? • How many are there? • What are they doing?

How do you sequence the environment? • Extract DNA

CsCl step gradient • 1.1 g ml⁻¹ • 1.35 g ml⁻¹ • 1.7 g ml⁻¹

CsCl step gradient

How do you sequence the environment? • Extract DNA • Create library

Linker-Amplified Shotgun Libraries (LASLs) Soil Extraction Kit This method produces high coverage libraries of over 1 million clones from as little as 1 ng DNA David Mead - Breitbart (2002) PNAS

How do you sequence the environment? • Extract DNA • Create library • Sequence fragments

Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?

Why Phages? • Phages are viruses that infect bacteria – 10:1 ratio of phages:bacteria – 10³¹ phages on the planet • Specific interactions (probably) – one virus : one host • Small genome size – Higher coverage • Horizontal gene transfer – 10²⁵-10²⁸ bp DNA per year in the oceans

Uncultured Viruses • 200 liters water / 5-500 g fresh fecal matter • Concentrate and purify viruses • Epifluorescent microscopy • Extract nucleic acids (DNA/RNA) • LASL • Sequence

Bioinformatics • BLAST against NR – blastx, tblastn, tblastx • BLAST against boutique databases – Complete phage genomes, ACLAME, Other libraries, 16S • Parsing to present data in a useful format

BLAST and Parsing • http://phage.sdsu.edu/blast • Submit BLAST to local and remote databases – Local (as fast as possible) – NCBI (one search every 3 seconds) • Many concurrent searches – One search versus 1,000 searches • Parse data into tables – Access to taxonomy etc.
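
The "one search every 3 seconds" limit for remote submissions can be sketched as a simple throttle. This is a minimal illustration only: `Throttle`, `submit_all`, and the `submit` callable are hypothetical names, and nothing here reflects how the phage.sdsu.edu server or NCBI is actually called.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Block until at least min_interval has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def submit_all(queries, submit, throttle):
    """Send each query through `submit` (a hypothetical callable), spaced by the throttle."""
    results = []
    for query in queries:
        throttle.wait()
        results.append(submit(query))
    return results
```

A remote queue would use `Throttle(3.0)`, while a local queue can skip the throttle entirely, which matches the "as fast as possible" setting on the slide.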

Most Viral Genes are Unknown • Known 22%, Unknown 78% (TBLAST, E<0.001; 3,093 sequences) • Breitbart (2002) PNAS; Rohwer (2003) Cell

GenBank has more than doubled since 2001 … 60 billion base pairs, 60 million sequences

GenBank has more than doubled since 2001 … but the fraction of unknowns remains constant. Edwards (2005) Nature Rev. Microbiol.

All of the new genes in the databases are coming from environmental sequences

Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?

Human-associated viruses • More bacteria than somatic cells by at least an order of magnitude • More phages than bacteria by an order of magnitude • Sample the bacteria in the intestine by sampling their phage

Most Viral DNA Sequences in Adult Human Feces are Unknown • Known 40%, Unknown 60% (TBLAST, E<0.001; 532 sequences) • Phages 94%, Eukaryotic Viruses 6% • Breitbart (2003) J. Bacteriol.

Adults Versus Babies • No bacteria or viruses in 1st fecal sample • Abundant bacterial and viral communities by 1 week of age • >10⁸ VLP/g feces

Baby Feces Viruses • Most sequences are unknown (≈70%) • Similarities to phages from Lactococcus, Lactobacillus, Listeria, Streptococcus, and other Gram positive hosts • From microarray studies, sequences are stable in the baby over a 3 month period • Same types of phage as present in adult feces – one identical sequence to an unrelated adult!

DNA viruses in feces are phages. Feces ≠ intestines. RNA viruses?

Most Human RNA Viruses are Known • Known 92%, Unknown 8% (TBLAST, E<0.001; ≈36,000 sequences) • Pepper Mild Mottle Virus 65%, Other Plant Viruses 9%, Other 26% • Zhang (2006) PLoS Biology

Pepper Mild Mottle Virus (PMMV) • ssRNA virus; ≈6 kb genome • Related to Tobacco Mosaic Virus • Infects members of Capsicum family • Widely distributed – spread through seeds • Fruits are small, malformed, mottled • Rod-shaped virions [Images: viral particles in fecal sample; Tobacco Mosaic Virus, http://www.rothamsted.bbsrc.ac.uk/ppi/links/pplinks/virusems/]

PMMV is common in Human Feces • Fecal samples -> Extract total RNA -> RT-PCR for PMMV [Gel: lanes S1-S9, PMMV band] • San Diego: 78% of people are positive • Singapore: 67% of people are positive • 10-50 fold increase in feces compared to food • 10⁶-10⁹ PMMV copies per gram dry weight of feces

Which Foods Contain PMMV? • Chili powder • Chili sauces • NOT FOUND IN FRESH PEPPERS [Figure: RT-PCR lanes for tested foods]

Koch’s Postulates [Image credits: Thesunmachine.net, http://www.sweatnspice.com]

Human microbial metagenome is more important than human genome

Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?

How do you sequence the environment? • Extract DNA • Create library • Sequence fragments [Everything so far is from 2004. This is 40,000 sequences.]

How do you sequence the environment? • Extract DNA • Pyrosequence

454 Pyrosequencing • DNA extraction from environment (SDSU) • Whole genome amplification (SDSU) • Emulsion-based PCR (454 Inc.) • Luciferase-based sequencing (454 Inc.) • Margulies (2005) Nature

454 Sequence Data (Only from Rohwer Lab) • 21 libraries – 10 microbial, 11 phage • 597,340,328 bp total – 20% of the human genome – 50% of all complete and partial microbial genomes • 5,769,035 sequences – Average 274,716 per library • Average read length 103.5 bp – Av. read length has not increased in 7 months

Growth of sequence data 600 million bp 6 million reads

Cost of sequencing • One reaction = $10,000 • One reaction = 250,000 reads • 250 reads = $10 • 1 read = 4¢ • 1 read = 100 bp • 1 bp = 0.04¢ ($400 per 1 x 1,000,000 bp) • 454 sequencing does not require cloning, arraying etc. • Sanger sequencing ca. $1/rxn, 0.2¢/bp – real cost ca. $5/rxn, 1¢/bp
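
The per-read and per-base prices follow directly from the per-reaction price. A minimal sketch of that arithmetic, using only the slide's figures; the function names are illustrative, and the 500 bp Sanger read length is not stated on the slide but is implied by $1 per reaction at 0.2¢ per bp.

```python
def cost_per_unit(reaction_cost_usd, reads_per_reaction, bp_per_read):
    """Return (cents per read, cents per bp) for one sequencing reaction."""
    cents_per_read = 100.0 * reaction_cost_usd / reads_per_reaction
    cents_per_bp = cents_per_read / bp_per_read
    return cents_per_read, cents_per_bp


def bases_for_budget(budget_usd, cents_per_bp):
    """Number of bases a budget buys at a given per-base price."""
    return 100.0 * budget_usd / cents_per_bp


# 454: $10,000 per reaction, 250,000 reads per reaction, 100 bp per read.
pyro_read, pyro_bp = cost_per_unit(10_000, 250_000, 100)   # 4 cents/read, 0.04 cents/bp

# Sanger: one read per reaction; 500 bp per read is an inferred assumption.
sanger_read, sanger_bp = cost_per_unit(1, 1, 500)          # quoted price: 0.2 cents/bp
real_read, real_bp = cost_per_unit(5, 1, 500)              # "real" price: 1 cent/bp
```

At 0.04¢ per bp, `bases_for_budget(400, pyro_bp)` gives about one million bases, so $400 covers a 1 Mbp genome once.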

Bioinformatics • 597,340,328 bp total • 5,769,035 sequences • 7 months • Existing tools are not sufficient

Current Pipeline • http://phage.sdsu.edu/~rob/Pyrosequencing/ • Dereplicate • BLAST against – 16S – Complete phage – nr (SEED) – subsystems
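
The slide does not say how the dereplication step decides that two reads are duplicates. As a hedged illustration, the sketch below assumes the simplest rule, collapsing reads with identical sequences; the real pipeline may use a different criterion, and `dereplicate` is a hypothetical name.

```python
def dereplicate(reads):
    """Collapse exact duplicate sequences, keeping the first ID seen for each.

    reads: iterable of (read_id, sequence) pairs.
    Returns a list of (read_id, sequence, copies) in first-seen order.
    """
    first_id = {}   # sequence -> ID of the first read carrying it
    copies = {}     # sequence -> number of reads carrying it
    for read_id, seq in reads:
        key = seq.upper()            # treat lower- and upper-case bases as identical
        if key not in first_id:
            first_id[key] = read_id
            copies[key] = 0
        copies[key] += 1
    # dicts preserve insertion order, so output follows first appearance
    return [(first_id[key], key, copies[key]) for key in first_id]
```

Keeping the copy count lets later steps weight each unique read by how often it was observed.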

Sequencing is cheap and easy. Bioinformatics is neither.

Outline • How and why we sequence environments • Viral metagenomics – Marine stories – Human stories • Pyrosequencing – Mine story • Is there a Future?

The Soudan Mine, Minnesota • Red Stuff: Oxidized • Black Stuff: Reduced

Red and Black Samples Are Different • Cloned and 454 sequenced 16S are indistinguishable [Figure labels: Black stuff, Red, Cloned]

Annotation of metagenomes by subsystems A subsystem is a group of genes that work together – Metabolism – Pathway – Cellular structures – Anything an annotator thinks is interesting

There are different amounts of metabolism in each environment

There are different amounts of substrates in each environment Red Stuff Black Stuff

But are the differences significant? • Sample 10,000 proteins from site 1 • Count frequency of each subsystem • Repeat 20,000 times • Repeat for sample 2 • Combine both samples • Sample 10,000 proteins 20,000 times • Build 95% CI • Compare medians from sites 1 and 2 with 95% CI • Rodriguez-Brito (2006). In Review
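
The resampling steps above can be sketched as follows. This is one reading of the slide, not the Rodriguez-Brito implementation: it assumes sampling with replacement, and it assumes the 95% interval is built from the difference between two subsamples of the pooled data. All function names are hypothetical.

```python
import random
from collections import Counter
from statistics import median


def subsample_counts(proteins, n, rng):
    """Draw n subsystem labels with replacement and count each subsystem."""
    return Counter(rng.choices(proteins, k=n))


def median_counts(proteins, n, reps, rng):
    """Median count of each subsystem over `reps` subsamples of size n."""
    draws = [subsample_counts(proteins, n, rng) for _ in range(reps)]
    return {s: median(d[s] for d in draws) for s in set(proteins)}


def null_interval(pooled, n, reps, rng, level=0.95):
    """Per-subsystem interval for the count difference between two pooled subsamples."""
    names = set(pooled)
    diffs = {s: [] for s in names}
    for _ in range(reps):
        a = subsample_counts(pooled, n, rng)
        b = subsample_counts(pooled, n, rng)
        for s in names:
            diffs[s].append(a[s] - b[s])
    lo_i = int((1 - level) / 2 * reps)   # index of the lower percentile
    hi_i = reps - 1 - lo_i               # index of the upper percentile
    interval = {}
    for s, d in diffs.items():
        d.sort()
        interval[s] = (d[lo_i], d[hi_i])
    return interval


def significant_subsystems(site1, site2, n, reps, seed=0):
    """Subsystems whose median difference falls outside the pooled 95% interval."""
    rng = random.Random(seed)
    m1 = median_counts(site1, n, reps, rng)
    m2 = median_counts(site2, n, reps, rng)
    interval = null_interval(site1 + site2, n, reps, rng)
    result = {}
    for s, (lo, hi) in interval.items():
        diff = m1.get(s, 0) - m2.get(s, 0)
        if diff < lo or diff > hi:
            result[s] = diff
    return result
```

Each site is a list with one subsystem label per protein. The slide's settings correspond to `n=10_000` and `reps=20_000`; smaller values are enough to try the sketch.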

Examples of significantly different subsystems • Red Stuff: Arg, Trp, His; Ubiquinone; FA oxidation; Chemotaxis, Flagella; Methylglyoxal metabolism • Black Stuff: Ile, Leu, Val; Siderophores; Glycerolipids; NiFe hydrogenase; Phenylpropionate degradation

Subsystem differences & metabolism • Iron acquisition in Black Stuff: Siderophore enterobactin biosynthesis; ferric enterobactin transport; ABC transporter ferrichrome; ABC transporter heme • Black stuff: ferrous iron (Fe²⁺, ferroan [(Mg,Fe)₆(Si,Al)₄O₁₀(OH)₈]) • Red stuff: ferric iron (goethite [FeO(OH)])

Nitrification differentiates the samples Edwards (2006) In review

Not all biochemistry happens in a single organism • Anaerobic methane oxidation (Boetius et al., Nature, 2000): CH₄ + SO₄²⁻ -> HCO₃⁻ + HS⁻ + H₂O • Archaea: CH₄ + H₂O -> HCO₃⁻ + OH + H₂ -> CO₂ + H₂O • Bacteria: SO₄²⁻ + H₂O -> HS⁻ + OH + 2 O₂

The challenge is explaining the differences between samples • Red Sample: Arg, Trp, His; Ubiquinone; FA oxidation; Chemotaxis, Flagella; Methylglyoxal metabolism • Black Sample: Ile, Leu, Val; Siderophores; Glycerolipids; NiFe hydrogenase; Phenylpropionate degradation

We are moving away from one organism, one reaction and towards studying the biochemistry of whole environments. Bacteria don’t live alone.

Summary From 454 sequence: – Identify microbial composition – Identify metabolic function – Identify statistically significant differences in metabolism – Who, what, why of microbial ecology

Sampling Sites • Marine: Near-shore water (~100 samples); Off-shore water (~50 samples); Near- and off-shore sediments • Metazoan associated: Corals; Fish; Human blood; Human stool • Freshwater: Aquifer; Glacial lake • Terrestrial/Soil: Amazon rainforest; Konza prairie; Joshua Tree desert • Air: Singapore • Extreme: Hot springs (84 °C; 78 °C); Soda lake (pH 13); Solar saltern (>35% salt)

FIG: Veronika Vonstein, Ross Overbeek, Annotators • SDSU: Forest Rohwer, Mya Breitbart, Beltran Rodriguez-Brito • Rohwer Lab: Linda Wegley, Florent Angly, Matt Haynes • Also at SDSU: Anca Segall, Willow Segall, Stanley Maloy • Genome Institute of Singapore: Zhang Tao, Charlie Lee, Chia Lin Wei, Yijun Ruan • MIT: Ed DeLong • Math Guys@SDSU: Peter Salamon, Joe Mahaffy, James Nulton, Ben Felts, David Bangor, Steve Rayhawk, Jennifer Mueller • NSF: Biotic Surveys and Inventories; Biological Oceanography; Biocomplexity

Viral Community Structure • Contigs assembled from fragments with ≥98% identity over 20 bp are a resampling of a single phage genome • Contig spectrum is the number of contigs that have one sequence, the number that have two sequences, and so on • Use both analytical and Monte-Carlo simulations to predict community structure from contig spectrum • The Math Guys (2006) In preparation
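
The contig spectrum defined above is a tally of contig sizes. A minimal sketch, assuming the assembler output has already been reduced to the number of sequences in each contig, with unassembled reads counted as contigs of size 1; `contig_spectrum` is a hypothetical name.

```python
from collections import Counter


def contig_spectrum(contig_sizes, max_size=None):
    """Return [c1, c2, ...] where c_k is the number of contigs holding k sequences.

    contig_sizes: number of sequences in each contig (singletons have size 1).
    max_size: optional length of the returned spectrum; defaults to the largest contig.
    """
    tally = Counter(contig_sizes)
    top = max_size if max_size is not None else max(tally, default=0)
    return [tally.get(k, 0) for k in range(1, top + 1)]
```

For example, three singletons, two two-sequence contigs, and one four-sequence contig give the spectrum `[3, 2, 0, 1]`.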

• Determine the actual contig spectrum of the sample • Predict a contig spectrum using a species abundance model • Compute the error between the actual and predicted • Adjust the parameters in the species abundance model to minimize errors • Continue this procedure until we obtain the smallest error • Find the smallest error, a global minimum [Plot: Error versus Model parameters]
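
The fitting loop can be sketched with a toy Monte Carlo model. Everything specific here is an assumption for illustration rather than the Math Guys' method: genotypes follow a power-law rank-abundance curve, all genomes are linear and equally long, fragment starts are uniform, two fragments join when they overlap by at least `min_overlap` bases, and an exhaustive grid search stands in for the parameter adjustment step. All names are hypothetical.

```python
import random


def simulate_spectrum(n_genotypes, exponent, n_frags, genome_len, read_len,
                      min_overlap, max_size, rng):
    """One Monte Carlo draw of a contig spectrum under a power-law abundance model."""
    weights = [rank ** -exponent for rank in range(1, n_genotypes + 1)]
    starts = {}   # genotype index -> start positions of its fragments
    for g in rng.choices(range(n_genotypes), weights=weights, k=n_frags):
        starts.setdefault(g, []).append(rng.randrange(genome_len - read_len + 1))
    spectrum = [0] * max_size   # contigs larger than max_size go in the last bin
    for positions in starts.values():
        positions.sort()
        size = 1
        for prev, cur in zip(positions, positions[1:]):
            if cur - prev <= read_len - min_overlap:
                size += 1       # overlap is long enough: extend the contig
            else:
                spectrum[min(size, max_size) - 1] += 1
                size = 1
        spectrum[min(size, max_size) - 1] += 1
    return spectrum


def predicted_spectrum(params, reps, rng, **fixed):
    """Average simulated spectrum for one (n_genotypes, exponent) pair."""
    n_genotypes, exponent = params
    total = [0.0] * fixed["max_size"]
    for _ in range(reps):
        draw = simulate_spectrum(n_genotypes, exponent, rng=rng, **fixed)
        total = [t + d for t, d in zip(total, draw)]
    return [t / reps for t in total]


def squared_error(actual, predicted):
    """Sum of squared differences between two spectra."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted))


def fit(actual, grid, reps=20, seed=0, **fixed):
    """Return the parameter pair in `grid` whose predicted spectrum has the smallest error."""
    rng = random.Random(seed)
    scored = [(squared_error(actual, predicted_spectrum(p, reps, rng, **fixed)), p)
              for p in grid]
    return min(scored)[1]
```

Scoring every point on a grid finds the smallest error among the candidates tried and cannot stall in a local minimum, at the cost of more simulations than an iterative optimizer would need.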

Viral Communities are Extremely Diverse Fecal Seawater Marine Sediments Lots of rare viral genotypes

Shannon-Wiener Index [Figure: communities compared by Shannon-Wiener index: Sediment Viruses, Seawater Viruses, Fecal Viruses, Bacteria on Corals, Agriculture Soil Bacteria, Soil Nematodes, Cropland Earthworms, Rainforest Spiders, Amazon Fish, Rainforest Birds, Forest Mammals, River Bacteria, Temperate Forest Beetles, Forest Amphibians, Fossil Corals]
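
For reference, the Shannon-Wiener index of a community with relative abundances p_i is H = -Σ p_i ln p_i. A minimal sketch, assuming natural logarithms (the slide does not state the base) and a hypothetical function name:

```python
import math


def shannon_index(abundances):
    """Shannon-Wiener index H = -sum(p_i * ln p_i) from raw genotype abundances."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)
```

A community of k equally abundant genotypes has H = ln k, so many rare genotypes at similar abundance push the index up.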