Data Harvesting: automatic extraction of information necessary for the deposition of structures from protein crystallography
Martyn Winn
CCP4, Daresbury Laboratory

Progress of a PX project
1. Data Collection (synchrotron, home source)
2. Structure Solution (CCP4 etc.)
3. Structure Deposition (PDB @ RCSB, EBI)
4. Database Queries

PROTEIN DATABANK
- International repository for the processing and distribution of 3-D macromolecular structure data determined experimentally by X-ray crystallography and NMR.
- Data are deposited to the PDB at RCSB (U.S.) and EBI (U.K.).

USES OF PDB
- Retrieval of data for a single structure
- Global searches (e.g. for molecule name, particular cofactor, etc.)
- Generating statistics (e.g. structures vs. resolution)
- Derived databases (e.g. ReLiBase, SCOP/CATH)

Examples of deposited information
- Name of source organism
- Reference to sequence database entry
- Temperature of diffraction experiment
- Number of unique reflections
- Rmerge as a function of resolution
- Starting model for molecular replacement
- Restraints used in refinement
- Identification of secondary structure elements
- Atomic coordinates and structure factor amplitudes

HARVESTING CONCEPT
- Pioneered by the EBI deposition centre.
- Data Harvesting is a protocol for communicating relevant data from Software to Deposition Site.
- Why?
  - More reliable data
  - Richer database

HARVEST: Action
- The action of harvesting is entirely local.
- A run of a program captures all significant information produced by that run and stores it in a (date-stamped) file.
- The contents of the harvest file are controlled by the developers of the software being run and by the researcher running it.

HARVEST: File Format
- mmCIF has been selected as the format to represent harvest (deposition) data items.
- Several files are generated.
- mmCIF relationships are not necessarily maintained.
- The ‘TRUE’ final, complete mmCIF file is generated only after complete processing of a submission at the deposition site.

Identifying harvesting files
- Each run of a harvesting program produces a single file.
- Files are identified by Project Name and Dataset Name.

Project Name
- Project Name is the individual’s in-house laboratory code for a structure that will eventually be deposited.
- Equivalent to a PDB ID code or _entry.id
- E.g.
  - A new native structure
  - A mutant structure
  - A ligand-protein complex

Dataset Name
- Dataset Name is an individual’s code for each experiment carried out to solve the structure identified by a particular Project Name.
- Equivalent to _diffrn.id
- E.g.
  - Each wavelength in a MAD experiment
  - Each heavy-atom derivative
  - Each different NMR experiment carried out in the course of a structure determination

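A minimal sketch of how the two names could label a harvest file, assuming the layout seen in the SCALA example later in these slides: a data block named Project[Dataset], followed by `_entry.id`, `_diffrn.id` and `_audit.creation_date`. The function `harvest_header` is hypothetical and is not part of CCP4, libccif or harvlib.f; it also assumes that neither name contains whitespace.

```python
from datetime import datetime, timezone


def harvest_header(project_name, dataset_name, when=None):
    """Return the identifying lines of a harvest file as text.

    Hypothetical helper for illustration only. The layout is copied from
    the SCALA example in these slides, not from the harvlib.f source.
    """
    if when is None:
        when = datetime.now(timezone.utc)
    # For a timezone-aware UTC time, isoformat() yields a stamp such as
    # 1997-10-30T12:43:41+00:00
    stamp = when.replace(microsecond=0).isoformat()
    return "\n".join([
        f"data_{project_name}[{dataset_name}]",
        f"_entry.id              {project_name}",
        f"_diffrn.id             {dataset_name}",
        f"_audit.creation_date   {stamp}",
    ]) + "\n"


example = harvest_header(
    "phosphate_binding_protein",
    "A197C_chromophore_x",
    datetime(1997, 10, 30, 12, 43, 41, tzinfo=timezone.utc),
)
print(example)
```
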
Management of harvest files
- The CCP4 prototype uses a directory in $HOME to store harvest files, with file names of the form:
  $HOME/DepositFiles/PName/DName.ProgName_mode
- Files are sent to EBI at the time of deposition.
- Ultimately the individual research worker is responsible for the management of their own data files.

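The naming scheme above can be sketched in a few lines of Python. This is an illustration, not CCP4 code: `harvest_file_path` is a hypothetical name, the home directory is passed in explicitly, and because the slides do not say what values `mode` takes, the example uses the literal placeholder "MODE".

```python
from pathlib import PurePosixPath


def harvest_file_path(home, project_name, dataset_name, program_name, mode):
    """Build <home>/DepositFiles/PName/DName.ProgName_mode.

    Hypothetical helper; 'mode' is treated as an opaque string because
    its allowed values are not given in the slides.
    """
    file_name = f"{dataset_name}.{program_name}_{mode}"
    return PurePosixPath(home) / "DepositFiles" / project_name / file_name


path = harvest_file_path(
    "/home/user",                  # stands in for $HOME
    "phosphate_binding_protein",   # Project Name
    "A197C_chromophore_x",         # Dataset Name
    "scala",                       # program name
    "MODE",                        # placeholder, see note above
)
print(path)
```

Grouping by Project Name at the directory level means that all files for one deposition can be collected by sending a single directory.
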
HARVEST: Problems
- Management of harvesting files:
  - A structure may be solved by more than one user
  - A structure may be solved using different machines that are not NFS connected
  - More than one run: which run is FINAL?
- Scope of harvesting:
  - Need to persuade software authors to adopt the protocol
  - Still need manual addition/checking of information

Implementation in CCP4
- Harvesting files produced by:
  - [MOSFLM] (data processing)
  - SCALA / TRUNCATE (data reduction)
  - MLPHARE (phasing)
  - RESTRAIN / REFMAC (refinement)
- Associated libraries:
  - libccif: Peter Keller’s suite of routines to read and write mmCIF files
  - harvlib.f: Kim Henrick’s Fortran front end to libccif
- Public release: January 2000

Example: SCALA output (1)
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id                        phosphate_binding_protein
_diffrn.id                       A197C_chromophore_x
_audit.creation_date             1997-10-30T12:43:41+00:00
_software.classification         'data reduction'
_software.contact_author         'P.R. Evans'
_software.contact_author_email   [email protected] cam.ac.uk
_software.description            'scale together multiple observations of reflections'
_software.name                   Scala
_software.version                'CCP4_2.2.3 1/7/97'

Example: SCALA output (2)
_diffrn_reflns.d_res_low                  35.36
_diffrn_reflns.d_res_high                 3.00
_diffrn_reflns.number_measured_all        17986
_diffrn_reflns.number_unique_all          6645
_diffrn_reflns.number_centric_all         363
_diffrn_reflns.number_anomalous_all       2348
_diffrn_reflns.Rmerge_I_anomalous_all     0.050

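Output like the two SCALA slides above is simple enough to read back with a few lines of Python. The sketch below is not libccif and not a real mmCIF parser: it handles only one "name value" pair per line, ignores `loop_` constructs and multi-line text fields, and borrows shell-style quoting from `shlex`, which differs from the mmCIF grammar in edge cases. The function name `parse_harvest_items` is hypothetical, and the sample is a subset of the items shown above.

```python
import shlex

SAMPLE = """\
data_phosphate_binding_protein[A197C_chromophore_x]
_entry.id phosphate_binding_protein
_diffrn.id A197C_chromophore_x
_software.classification 'data reduction'
_software.name Scala
_diffrn_reflns.d_res_low 35.36
_diffrn_reflns.d_res_high 3.00
_diffrn_reflns.number_unique_all 6645
"""


def parse_harvest_items(text):
    """Parse one-per-line 'name value' items into (block_name, dict).

    Deliberately limited: values stay as strings, and anything that is
    not a simple two-token item line is skipped.
    """
    block_name = None
    items = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("data_"):
            block_name = line[len("data_"):]
            continue
        tokens = shlex.split(line)
        if len(tokens) == 2 and tokens[0].startswith("_"):
            items[tokens[0]] = tokens[1]
    return block_name, items


block, items = parse_harvest_items(SAMPLE)
print(block)
print(items["_software.classification"], items["_diffrn_reflns.d_res_high"])
```
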
User Input
- For each program run, the user can specify:
  - Project Name
  - Dataset Name
  - USECWD: write the harvest file to the current working directory rather than the deposit directory; useful for trial runs
  - NOHARVEST: do not write a harvest file

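The effect of these options on where a harvest file ends up can be sketched as follows. The names `HarvestOptions` and `harvest_directory` are hypothetical, and the rule that NOHARVEST takes precedence when both flags are set is an assumption, since the slides do not say how the programs resolve that case.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional


@dataclass
class HarvestOptions:
    project_name: str
    dataset_name: str
    usecwd: bool = False      # USECWD: write to the current working directory
    noharvest: bool = False   # NOHARVEST: write no harvest file at all


def harvest_directory(opts: HarvestOptions, home: str, cwd: str) -> Optional[PurePosixPath]:
    """Return the directory for the harvest file, or None if none is written."""
    if opts.noharvest:
        # Assumption: NOHARVEST wins if both flags are given.
        return None
    if opts.usecwd:
        return PurePosixPath(cwd)
    return PurePosixPath(home) / "DepositFiles" / opts.project_name


default_run = HarvestOptions("phosphate_binding_protein", "A197C_chromophore_x")
trial_run = HarvestOptions("phosphate_binding_protein", "A197C_chromophore_x", usecwd=True)
silent_run = HarvestOptions("phosphate_binding_protein", "A197C_chromophore_x", noharvest=True)

for run in (default_run, trial_run, silent_run):
    print(harvest_directory(run, home="/home/user", cwd="/scratch/trial"))
```
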
Automation
- All that a program needs to know is the Project Name and Dataset Name.
- This information is carried between programs in the header section of the reflection file (MTZ file).
- The information is written to the reflection file as soon as possible (ideally it is written to the image files and passed on).

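The hand-over between programs can be pictured with a toy model in which a plain dictionary stands in for the reflection-file header. This is only an illustration of the idea: the real MTZ header is not a Python dict, and `run_step` and the keys "project" and "dataset" are hypothetical names.

```python
def run_step(program, input_header, project_name=None, dataset_name=None):
    """Return (names used for harvesting, header for the output file).

    Names given explicitly for this run take priority; otherwise they are
    inherited from the header of the input reflection file.
    """
    names = {
        "project": project_name or input_header.get("project"),
        "dataset": dataset_name or input_header.get("dataset"),
    }
    if not all(names.values()):
        raise ValueError(f"{program}: Project Name and Dataset Name are required")
    # The output header carries the names forward to the next program.
    output_header = {**input_header, **names}
    return names, output_header


# The user supplies the names once, at the first step...
first_names, header_1 = run_step(
    "scala", {}, "phosphate_binding_protein", "A197C_chromophore_x"
)
# ...and later steps pick them up from the header without further input.
second_names, header_2 = run_step("truncate", header_1)
print(second_names)
```
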
Current status
- Harvesting software was released as part of CCP4 in January 2000. No harvesting files have been sent to EBI as yet (early days!).
- CNS also produces harvesting files, and some use has been made of these.
- There are plans to extend the concept to data from NMR and EM.

Acknowledgements
- Kim Henrick, Peter Keller (EBI)
- Eleanor Dodson, Phil Evans (CCP4)
- BBSRC

http://www.dl.ac.uk/CCP4/newsletter35/dataharvest.html
http://www.dl.ac.uk/CCP4/newsletter37/13_harvest.html