Benchmarking JChem Oracle and Instant-JChem (and more)
Tobias Kind, FiehnLab at UC Davis Genome Center, November 2006
Free academic licenses for JChem and Instant JChem provided by ChemAxon

ChemAxon product suite (source: chemaxon.com)
We have free academic licenses for all products.

Metabolomics @ Fiehnlab: The science of the small molecules
Compound classes:
• sugars
• amino acids
• steroids
• fatty acids
• lipids
• phospholipids
• organic acids ...
Figure: Molecules under investigation (shown with ChemAxon Marvin)
Figure: 3D model of a molecule with surface plot (shown with ChemAxon MarvinSpace)
Visit us @ fiehnlab.ucdavis.edu

Metabolomics is a truly emerging science ...
It tries to identify all small molecules (< 2000 Da) in all life forms in a comprehensive manner.
Life science tree:
• Genomics (DNA)
• Transcriptomics (RNA)
• Proteomics (Proteins)
• Metabolomics (Small Molecules)

Techniques and tools
• Analytical techniques (LC-MS, GC-MS, NMR, IR)
• Bioinformatics, cheminformatics
Figure labels: LTQ-FT-MS; Gas Chromatography GC-TOF-MS; Liquid Chromatography LC-MS; Bioinformatics and Cheminformatics; Statistics (Statistica Dataminer); Open Source + commercial software

We use cheminformatics tools for mass spectrometry based structure elucidation.
See our BMC Bioinformatics paper: Metabolomic database annotations via query of elemental compositions: Mass accuracy is insufficient even at less than 1 ppm; http://www.biomedcentral.com/1471-2105/7/234

What are JChem and Instant-JChem?
JChem and Instant JChem are cheminformatics tools for handling small molecule structures together with substance data (logP, fingerprint, pKa, toxicity, meta-information), plus searches, filters, web connections and more.
Difference: JChem = complex package; Instant-JChem = one single tool.
Figure: JChem and Instant-JChem (picture: ChemAxon)

Benchmarking Instant-JChem and JChem Oracle (and more)
Myth 1: JChem+Oracle is faster than Instant-JChem+Apache Derby. Reality: let's see ...
Myth 2: JAVA is slow. Reality: It's fast (70% of C++).
Myth 3: Old Intel Xeons (Netburst) are slow. Reality: Yes.
Myth 4: Oracle is a hassle-free and handy DB for beginners. Reality:
Myth 6: 2 CPUs are better than one. Reality: Yes.
Myth 7: Comparing apples with oranges (in Germany: pears) is unfair. Reality: c'mon ...
Only the first myth is left.

A bit of Oracle Reality
Figure: Happy Oracle Ace (paid 10 K for certificate)
Figure: 1st time Oracle user
Oracle works, and lots of people have invested lots of money (ORCL market cap = 92 billion dollars). It's good for large data (TByte); it's overkill for a small DB.
If you plan to install it on your production workstation (a big no-no):
• It will eat 600-800 MB of your valuable RAM (for nothing, on WINXP 32-bit)
• It will create 15,049 files in 2,029 folders (for what?)
• It will create a lot of hassle with certain network setups (DHCP)
• RTFM (read the … manual) is no joke, and you need to learn SQL (try the free Aqua Data Studio)
• Complete learning will take you 1-2 years, but gives you extreme flexibility
If you plan to install JChem + Oracle you need:
• JChem (includes the cartridge for Oracle)
• Oracle
• Apache Tomcat
• 1-2 days of time (the ChemAxon documentation is good, but too many things can go wrong with Oracle)

A bit of Instant JChem Reality (v1.0)
A) Download from http://www.chemaxon.com/instantjchem/
B) Install
C) It runs instantly
• built-in Apache Derby DB
• JAVA engine included
• complete JChem included
• out-of-the-box tool
• can connect to other DBs

Importing Structures into Instant JChem
During import in Instant JChem only one CPU works. The fingerprint calculation is probably not multi-threaded. (Solution: work pool = make a pool of workers for n CPUs.)
Short import time is critical for user convenience, but not for long-term database projects.

Importing Structures into Instant-JChem: influence of the JAVA HotSpot compiler
The JAVA VM runs in two modes: with the client compiler and with the server compiler (separate directories under the JRE).
If you run any calculation-intensive programs, always use server mode; in a batch file call java -server XYZ
Figure labels: Good and fast; Bad and slow

Influence of the JAVA HotSpot compiler: importing structures into Instant-JChem
Import of 250 k structures (NCI99.smi) into Instant-JChem: the server JVM is 20% faster!
(Chart: lower is better)
Test system: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID 5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1 GByte/s transfer)

Influence of the JAVA HotSpot compiler with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results.
JAVA server mode: 15 seconds (30% faster)
JAVA client mode: 21 seconds
SMILES: NC1=CC=NC2=C1C=CC(Cl)=C2
If you want to speed up this query, you need to pre-calculate all descriptors and store them in the database.
Lipinski Rule of 5 (http://en.wikipedia.org/wiki/Lipinski's_Rule_of_Five):
(mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10)
(acceptor count for C and H)
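The filter expression above is a plain conjunction of four thresholds, so once the descriptors are pre-calculated the check itself is trivial. A minimal Python sketch, assuming the descriptor values are already available; the `Descriptors` container and the sample values are hypothetical and only illustrate the logic, they are not JChem API calls or measured data:

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Pre-calculated values for one compound (hypothetical container)."""
    mass: float          # molecular mass in Da
    logp: float          # calculated logP
    donor_count: int     # H-bond donor count
    acceptor_count: int  # H-bond acceptor count


def passes_lipinski(d: Descriptors) -> bool:
    """Same thresholds as the slide's expression:
    (mass() <= 500) && (logP() <= 5) && (donorCount() <= 5) && (acceptorCount() <= 10)
    """
    return (d.mass <= 500 and d.logp <= 5
            and d.donor_count <= 5 and d.acceptor_count <= 10)


# Illustrative values only, not taken from the benchmark.
small = Descriptors(mass=180.0, logp=2.0, donor_count=1, acceptor_count=2)
large = Descriptors(mass=812.0, logp=6.3, donor_count=7, acceptor_count=14)
print(passes_lipinski(small), passes_lipinski(large))
```

With stored descriptors the query reduces to four comparisons per row, which is why pre-calculation pays off compared with computing logP on the fly.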

Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results.
                     2 CPUs        1 CPU
JAVA server mode:    15 seconds    33 seconds
JAVA client mode:    21 seconds    44 seconds
Doing the Lipinski calculation utilizes both CPUs! Try Intel Quad! Try Opteron 8x!
Test system: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate); ARECA-1120 RAID 5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)

Influence of number of CPUs with Instant-JChem
Task: Search for a substructure in a 3 million compound database and calculate the Lipinski Rule of 5 on all the 4632 results (on the fly).
1 CPU (1 x 2.8 GHz)*     33 seconds
2 CPUs (2 x 2.8 GHz)*    15 seconds
8 CPUs (2 GHz)**         4 seconds
Doing the Lipinski calculation utilizes multiple CPU cores! However, a single logP calculation depends on CPU speed, not on the number of CPU cores.
Use AMD Opteron 8x CPU systems (or better). For cheaper setups use Intel Core 2 Quad (QX6700).
Test system*: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron 870 2.0 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space
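To put the timings above into numbers, speed-up and parallel efficiency can be computed directly. A small Python sketch using the three measured times from this slide; the function names are made up for illustration, and the 8-CPU figure comes from a different machine with a lower clock, so that ratio is only a rough indication:

```python
def speedup(t_one_cpu_s: float, t_n_cpus_s: float) -> float:
    """How many times faster the n-CPU run is than the 1-CPU run."""
    return t_one_cpu_s / t_n_cpus_s


def efficiency(t_one_cpu_s: float, t_n_cpus_s: float, n_cpus: int) -> float:
    """Speed-up divided by CPU count; 1.0 would be ideal linear scaling."""
    return speedup(t_one_cpu_s, t_n_cpus_s) / n_cpus


T_ONE_CPU = 33.0  # seconds, 1 CPU at 2.8 GHz (from the slide)
for n_cpus, seconds in [(2, 15.0), (8, 4.0)]:
    s = speedup(T_ONE_CPU, seconds)
    e = efficiency(T_ONE_CPU, seconds, n_cpus)
    print(f"{n_cpus} CPUs: speed-up {s:.2f}x, efficiency {e:.2f}")
```

An efficiency near 1 means the on-the-fly calculation scales almost linearly with the number of CPUs for this task.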

Influence of number of CPUs on complex calculations with Instant-JChem
Task: Search in 1000 compounds from PubChem-1000-demo and calculate on the fly:
Filter                Hits   1 CPU (1 x 2.8 GHz)*   2 CPUs (2 x 2.8 GHz)*   8 CPUs (2 GHz)**
Bioavailability       832    30 s                   17 s                    7.5 s
Ghose filter          255    14 s                   8 s                     4.4 s
Lead likeness         531    53 s                   25 s                    9.8 s
Lipinski rule of 5    776    15 s                   7.5 s                   4.7 s
Muegge filter         277    7.5 s                  4.2 s                   3.4 s
Veber filter          774    1.7 s                  1.5 s                   2.5 s
Take home message: The more complex the request, the more CPUs you need. The lead likeness has 7 filters and reaches a 5-8 times speed-up with more CPUs.
Test system*: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer rate)
Test system**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB set for JAVA heap space

Scaling complex calculations to larger DBs with Instant-JChem
Task: Now search in 250,000 compounds from NCI 2000 and calculate on the fly:
Filter                Hits      Direct query   Calculation, 8 CPUs** (2 GHz)   Extrapolated time from 1000-compound DB   Obtained speed-up
Bioavailability       227,997   <1 s           380 s                           2055 s                                    5
Ghose filter          160,047   <1 s           230 s                           2762 s                                    12
Lead likeness         159,656   <1 s           1255 s                          2947 s                                    2
Lipinski rule of 5    199,821   <1 s           176 s                           1210 s                                    7
Muegge filter         145,234   <1 s           299 s                           1783 s                                    6
Veber filter          215,377   <1 s           20 s                            696 s                                     35
Take home message: Do not extrapolate calculation times from different or smaller DBs. The speed-ups here are 2-35 times larger than expected. Pre-calculate values once, store them in the DB and query the values later.
Test system**: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 3.5 GB max set for JAVA heap space; 1.5 GByte JAVA heap space used
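The extrapolated column is numerically consistent with scaling the 8-CPU time per hit from the 1000-compound run up to the hit count in the 250,000-compound run. That reading is an inference from the numbers, not something the slide states. A Python sketch under that assumption, with hypothetical function and variable names and three of the six filters as input:

```python
def extrapolate_seconds(t_small_s: float, hits_small: int, hits_large: int) -> float:
    """Linear per-hit extrapolation from a small DB to a large one."""
    return t_small_s / hits_small * hits_large


# filter: (8-CPU time on 1000-compound DB in s, hits there,
#          hits in 250,000-compound DB, measured 8-CPU time there in s)
rows = {
    "Bioavailability": (7.5, 832, 227_997, 380),
    "Ghose filter": (4.4, 255, 160_047, 230),
    "Veber filter": (2.5, 774, 215_377, 20),
}

for name, (t_small, hits_small, hits_large, measured) in rows.items():
    estimate = extrapolate_seconds(t_small, hits_small, hits_large)
    print(f"{name}: extrapolated {estimate:.0f} s, measured {measured} s, "
          f"ratio {estimate / measured:.0f}")
```

The large gap between estimate and measurement is the point of the slide: fixed start-up costs dominate the tiny 1000-compound run, so its per-hit time overestimates the large run badly.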

Derby database file sizes for Instant-JChem+Apache Derby (compounds only)
100 k structures     ~30 MByte
1 Mio structures     ~300 MByte
10 Mio structures    ~3 GByte
20 Mio structures    ~6 GByte
If you have dual or quad cores, turn drive compression on. You can save almost 50% space, and the speed overhead is low.
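The sizes above scale linearly at roughly 300 bytes per structure (~30 MByte for 100 k structures). A rough Python estimator based on that figure; the constant, the function name and the flat 50% compression factor are assumptions read off this slide, so treat the result as a ballpark only:

```python
BYTES_PER_STRUCTURE = 300   # assumption: ~30 MByte per 100 k structures
COMPRESSION_FACTOR = 0.5    # assumption: "almost 50%" saved, taken as an upper bound


def estimate_derby_size_mbyte(n_structures: int, compressed: bool = False) -> float:
    """Ballpark Derby file size in MByte (10^6 bytes) for compounds only."""
    size_mbyte = n_structures * BYTES_PER_STRUCTURE / 1_000_000
    return size_mbyte * COMPRESSION_FACTOR if compressed else size_mbyte


for n in (100_000, 1_000_000, 10_000_000, 20_000_000):
    print(f"{n:>10} structures: ~{estimate_derby_size_mbyte(n):.0f} MByte, "
          f"~{estimate_derby_size_mbyte(n, compressed=True):.0f} MByte compressed")
```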

Instant-JChem on disk-based and RAMDISK-based systems
People who said the OS has efficient disk caching lied. A large RAMDISK can speed up your system extremely.
A) If you have money: buy a solid state disk. RAMSAN-400; 128 GByte; price $252,720; 3,000 MB/s random sustained external throughput.
B) If you have some money: buy a RAID 5 card. ARECA ARC-1120 for 8 HDs; price $500; 200-400 MB/s read and write access.
C) If you have little money: buy a RAMDISK and stuff in as much RAM as possible (take a 64-bit OS); 500-1000 MB/s read and write access.
... a normal hard drive has a transfer rate of ~30-50 MB/s.

Instant-JChem on disk-based and RAMDISK-based systems
A) Heap memory max 800 MByte (OK)
Load 3 Mio compound DB from RAMDISK: 2 seconds
Load 3 Mio compound DB from RAID 5 disk: 11 seconds (factor 5)
Search substructure from RAMDISK DB: instant (memory buffered)
Search substructure from RAID 5 DB: instant (memory buffered)
B) Heap memory max 200 MByte (too low)
Load 3 Mio compound DB from RAMDISK: 19 seconds
Load 3 Mio compound DB from RAID 5 disk: 25 seconds (factor 1.3)
Search substructure from RAMDISK DB: 22 seconds
Search substructure from RAID 5 DB: 38 seconds (factor 1.7)
Take home message: Give JAVA (JChem) as much heap memory as you can. For 3 million structures you need a minimum of 300 MByte heap space.
Too little heap memory means performance degradation, because everything must be read from disk. My RAID 5 is already extremely fast, yet the RAMDISK is faster still.

JChem+Oracle DB on Xeon vs. Instant-JChem+Apache Derby DB on Opteron (apples vs. oranges)
Task: Import and index 3 million compounds (NCI 2000 duplicated to 3 Mio)
3 GHz Dual Xeon with 2 GB system memory, JChem+Oracle DB: 5801 seconds (97 minutes)
2.8 GHz Dual Opteron with 2.88 GB memory, Instant-JChem+Apache Derby: 5333 seconds (89 minutes)
Take home message: A (modest) modern computer can handle JChem and Instant-JChem, and a local database can be faster than a remote database.
Source of Xeon data: Oracle Cartridge Benchmark, http://www.chemaxon.com/jchem/FAQ.html#benchmark3

Instant-JChem+Apache Derby DB on Socrates* vs. Instant-JChem+Apache Derby DB on Dual Opteron 2.8 GHz (WIN-XP)** vs. JChem+Oracle DB on Dual Xeon 3 GHz (W2003 Server)*** (more apples vs. oranges)
Task: Search for substructures in a 3 million compound database (NCI 2000 x 12)
Query (SMILES)                           # Hits    Instant-JChem+Derby*   Instant-JChem+Derby**   JChem+Oracle***
O=C1ONC(N1c2ccccc2)c3ccccc3              204       0                      0                       0
[#6]-c1cc(-[#6])nc(NS(=O)c2ccccc2)n1     1224      0                      0                       0
c1ncc2ncnc2n1                            65,208    2 s                    7 s                     14 s
Clc1ccccc1                               274,608   5 s                    15 s                    43 s
O=Cc1ccccc1                              443,580   9 s                    28 s                    85 s
C1CN1c2cnnc3c(cncc23)C4=CSC=C4           0         0
Take home message: Instant-JChem is fast (nothing more).
Source: Instant-JChem (own system), JChem (ChemAxon website)
Socrates*: 4 x Dual-Core Opteron 870 2 GHz; CentOS 64-bit; 32 GByte RAM; 4 GB set for JAVA
Opteron**: Dual Opteron 254 (2.8 GHz); WINXP 32-bit; 2.88 GByte RAM (10 GByte/s transfer); ARECA-1120 RAID 5 (read/write 200 MByte/s and burst rate 500 MByte/s); QSOFT Ramdisk Enterprise 1.2 GByte (read/write 1000 MByte/s transfer)
Xeon***: Dual Intel Xeon 3 GHz; 2 GB memory; 160 GB IDE hard drive; Windows 2003 5.2; Oracle 9.2.0.7.0; DB buffer 1 GB; Java 1.5.0_06-b05; Apache Tomcat/5.5.12

A 20 million compound DB with Instant-JChem in a local Derby DB (WinXP 32-bit)
• Import is heavily disk dependent
• several hundred million read/write operations to disk (JAVA writes in 4 KB chunks)
• JAVA heap space used during import is around 600 MByte
• import time is no longer linear
• WIN XP 32-bit + NTFS desperately try to cache the 6 GByte database file, even though only 3 GByte of memory is available at most (1 GByte max for cache)
• index creation (import SMILES): 20 h (too long)
• open index for search: 1 min
• substructure search: > 1 min (too long)
• 20 Mio is currently too large for Instant-JChem v1.0; use JChem+Oracle (or MySQL, MS SQL)
• Aim: full PubChem data (15-20 Mio) locally

Some general JAVA + JChem speed advice
1. Always use the server JVM (check the directories bin\client and bin\server); check the batch or sh file options for java -server -jar xyz.jar
2. Use 64-bit systems; the JAVA maximum heap space on a 32-bit LINUX or WIN system is only 1.6 GByte (-Xmx1600m)
3. Use only multi-core machines (AMD Opterons, Intel Quad)
4. Use the fastest disks you can buy (WD Raptor), or use RAID 5 or RAID 6 for large files (PubChem SDF data for 5 Mio compounds = 30 GByte)
5. Give Instant-JChem as much memory as you have; a minimum of 500 MByte for extreme speed (no wait time for searches)
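Points 1 and 2 come down to two standard HotSpot options on the java command line: -server selects the server compiler and -Xmx<size> sets the maximum heap. A small Python sketch that assembles such a command line; the helper name and the jar name xyz.jar are placeholders, and the script only builds and prints the command rather than launching anything:

```python
from typing import List


def build_java_command(jar_path: str, max_heap_mbyte: int, server: bool = True) -> List[str]:
    """Assemble a java launch command with server mode and a maximum heap size."""
    cmd = ["java"]
    if server:
        cmd.append("-server")              # select the HotSpot server compiler
    cmd.append(f"-Xmx{max_heap_mbyte}m")   # maximum heap, e.g. -Xmx1600m
    cmd += ["-jar", jar_path]
    return cmd


# xyz.jar is a placeholder; 1600 MByte is the 32-bit limit named on the slide.
print(" ".join(build_java_command("xyz.jar", 1600)))
```

Returning the command as a list keeps it ready to pass to `subprocess.run` without shell quoting issues, should one want to launch the JVM from a script.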

Let's not forget competitors
Many good systems exist: MDL (ISIS Base), ACDLabs (ACD/ChemFolder Enterprise), Tripos (Sybyl+Auspyx), Molecular Networks (Carol), CDK and Taverna, Accelrys (Accord), Daylight (Thor and Merlin), CambridgeSoft (ChemOffice Enterprise), Molsoft (ICM+MolCart)
Why is ChemAxon better? Two reasons:
• The programs work under WINDOWS and LINUX
• ChemAxon has the best and most responsive public forum: criticism is taken seriously, requested features are implemented ASAP, and there is a public response within 1-3 days.
WHY? Many commercial licensees. Remember, for academics it is all free.

Results and conclusion: JChem Oracle vs. Instant-JChem
1. Instant-JChem+Derby is as fast as or faster than JChem+Oracle for DBs < 3 Mio
2. If you want to have fun and results at your fingertips: Instant-JChem
3. If you want extreme flexibility and you know JAVA+SQL: JChem-Oracle
4. We are far away from handling billions of structures in a DB (with modest effort). We will handle such large numbers of structures file-stream based with cluster support.
5. Software producers (in general) need to put more effort into software development for multi-core CPUs + clusters under Windows and LINUX.