2b8d6076d8601274218c639cbdd1985b.ppt
- Number of slides: 46
Grids and Software Engineering Test Platforms
Alberto Di Meglio, CERN
www.eu-etics.org
INFSOM-RI-026753
Contents
• Setting the Context
• A “Typical” Grid Environment
• Challenges
• Test Requirements
• Methodologies
• The Grid as a Test Tool
• Conclusions
• Panel discussion on Grid QA and industrial applications
INFSOM-RI-026753 Grid School of Computing - 13 July 2006 - Ischia
Setting the Context
• What is a distributed environment?
• The main characteristic of a distributed environment that affects how tests are performed is:
– Many things happen at all times in the same or different places and can have direct or indirect, and often unpredictable, effects on each other
• The main goal of this discussion is to show the consequences of this for testing the grid, and how the grid can (must) be used as a tool to test itself and the software running on it
A “Typical” Grid Environment
[Diagram. Labels: JSDL, UNICORE, Condor, PBS, LSF, DGAS, DPM, SRM 2.1, dCache, SRM 2.0, Castor]
Challenges
• Non-determinism
• Infrastructure dependencies
• Distributed and partial failures
• Time-outs
• Dynamic nature of the structure
• Lack of mature standards (interoperability)
• Multiple heterogeneous platforms
• Security
Non-determinism
• Distributed systems like the grid are inherently non-deterministic
• Noise is introduced in many places (OS schedulers, network time-outs, process synchronization, race conditions, etc.)
• Changes in the infrastructure that are not controlled by a test affect both the test and the sequence of tests
• It is difficult to reproduce a test run exactly
Infrastructure dependencies
• Operating systems and third-party applications interact with the objects to be tested
• Different versions of OSs and applications may behave differently
• Software updates (especially security patches) cannot be avoided
• Network topologies and boundaries may be under someone else's control (routers, firewalls, proxies)
Distributed and Partial Failures
• In a distributed system, failures are distributed too
• A test or sequence of tests may fail because part of the system (a node, a service) fails or is unavailable
• The cause can be anything: hardware, software, local network policy changes, power failures
• In addition, since such failures are expected, middleware and applications should cope with them, and their behaviour under failure should itself be tested
Time-outs
• Time-outs are not necessarily due to a failure; they can also be caused by excessive load
• They may be infrastructure-related (network), system-related (OS, service containers) or application-related
• Services may react differently when time-outs occur: they may simply fail, raise exceptions, or apply retry strategies
• Time-outs have consequences for the test sequence (non-determinism again)
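The retry behaviour mentioned above can be made testable with a small fake service. The following is a minimal sketch, not taken from the talk: `ServiceTimeout`, `call_with_retries` and `make_flaky_service` are hypothetical names, and the exponential back-off is just one possible retry strategy.

```python
import time


class ServiceTimeout(Exception):
    """Raised by the hypothetical service call when it does not answer in time."""


def call_with_retries(call, attempts=3, base_delay=0.01, sleep=time.sleep):
    """Call `call()` up to `attempts` times, backing off exponentially between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except ServiceTimeout:
            if attempt == attempts:
                raise  # retry budget exhausted: surface the time-out to the test
            sleep(base_delay * 2 ** (attempt - 1))


def make_flaky_service(failures):
    """Return a fake service that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def service():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ServiceTimeout(f"no answer on call {state['calls']}")
        return "ok"

    return service, state
```

Passing `sleep=lambda seconds: None` keeps such a test fast and independent of wall-clock time, which removes one source of the non-determinism discussed earlier.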
Dynamic nature of the structure
• The type and number of actors and objects participating in the workflow change with time and location (concurrent users, different processes on the same machine, different machines across the infrastructure)
• Middleware and applications may dynamically (re)configure themselves depending on local or remote conditions (for example load balancing or service failover)
• Actual execution paths may change with load conditions
• How to reproduce and track such configurations?
Moving Standards
• Missing or rapidly changing standards make it difficult for grid services to interoperate
• Service-oriented architectures should make life easier, but which standard should be adopted?
• Failures may be due to incorrect, incomplete or incompatible implementations
• Ex 1: plain web services, WSRF, WS-*?
• Ex 2: Axis (J/C), gSOAP, GridSite, ZSI?
• Ex 3: SRM, JSDL
• How to test the potential combinations?
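One way to approach the question of combinations is to enumerate them explicitly so that each one becomes a test configuration. The sketch below is only an illustration: the two dimensions reuse the names from the examples above, while `build_matrix` and its `exclude` parameter are hypothetical helpers, not part of any tool mentioned in the talk.

```python
from itertools import product

# Illustrative values taken from the slide; a real matrix would come from the test plan.
SERVICE_STYLES = ["plain web services", "WSRF", "WS-*"]
TOOLKITS = ["axis", "gsoap", "gridsite", "zsi"]


def build_matrix(*dimensions, exclude=()):
    """Return every combination of the given dimensions, minus the excluded ones."""
    excluded = set(exclude)
    return [combo for combo in product(*dimensions) if combo not in excluded]


# Three styles times four toolkits gives twelve configurations to test.
matrix = build_matrix(SERVICE_STYLES, TOOLKITS)
```

The number of configurations grows multiplicatively with every added dimension (platform, security mode, protocol version), which is why combinations known to be unsupported are worth excluding up front.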
Multiple Heterogeneous Platforms
• Distributed software, especially grid software, runs on a variety of platforms (combinations of OS, architecture and compilers)
• Software is often written on a specific platform and only later ported to other platforms
• OS and third-party dependencies may change across platforms in version and type
• Different compilers usually do not compile the same code in the same way (if at all)
Security
• Security and security testing are huge issues
• Sometimes there is a tendency to consider security an add-on to the middleware or applications
• Software can behave in completely different ways with and without security for the same functionality
• Ex: consider a web service running over HTTP or HTTPS, with or without client certificates
• Sometimes software is developed on individual machines without taking into account the constraints imposed by secure network infrastructures
Test Requirements
Where to start from?
• Test plans
• Life-cycle testing
• Reproducibility
• Archival and analysis
• Interactive vs. automated testing
Test Plans
• Test plans should be the mandatory starting point of all test activities. This point is often neglected
• Writing one is a difficult task
• You need to understand your system, and the environment where it must be deployed, thoroughly
• You need to spell out clearly what you want to test, how, and what the expected results are
• Write it together with domain experts to make sure as many system components and interactions as possible are taken into account
• Revise it often
Life-cycle Testing
• When designing the test plan, don’t think only about functionality, but also about how the system will have to be deployed and maintained
• Start with the explicit design of installation, configuration and upgrade tests: it is easy to see that a large part of the bugs of a system fall into the installation and configuration category
[Figure: gLite bug categories]
Reproducibility
• This requirement addresses the issue of non-determinism
• Invest in tools and processes that make your tests and your test environment reproducible
• Install your machines using scripts or system management tools, but disable automated APT/YUM/up2date updates
• Store the tests together with all the information needed to run them (environment variables, properties, support files, etc.) and use version control tools to keep the tests in sync with software releases
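Storing a test together with the information needed to rerun it can start with a small manifest written next to the test and committed with it. This is a minimal sketch under stated assumptions: the field names, `build_manifest`, `save_manifest` and the default `env_keys` are all hypothetical, chosen only for illustration.

```python
import json
import os
import platform
import sys


def build_manifest(test_name, release, env_keys=("PATH", "LANG")):
    """Collect the information needed to rerun a test in the same environment."""
    return {
        "test": test_name,
        "software_release": release,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        # Record only the variables the test depends on; missing ones are stored as None.
        "environment": {key: os.environ.get(key) for key in env_keys},
    }


def save_manifest(manifest, path):
    """Write the manifest as sorted, indented JSON so diffs stay readable under version control."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
```

Keeping the manifest in the same repository and tagged with the same release as the software is what ties a test run back to the exact version it was written for.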
Reproducibility (2)
• Resist the temptation to do too much debugging on your test machines (are testers supposed to do that?)
• If you can afford it, think of using parallel testbeds for test runs and debugging
• Try to write a regression test immediately after a problem is found, record it in the test or bug tracking system and feed it back to the developers
• Then rebuild the machine from scratch and restart
Archival and Analysis
• Archive as much information as possible about your tests (output, errors, logs, files, build artifacts, even an image of the machine itself if necessary)
• If possible use a standard test output schema (the xUnit schema is quite standard and can be used for many languages and for unit, functional and regression tests)
• Using a common schema helps in correlating results, creating test hierarchies and performing trend analysis (performance and stress tests)
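As an illustration of a common output schema, the sketch below renders test results in the JUnit-style XML layout that many xUnit tools read. The element and attribute names follow the commonly used convention, but exact schemas differ between tools, so treat them as an assumption; `to_xunit` and the sample results are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xunit(suite_name, results):
    """Render (name, seconds, failure_message_or_None) tuples as a JUnit-style XML string."""
    failures = sum(1 for _, _, message in results if message is not None)
    suite = ET.Element(
        "testsuite",
        name=suite_name,
        tests=str(len(results)),
        failures=str(failures),
    )
    for name, seconds, message in results:
        case = ET.SubElement(suite, "testcase", name=name, time=f"{seconds:.3f}")
        if message is not None:
            failure = ET.SubElement(case, "failure", message=message)
            failure.text = message
    return ET.tostring(suite, encoding="unicode")


# Two made-up results: one pass and one failure.
report = to_xunit(
    "hello_grid_world",
    [("submit_simple_job", 1.25, None), ("retrieve_file", 0.4, "time-out after 30 s")],
)
```

Because every run produces the same structure, archived reports can be parsed later to compare failure counts or timings across releases.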
Interactive vs. Automated Tests
• This is a debated issue (related to the reproducibility and debugging issues)
• Some people say that the more complex a system is, the fewer meaningful automated tests you can run
• Other people say that the more complex a system is, the more necessary automated tests become
• The truth is probably in between: you need both, and whatever test tools you use should allow you to do both
• A sensible approach is to run distributed automated tests using a test framework and to freeze the machines where problems occur, so that more interactive tests can be done if the available output is not enough
Methodologies
• Unit testing
• Metrics
• Installation and configuration
• ‘Hello grid world’ tests and ‘Grid Exercisers’
• Functional and non-functional tests
Unit Testing
• Unit tests are tests performed on the code during or immediately after a build
• They should be independent of the environment and of the test sequence
• They are not used to test functionality, but the nominal behaviour of functions and methods
• Unit tests are a responsibility of the developers and in some models (test-driven development) they should be written before the code
• It is proven that up to 75% of the bugs of a system can in principle be stopped by doing proper unit tests
• It is also proven that they are the first thing that is skipped as soon as a project is late (which normally happens within the initial 20% of its life)
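A unit test in this sense needs no grid and no network. The sketch below uses Python's standard `unittest` module (the PyUnit mentioned later in the talk); the function under test, `parse_endpoint`, and the host name are hypothetical and stand in for any small function in the code base.

```python
import unittest


def parse_endpoint(text):
    """Split 'host:port' into (host, port); hypothetical function under test."""
    host, sep, port = text.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"not a host:port endpoint: {text!r}")
    return host, int(port)


class ParseEndpointTest(unittest.TestCase):
    def test_nominal(self):
        self.assertEqual(parse_endpoint("se.example.org:8443"), ("se.example.org", 8443))

    def test_missing_port(self):
        with self.assertRaises(ValueError):
            parse_endpoint("se.example.org")

    def test_non_numeric_port(self):
        with self.assertRaises(ValueError):
            parse_endpoint("se.example.org:srm")


def run_suite():
    """Run the test case programmatically and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ParseEndpointTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test builds its own input and asserts on the return value only, so the tests can run in any order and on any build machine.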
Metrics
• Another controversial point
• Metrics by themselves are not extremely useful
• However, used together with the other test methodologies they can provide some interesting information about the system
[Figure: gLite bug trend examples]
Installation and Configuration
• As mentioned, dedicate some time to testing the installation and configuration of the services
• Use automated systems for installing and configuring the services (system management tools, APT, YUM, quattor, SMS, etc.). No manual installations!
• Test upgrade scenarios from one version of a service to another
• Many interoperability and compatibility issues are immediately discovered when restarting a service after an upgrade
‘Hello, grid world’ tests and ‘Grid Exercisers’
• Now you have an installed and configured service. So what?
• A good way of starting the tests is to have a set of nominal ‘Hello, grid world’ tests and ‘Grid Exercisers’
• Such tests should perform a number of basic, black-box tests, like submitting a simple job through the chain, retrieving a file from storage, etc.
• The tests should be designed to exercise the system from end to end, but without focusing too much on the internals of the system
• No other tests should start until the full set of exercisers runs consistently and reproducibly in the testbed
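The rule that nothing else starts until the exercisers run consistently can be expressed as a simple gate. This is a minimal sketch, assuming each exerciser is a callable that returns a truthy value on success; `run_exercisers`, `gate` and the repetition count are hypothetical, and a real exerciser would submit a job or fetch a file instead of calling a stub.

```python
def run_exercisers(exercisers, repetitions=3):
    """Run each exerciser several times; return the names of those that ever failed."""
    unstable = []
    for name, check in exercisers:
        for _ in range(repetitions):
            try:
                passed = bool(check())
            except Exception:
                passed = False  # an exception counts as a failed run
            if not passed:
                unstable.append(name)
                break
    return unstable


def gate(exercisers, repetitions=3):
    """Return True only when every exerciser passed on every repetition."""
    return not run_exercisers(exercisers, repetitions)
```

Repeating each exerciser is a deliberate choice: a check that passes once and fails on the next run signals exactly the kind of instability that would make later functional results hard to interpret.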
Functional and Non-Functional Tests
• At this point you can fire the full complement of:
– Regression tests (verify that old bugs have not resurfaced)
– Functional tests (black and white box)
– Performance tests
– Stress tests
– End-to-end tests (response times, auditing, accounting)
• Of course this should be done:
– for all services and their combinations
– on as many platforms as possible
– with full security in place
– using meaningful test configurations and topologies
The Grid as a Test Tool
• Intragrids
• Certification and Pre-Production environments
• Virtualization and the Virtual Test Lab
• Grid Test Frameworks
• State of the Art
Intragrids
• Intragrids are becoming more common, especially in commercial companies
• An intragrid is a grid of computing resources entirely owned by a single company or institute, not necessarily in the same geographical location
• Often they use very specific (enhanced) security protocols
• They are often used as tools to increase the efficiency of a company's internal processes
• But there are also cases of intragrids used as test tools
• A typical example is the intragrid used by CPU manufacturers like Intel to simulate their hardware or test compilers on multiple platforms
Certification and Pre-Production
• In order to test grid middleware and applications in meaningful contexts, the testbeds should be as close a reproduction as possible of real grid environments
• A typical approach is to have Certification and Pre-Production environments designed as smaller-scale, but full-featured grids with multiple participating sites
• A certification testbed is typically composed of a complete, but limited set of services, usually within the same network. It is used to test nominal functionality
• A pre-production environment is a full-fledged grid, with multiple sites and services, used by grid middleware and application providers to test their software
• A typical example is the EGEE pre-production environment, where gLite releases and HEP or biomed grid applications are tested before they are released to production
Virtualization
• As we have seen, the Grid must embrace diversity in terms of platforms, development languages, deployment methods, etc.
• However, testing all resulting combinations is very difficult and time consuming, not to mention the manpower required
• Automation tools can help, but providing and especially maintaining the required hardware and software resources is not trivial
• In addition, running tests on clean resources is essential for enforcing reproducibility
• A possible solution is the use of virtualization
The Standard Test Lab
[Diagram: Test Framework driving a set of physical test platforms]
• Each test platform has to be preinstalled and maintained
• Elevated-privileges tests cannot be easily done (security risks)
• Required for performance and stress tests
The Virtual Test Lab
[Diagram: Test Framework, Virtualization Software (Xen, MS Virtual Server, VMware)]
• Images can contain preinstalled OSs in fixed, reproducible configurations
• The testbed is only composed of a limited number of officially supported platforms
• It allows elevated-privileges tests to be performed. Security risks are minimized, because the image is destroyed when the test is over. But it can also be archived for later offline analysis of the tests
Grid Test Frameworks
• A test framework is a program or a suite of programs that helps manage and execute tests and collect the results
• They range from low-level frameworks like xUnit (JUnit, PyUnit, CppUnit, etc.) to full-fledged grid-based tools like NMI, Inca and ETICS (more on this later)
• It is recommended to use such tools to make test execution reproducible, to automate or replicate tasks across different platforms, and to collect and analyse results over time
• But remember one of the previous tenets: make sure your tests can be run manually and that the test framework doesn’t prevent that
State of the Art
• NMI
• Inca
• ETICS
• OMII-Europe
NMI
• NMI is a multi-platform facility designed to provide (automated) software building and testing services for a variety of (grid) computing projects
• NMI is a layer on top of Condor that abstracts the typical complexity of the build and test process
• Condor offers mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributed computing resources
NMI (2)
NMI (3)
NMI (4)
• Currently used by:
– Condor
– Globus
– VDT
INCA
• Inca is a flexible framework for the automated testing, benchmarking and monitoring of Grid systems. It includes mechanisms to schedule the execution of information gathering scripts and to collect, archive, publish, and display data
• Originally developed for the TeraGrid project
• It is part of NMI
INCA (2)
ETICS
[Architecture diagram. Labels: Clients (via browser, via command-line tools), Web Application, Web Service, Report DB, Project DB, Build/Test Artefacts, NMI Scheduler, NMI Client, WNs, ETICS Infrastructure]
ETICS (2)
• Web Application layout (project structure)
ETICS (3)
ETICS (4)
• Currently used or being evaluated by:
– EGEE for the gLite middleware
– DILIGENT (digital libraries on the grid)
– CERN IT FIO Team (quattor, castor)
• Open discussion ongoing with HP, Intel, Siemens to identify potential commercial applications
Conclusions
• Testing for the grid and with the grid is a difficult task
• Overall quality (ease of use, reliable installation and configuration, end-to-end security) is not always at the level that industry would find viable or cost-effective for commercial applications
• It is essential to dedicate effort to testing and improving the quality of grid software by using dedicated methodologies and facilities and by sharing resources
• It is also important to educate developers to think in terms of QA
• However, the prize for this effort would be a software engineering platform of unprecedented potential and flexibility
Panel discussion?