
Reliability study of an embedded operating system for industrial applications
Pardo, J., Campelo, J. C., Serrano, J. J.
Juan Pardo
Fault Tolerant Systems Group
Polytechnic University of Valencia, Spain

Research Objectives
• Critical industrial applications and fault-tolerant applications need operating systems (OS) that guarantee correct and safe behaviour despite the appearance of errors.
• Software fault injection techniques can be used to validate the behaviour of an operating system in the presence of errors.
• These techniques can corrupt the information passed in operating system calls to see how the system reacts to invalid or corrupted values at the kernel calls.
SEPT '04, WSRS '04

Research Objectives
• This work presents the development and results of software fault injection in an embedded system composed of a Real-Time Operating System (RTOS) and a microcontroller.
• A software fault injection tool has been developed. The proposed methodology treats the operating system as a black box whose source code is not available.
• To that end, a layer has been developed between the operating system and the application to be executed.
• The OS error detection coverage has been measured, and the critical OS data structures that should be improved are discussed, in order to improve the final robustness of the operating system.

Introduction
• Computer software is involved in many aspects of our lives. Despite its enormous expansion, it is still far from perfect.
• Tests are required to measure the quality of software.
• Fault tolerance deals with software's ability to hide problems, specifically the effects of faults [Voas 98].
• Robustness is the degree to which a system operates correctly in the presence of exceptional inputs or stressful environmental conditions.
• Robustness can thus be viewed as an indication of the OS capacity to resist/react to faults induced by the applications running on top of it, or originating from the hardware layer or from device drivers [DBench 02].

Introduction
• Fault Tolerant System
  • Fault tolerance is intended to preserve the delivery of correct service in the presence of active faults. It is generally implemented by error detection and subsequent system recovery.
  • A system able to continue working despite the appearance of errors.
  • Safe behaviour: a known state that does not pose any risk to the system.
• Dependability
  • To avoid the loss of human lives or significant economic losses.
  • Final product quality.
  • Validation before going to market.

Introduction
Dependability: the dependability of a computing system is the ability to deliver service that can justifiably be trusted (A. Avizienis, J.-C. Laprie, B. Randell).
• Attributes: availability, reliability, safety, confidentiality, integrity, maintainability
• Means: fault prevention, fault tolerance, fault removal, fault forecasting
• Threats: faults, errors, failures

State of the art: Fault Injection Techniques
• Fault injection (FI) on simulated models
  • VHDL simulation models
  • Other languages
• FI on prototypes
  • Hardware injection (HWIFI)
    • External: HWIFI at pin level, electromagnetic perturbations
    • Internal: heavy-ion radiation, laser radiation, scan chain
  • Software injection (SWIFI)
    • Time: static, dynamic
    • Level: high level, machine language
• Objectives: prediction, elimination

Advantages & drawbacks (SWIFI)
• Controllability: total control over when and where to inject
• Simulation of higher-level faults
• Reduced cost
• Higher reachability
• Flexibility: higher portability
• Low risk of damaging the circuit under test
• Easy automation of the injection campaigns
• Good observability: processors increasingly include internal debugging tools

Advantages & drawbacks (SWIFI)
• There are zones that software cannot reach.
• Less precise timing measurements: interference with the system, overload, etc.
• Injection and activation agents overload the system.
• Runtime injection: little intrusion
• Objective: minimize the overload
  • Drawback for RTOS
  • Easy automation of injection campaigns
  • Pre-runtime injection: less intrusion

SW Fault Injection
• SW fault injection tools:
  • FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
  • EFI, PROFI: Processor Fault Injector, Dortmund University.
  • FERRARI: Fault and ERRor Automatic Real-time Injector, Texas University.
  • SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, Michigan University.
  • FINE: Fault Injection and moNitoring Environment, Illinois University.
  • FTAPE: Fault Tolerance and Performance Evaluator, Illinois University.
  • XCEPTION: Coimbra University.
  • MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
  • BALLISTA: Carnegie Mellon University.

Tools
• MicroC/OS-II RTOS
• Infineon C166 microcontroller
• Tasking compiler, debugger...
[Block diagram: RAM 1 KByte, XRAM 1 KByte, CAN, BUS CONTROL, CORE, ROM, IR+PEC, PWM, INTERRUPT UNIT CONTROL, SSC, WDT, CAPCOM, ADC, GPT 1+2, USART 1+2]
Infineon microcontroller characteristics:
• 16-bit, high performance
• On-chip CMOS
• 16.5 MIPS, 25/33 MHz
• Advantages from CISC & RISC
• High functionality for peripherals
• Typical for automotive

COTS components
• The main motivation for using Commercial Off-The-Shelf (COTS) components in a system design is the notable cost reduction in final product development.
• COTS components are a cost-effective means of rapid prototyping of complex software systems.
• On the other hand, COTS software components pose serious certification problems because their design process is unknown.

COTS components
• COTS software is composed of general-purpose components with poor dependability specifications.
• COTS components are usually like a black box: the source code is not available and their internal architecture (structure and data flow) is not adequately documented.

µC/OS-II Operating System
• It was selected because it has been widely used for several years. The first version of MicroC/OS dates from 1992.
• Used in industrial robots, motor control, medical instruments, etc.
• It is 99% compliant with the Motor Industry Software Reliability Association (MISRA) C Coding Standards.
• All Modified Condition Decision Coverage (MCDC) code in MicroC/OS-II has been removed, improving code quality for RTCA/EUROCAE DO-178B Level A certified environments for avionics applications (Validated Software Comp.).

µC/OS-II: Characteristics (Jean J. Labrosse)
• Portable: µC/OS-II is written in highly portable ANSI C, with target-microprocessor-specific code written in assembly language.
• ROMable: it was designed for embedded applications. With the proper tool chain (i.e., C compiler, assembler, and linker/locator), µC/OS-II can be embedded as part of a product.
• Scalable: only the services needed by the application have to be included, which reduces the amount of memory (both RAM and ROM) needed. Scalability is accomplished with conditional compilation.
• Preemptive: µC/OS-II is a fully preemptive real-time kernel. It always runs the highest-priority task that is ready.
• Multitasking: µC/OS-II can manage up to 64 tasks; however, the current version of the software reserves eight of these tasks for system use, leaving up to 56 tasks for the application. Each task has a unique priority assigned to it, which means that µC/OS-II cannot do round-robin scheduling.

µC/OS-II: Characteristics (Jean J. Labrosse)
• Deterministic: the execution time of all µC/OS-II functions and services is deterministic, so it is always known how long µC/OS-II will take to execute a function or a service. Furthermore, the execution time of µC/OS-II services does not depend on the number of tasks running in the application.
• Task stacks: each task requires its own stack; µC/OS-II allows each task to have a different stack size, which reduces the amount of RAM needed by the application.
• Services: system services such as mailboxes, queues, semaphores, fixed-size memory partitions, time-related functions, etc.
• Interrupt management: interrupts can suspend the execution of a task. If a higher-priority task is awakened as a result of the interrupt, the highest-priority task will run as soon as all nested interrupts complete. Interrupts can be nested up to 255 levels deep.
• Robust and reliable: µC/OS-II is based on µC/OS, which has been used in hundreds of commercial applications since 1992.
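
The "Scalable" characteristic relies on conditional compilation. The following is a minimal sketch of the idea only; the `CFG_*` switches and the service functions are hypothetical names chosen for illustration and are not taken from the kernel's configuration header.

```c
/* Hypothetical configuration switches (illustrative names only).
 * Setting a switch to 0 compiles the corresponding service out of the image. */
#define CFG_SEM_EN   1
#define CFG_MBOX_EN  0
#define CFG_QUEUE_EN 1

#if CFG_SEM_EN
static int sem_service(void)   { return 1; }  /* placeholder semaphore service */
#endif
#if CFG_MBOX_EN
static int mbox_service(void)  { return 1; }  /* placeholder mailbox service */
#endif
#if CFG_QUEUE_EN
static int queue_service(void) { return 1; }  /* placeholder queue service */
#endif

/* Counts the services that were actually compiled in. */
int enabled_service_count(void)
{
    int n = 0;
#if CFG_SEM_EN
    n += sem_service();
#endif
#if CFG_MBOX_EN
    n += mbox_service();
#endif
#if CFG_QUEUE_EN
    n += queue_service();
#endif
    return n;
}
```

With the switches as written, the mailbox code never reaches the compiler, so it consumes neither ROM nor RAM.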

Black box approach
• The aim was to use a black-box approach to study the OS.
• The OS source code was therefore not modified, to avoid intrusion into the OS behaviour as far as possible.
• To that end, a layer named Meta Kernel was developed between the OS and the application to be executed.
• Through this layer, faults were injected into any of the parameters of the system calls to measure the OS robustness.
• "In black box testing, input is fed into a program and the output is checked. What goes on inside the program (the black box) is unimportant." (Voas 98)
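
An interposition layer of this kind can be sketched as a wrapper that sits between the application and a kernel service. This is an illustration under stated assumptions, not the authors' Meta Kernel: `mk_call`, `mk_arm_injection` and the mock service `mock_task_resume` are hypothetical names, and the mock only imitates a kernel call that rejects priorities outside 0..63 with an error code.

```c
#include <stdint.h>

/* Signature of a (stand-in) kernel service taking one 16-bit parameter. */
typedef uint8_t (*kernel_call_fn)(uint16_t arg);

static uint16_t mk_fault_mask = 0;  /* bits to flip in the next call; 0 = no injection */
static uint16_t mk_last_sent  = 0;  /* last value actually passed to the kernel */

/* Arms a single injection: the next wrapped call gets `mask` XOR-ed into its parameter. */
void mk_arm_injection(uint16_t mask) { mk_fault_mask = mask; }

/* Wrapper the application calls instead of the kernel service itself.
 * The kernel code is untouched; only the parameter on its way in is corrupted. */
uint8_t mk_call(kernel_call_fn kernel_call, uint16_t arg)
{
    uint16_t sent = (uint16_t)(arg ^ mk_fault_mask);
    mk_fault_mask = 0;              /* transient fault: one injection per arming */
    mk_last_sent = sent;
    return kernel_call(sent);
}

uint16_t mk_last_argument(void) { return mk_last_sent; }

/* Mock kernel service for demonstration only: 0 = OK, 42 = invalid priority. */
uint8_t mock_task_resume(uint16_t prio) { return (prio < 64u) ? 0u : 42u; }
```

Because the fault is applied in the wrapper, the kernel can stay a binary-only component, which is the point of the black-box approach.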

System Design
• MicroC/OS-II OS treated as a black box
• OS source code not modified
• Injector layer between the OS and the application
• Injection on the parameters of system calls

Injector Attributes: SOFTWARE FAULT INJECTION ATTRIBUTES
Attribute taxonomy:
• Objectives: fault prediction, fault removal
• Time: pre-runtime, runtime
• Faults: level, localization, persistence, multiplicity (number of faults injected simultaneously in each experiment), type, duration
• Workload: real applications, benchmarks, synthetic programs
Attributes of the developed injector:
• Prediction, elimination
• Pre-runtime & runtime software fault injection
• High level
• Transient faults
• Changing one bit at the system calls (bit-flip)
• One fault injected per experiment
• Workload for tool testing

Workload Design
Characteristics:
• Maximum consumption of system calls
• System calls for synchronization, semaphores, memory, queues, messages, task handling, timing management, etc.
• Open module to include calculations
• Workload for testing the injection tool and the OS

Workload Design
• The system workload ran continuously and consisted of a series of tasks executing the application.
• In addition, an injection agent was in charge of injecting faults and invalid values at the kernel calls in order to monitor the system robustness.

Errors Classification After the Fault Injection
Events after fault injection:
• Detected errors: OS error code, C167 error code, application error
• No error (correct result): system call not used, or system call used but the injection has no effect
• Others → Not Safe Faults (NFS): errors which could affect the system
• Classification related to the detection mechanisms
• Measures of error detection coverage and latency times

Injection Model
• The faultload is the most critical dimension of an OS benchmark and, more generally, of any dependability benchmark.
• Two techniques for system call parameter corruption can be used: the 'bit flip technique', which systematically flips bits of the target parameters, and the 'selective substitution technique', in which invalid data values are introduced into the system call parameters.
• Studies have demonstrated the equivalence of the errors provoked by the two techniques [DBench 02].
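
The two corruption techniques can be contrasted with a short sketch. The 16-bit parameter width and the contents of `INVALID_VALUES` are assumptions made for illustration; the source does not list the substituted values.

```c
#include <stdint.h>
#include <stddef.h>

/* Bit flip technique: invert one bit of the target parameter. */
uint16_t corrupt_bit_flip(uint16_t value, unsigned bit)
{
    return (uint16_t)(value ^ (uint16_t)(1u << (bit % 16u)));
}

/* Selective substitution technique: replace the parameter with a value
 * drawn from a set of invalid/boundary values (hypothetical set). */
static const uint16_t INVALID_VALUES[] = { 0x0000u, 0xFFFFu, 0x8000u, 0x7FFFu };
#define INVALID_COUNT (sizeof INVALID_VALUES / sizeof INVALID_VALUES[0])

uint16_t corrupt_substitute(uint16_t value, size_t choice)
{
    (void)value;  /* the original value is discarded, not modified */
    return INVALID_VALUES[choice % INVALID_COUNT];
}
```

A bit flip keeps the corrupted value close to the original, whereas substitution jumps straight to a chosen invalid value.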

Injection Model
• BIT FLIP technique
• Chosen randomly at runtime:
  1. System call
  2. Parameter
  3. Bit
• Models the consequences of physical faults:
  • EMI
  • Noise
  • Hardware faults...
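
The three random choices above (system call, parameter, bit) could be drawn as follows. This is a sketch under assumptions: the number of calls, the `PARAM_COUNT` table and the 16-bit parameter width are hypothetical, and a small linear congruential generator stands in for whatever random source the tool used, so that campaigns are reproducible from a seed.

```c
#include <stdint.h>

#define NUM_CALLS  4u    /* hypothetical number of instrumented system calls */
#define PARAM_BITS 16u   /* assumed parameter width in bits */

/* Hypothetical number of parameters of each instrumented call. */
static const uint8_t PARAM_COUNT[NUM_CALLS] = { 1, 3, 2, 2 };

typedef struct {
    uint8_t call;   /* index of the system call to corrupt */
    uint8_t param;  /* index of the parameter within that call */
    uint8_t bit;    /* bit position to flip */
} fault_target;

/* Small LCG; the upper bits are used because the low bits are weak. */
static uint32_t prng_state = 1u;
void prng_seed(uint32_t s) { prng_state = s; }
static uint32_t prng_next(void)
{
    prng_state = prng_state * 1664525u + 1013904223u;
    return prng_state >> 16;
}

/* Picks 1. the system call, 2. one of its parameters, 3. the bit to flip. */
fault_target pick_target(void)
{
    fault_target t;
    t.call  = (uint8_t)(prng_next() % NUM_CALLS);
    t.param = (uint8_t)(prng_next() % PARAM_COUNT[t.call]);
    t.bit   = (uint8_t)(prng_next() % PARAM_BITS);
    return t;
}
```

Recording the seed with each experiment lets a run that hangs the system be replayed exactly.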

Analysis of the obtained results
• Codification of the different output values:
  • D0: No error, correct output (the fault injection did not affect the system).
  • D1: Error detected by the operating system (µC/OS-II error code).
  • D2: Error detected by the application (the application result was not correct).
  • D3: Error that caused the system to hang (system failure).
  • D4: Error detected by the microcontroller.
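
The outcome codes map naturally onto an enumeration. In the sketch below, the `observation` fields and the order in which they are checked are assumptions for illustration; the source does not state how overlapping symptoms were resolved.

```c
#include <stdbool.h>

/* Outcome codes D0..D4 as defined on the slide. */
typedef enum {
    D0_NO_ERROR,      /* correct output, injection had no effect */
    D1_OS_DETECTED,   /* OS returned an error code */
    D2_APP_DETECTED,  /* application result was not correct */
    D3_SYSTEM_HANG,   /* system failure */
    D4_MCU_DETECTED   /* microcontroller signalled the error */
} outcome;

/* What the monitor observed after one experiment (hypothetical fields). */
typedef struct {
    bool system_hung;
    bool mcu_trap;
    bool os_error_code;
    bool app_result_wrong;
} observation;

/* Assumed precedence: hang, then microcontroller, then OS, then application. */
outcome classify(const observation *o)
{
    if (o->system_hung)      return D3_SYSTEM_HANG;
    if (o->mcu_trap)         return D4_MCU_DETECTED;
    if (o->os_error_code)    return D1_OS_DETECTED;
    if (o->app_result_wrong) return D2_APP_DETECTED;
    return D0_NO_ERROR;
}
```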

Analysis of the obtained results
Coverage [Powell 95, Constantinescu 95]:
• Complete system (µC/OS-II + microcontroller): C_cs = D0 + D1 + D2 + D4 = 65.7 + 21 + 2.5 = 91.2 %
• Operating system (µC/OS-II): C_OS = D0 + D1 = 86.7 %
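
The two coverage figures are sums of outcome percentages, which can be computed from per-outcome experiment counts. The sketch below assumes the counts are available in a simple struct; the struct and function names are hypothetical, and any counts passed in are the caller's data, not values from this study.

```c
/* Number of experiments that ended in each outcome D0..D4. */
typedef struct {
    unsigned d0, d1, d2, d3, d4;
} outcome_counts;

static unsigned total_experiments(const outcome_counts *c)
{
    return c->d0 + c->d1 + c->d2 + c->d3 + c->d4;
}

/* C_cs = D0 + D1 + D2 + D4, as a percentage of all experiments. */
double system_coverage(const outcome_counts *c)
{
    unsigned n = total_experiments(c);
    if (n == 0) return 0.0;  /* no experiments: report 0 instead of dividing by zero */
    return 100.0 * (double)(c->d0 + c->d1 + c->d2 + c->d4) / (double)n;
}

/* C_OS = D0 + D1, as a percentage of all experiments. */
double os_coverage(const outcome_counts *c)
{
    unsigned n = total_experiments(c);
    if (n == 0) return 0.0;
    return 100.0 * (double)(c->d0 + c->d1) / (double)n;
}
```

Only D3 (system hang) is excluded from the complete-system coverage, so 100 minus C_cs is the share of experiments that ended in a system failure.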

Analysis of the obtained results
• Error detection latencies
  • Time between the injection and its detection by the OS
  • Mean value obtained: 304 μs
  • A built-in timer of the microcontroller was used to measure latencies
    • High precision
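
Measuring a latency with a free-running hardware timer comes down to subtracting two timer readings and converting ticks to time. The sketch below assumes a 16-bit up-counting timer and a tick period of 400 ns; both are illustrative assumptions, not the configuration used in the study.

```c
#include <stdint.h>

#define TIMER_TICK_NS 400u  /* hypothetical tick period in nanoseconds */

/* Ticks elapsed between two readings of a free-running 16-bit counter.
 * Unsigned wrap-around makes this correct across one overflow, provided
 * the measured interval is shorter than one full timer period. */
uint16_t elapsed_ticks(uint16_t start, uint16_t end)
{
    return (uint16_t)(end - start);
}

/* Converts ticks to whole microseconds (truncating). */
uint32_t ticks_to_us(uint16_t ticks)
{
    return ((uint32_t)ticks * TIMER_TICK_NS) / 1000u;
}

/* Mean of n latency samples in microseconds; returns 0 for n == 0. */
uint32_t mean_latency_us(const uint32_t *samples, unsigned n)
{
    uint64_t sum = 0;
    for (unsigned i = 0; i < n; i++) sum += samples[i];
    return (n == 0) ? 0u : (uint32_t)(sum / n);
}
```

The timer would be read once when the fault is injected and once when the OS returns its error code.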

Other Results
'E1' was the most frequent error code. It corresponds to 'OS_ERR_EVENT_TYPE' and was produced when the fault was injected into a semaphore, message queue or mailbox; the system reacted by going into a hanging state. The second most frequent code, 'E42' ('OS_PRIO_INVALID'), was obtained when the injection targeted task management system calls.

Frequency table of the most typical error codes given by the OS

Code    Frequency   Percentage   Cumulative percentage   Error code
E1      111         41.1         41.1                    OS_ERR_EVENT_TYPE
E11     14          5.2          46.3                    OS_MEM_INVALID_PART
E40     8           3.0          49.3                    OS_TASK_DEL_ERR
E41     3           1.1          50.4                    OS_PRIO_ERR
E42     69          25.6         75.9                    OS_PRIO_INVALID
E60     13          4.8          80.7                    OS_TASK_DEL_ERR
E81     11          4.1          84.8                    OS_TIME_INVALID_MINUTES
E82     2           0.7          85.6                    OS_TIME_INVALID_SECONDS
E83     10          3.7          89.3                    OS_TIME_INVALID_MILLI
Ex      29          10.7         100.0                   NO CODE
Total   270         100.0
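
The percentage and cumulative percentage columns follow directly from the frequency column. The sketch below recomputes them from the frequencies reported in the table; the function names are illustrative.

```c
#include <stdio.h>

#define NUM_CODES 10u

/* Frequencies from the table, in row order:
 * E1, E11, E40, E41, E42, E60, E81, E82, E83, Ex. */
static const unsigned FREQ[NUM_CODES] = { 111, 14, 8, 3, 69, 13, 11, 2, 10, 29 };

unsigned freq_total(void)
{
    unsigned sum = 0;
    for (unsigned i = 0; i < NUM_CODES; i++) sum += FREQ[i];
    return sum;
}

/* Share of row i in the total, in percent. */
double percentage(unsigned i)
{
    return 100.0 * (double)FREQ[i] / (double)freq_total();
}

/* Share of rows 0..i in the total, in percent. */
double cumulative_percentage(unsigned i)
{
    unsigned sum = 0;
    for (unsigned k = 0; k <= i; k++) sum += FREQ[k];
    return 100.0 * (double)sum / (double)freq_total();
}

/* Prints one line per row: frequency, percentage, cumulative percentage. */
void print_table(void)
{
    for (unsigned i = 0; i < NUM_CODES; i++)
        printf("%3u  %5.1f  %5.1f\n", FREQ[i], percentage(i), cumulative_percentage(i));
}
```

Calling `print_table()` gives a quick consistency check of the reported columns against the raw frequencies.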

Other Results
Moreover, after the injection campaigns it was possible to see how errors propagated through the system. The corrupted system call was recorded, together with the system call that finally detected the error and the time the system took to detect it.
[Figure: Error Propagation]

Other Results
• Finally, the most critical system calls were identified, with the aim of improving their robustness and, in turn, the final OS dependability.
• For example, in some data structures related to the event control block, the injection produced many failures, and most of the time the system hung.
• This is because these structures store the list of tasks waiting for an event. If the injection corrupts that information, the system loses the sequence of next actions and goes to a non-safe state without knowing how to react (the system hangs).
• This indicates where special attention is needed, since an error in those data structures could provoke critical failures in the system.

Conclusions
• From the experiments, the error detection coverage, error detection latency times, error propagation, typical OS error codes, etc. have been obtained.
• Fault injection into the code and data memory segments of the microkernel will also be implemented.
• Possible improvements to increase the dependability of MicroC/OS-II should take into account that errors in certain data structures could provoke critical failures in the system.
• These data structures should implement some mechanism to protect the information they hold.
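
The slides do not say which protection mechanism is intended. As one possible illustration only, a critical field can be stored together with its bitwise complement so that a single bit flip is detected before the value is used. The `guarded_wait_list` type and its fields are hypothetical and do not reproduce the kernel's event control block.

```c
#include <stdint.h>
#include <stdbool.h>

/* A critical bitmap stored with a redundant, inverted copy. */
typedef struct {
    uint8_t wait_group;      /* hypothetical bitmap of waiting tasks */
    uint8_t wait_group_inv;  /* bitwise complement of wait_group */
} guarded_wait_list;

/* Every write updates both copies. */
void gwl_write(guarded_wait_list *w, uint8_t value)
{
    w->wait_group = value;
    w->wait_group_inv = (uint8_t)~value;
}

/* Every read first checks that the two copies are still complementary.
 * Returns false (and leaves *out untouched) when corruption is detected. */
bool gwl_read(const guarded_wait_list *w, uint8_t *out)
{
    if ((uint8_t)(w->wait_group ^ w->wait_group_inv) != 0xFFu)
        return false;
    *out = w->wait_group;
    return true;
}
```

A detected mismatch would let the kernel return an error code instead of following a corrupted wait list, at the cost of one extra byte per protected field and a check on each access.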

Future Research
• In future work, these data have to be compared with other COTS RTOS working under the same conditions.
• RT fault injector to minimize intrusion (without internal debug support, intrusion > 0)
• Nexus-implemented fault injection
  • Other architecture: Motorola MPC565
  • Intrusion > null
  • Preliminary results
  • Better controllability and observability
  • Best option to validate RTOS and applications

Contact Data
Juan Pardo
Fault Tolerant Systems Group
Polytechnic University of Valencia, Spain
Email: juaparal@upvnet.upv.es
Web: http://www.disca.upv.es/gstf/