- Number of slides: 77
TU Wien The Time-Triggered Architecture H. Kopetz November 2014
Outline • Introduction • Global Time • TTA Components • Determinism • Communication in the TTA • Error Handling Mechanisms • Conclusion
Automation of Sinter Plant at Posco, Korea Around 1975, my employer, VOEST, constructed a computerized sinter plant in Pohang, South Korea. • With state of the art real-time computer technology much time and patience was needed in the integration phase to tune and commission the plant on site. • Precise specification of the temporal properties of the interfaces of the components is missing. • Need for a solid engineering methodology for the design of dependable distributed real-time systems.
CV of Hermann Kopetz [Timeline figure, 1968 to 2013, with an academic and an industrial track. Entries as labeled in the figure: Emeritus; TU Wien; Start of TTTech; TU Berlin; Voest Alpine, Linz; Ass. Prof., USA; Ph.D. in Physics, Vienna; Problem Awareness.]
The Objectives of the TTA The TTA is based on more than thirty years of research on the topic of dependable real-time systems. • To establish a platform for the design of dependable component-based cyber-physical systems (CPS) that can be adapted to the needs of different application domains. • To provide global coordination in a distributed real-time system without a central point of failure. In order to understand the design decisions taken in, and the dependability mechanisms provided by, the TTA we have to understand the characteristics of CPS.
CPS: Physical World meets Cyber World P-System C-System j m k Real Time Com. System n l Event in the P System Interface Node of the C-System
Physical (P) System versus Cyber (C) System • P-System: controlled by the laws of physics; physical time; dense time base. • C-System: controlled by program execution; execution time; sparse time base. The model of the P-system that is used by the C-system must be aware of the progression of physical time.
Focus on Safety Critical Systems--No Backup Fly-by-wire Airplane: There is no mechanical or hydraulic connection between the pilot controls and the control surfaces. Drive-by-wire Car: There is no mechanical or hydraulic connection between the steering wheel and the wheels.
No Backup--the 10^-9 Challenge • The system as a whole must be more reliable than any one of its components: e.g., system dependability 1 FIT, component dependability 1000 FIT (1 FIT: 1 failure in 10^9 hours). • The architecture must be distributed and support fault tolerance to mask component failures. • The system as a whole is not testable to the required level of dependability. The safety argument is based on a combination of experimental evidence about the expected failure modes and failure rates of fault-containment units (FCUs) and a formal dependability model that depicts the system structure from the point of view of dependability. Independence of the FCUs is a critical issue.
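The FIT figures on this slide can be turned into more familiar quantities with a few lines of arithmetic. The following is a minimal sketch, assuming a constant failure rate (exponential model); the function names and the 10-hour mission length are choices made for this example, not part of the slides.

```python
import math

HOURS_PER_FIT_UNIT = 1e9  # 1 FIT = 1 failure in 10^9 hours

def fit_to_mttf_hours(fit):
    """Mean time to failure in hours for a constant failure rate given in FIT."""
    return HOURS_PER_FIT_UNIT / fit

def failure_probability(fit, mission_hours):
    """Probability of at least one failure during the mission (exponential model)."""
    rate_per_hour = fit / HOURS_PER_FIT_UNIT
    return -math.expm1(-rate_per_hour * mission_hours)

component_mttf = fit_to_mttf_hours(1000)  # 1e6 hours, roughly 114 years
system_mttf = fit_to_mttf_hours(1)        # 1e9 hours
p_component = failure_probability(1000, 10)  # assumed 10-hour mission
p_system = failure_probability(1, 10)
```

Under these assumptions a 1000 FIT component fails with a probability of about 10^-5 in a 10-hour mission, while a 1 FIT system fails with a probability of about 10^-8, which is the gap that the fault-tolerant architecture has to close.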
Mitigation of Soft Hardware Errors Architectural means to mitigate the consequences of component failures might become a necessity when using the upcoming submicron devices, as stipulated in the International Roadmap of Semiconductors: Relaxing the requirement of 100% correctness for devices and interconnects may dramatically reduce the costs of manufacturing, verification and test. Such a paradigm shift is likely forced in any case by technology scaling, which leads to more transient and permanent failures of signals, logic values, devices and interconnects.
Certification as a Design Driver Certification is only possible if a system has been designed for certification: • Independence of Fault-Containment Units (FCU) • Elimination of temporal error propagation from one FCU to the network and in consequence to the other FCUs • Deterministic Operation • Formal Analysis of Critical Algorithms • Modular Composition of Correctness Argument
Control System Requirements--Periodicity is not mandatory, but often assumed as it leads to simpler algorithms and more stable and secure systems. Most of the algorithms developed with this assumption are very sensitive to period duration variations, jitter at the starting instant. This is especially the case of motor controllers in precision machines. Simultaneous sampling of inputs is also an important stability factor. From: Decotignie, J.-D., Which Network for Which Application, in The Industrial Communication Technology Handbook, R. Zurawski, Editor. 2005, Taylor and Francis: Boca Raton. p. 19-1 to 19-15 (quote from p. 19-4).
Architectural Principles of the TTA • Component Orientation: A component is a hardware/software unit. • All components are time-aware: A global time is provided by the platform. • Separation of Computation from Communication: Components and communication systems can be developed independently. • Core Services are deterministic: Modular certification is supported by the architecture. • Different Integration Levels: IP cores form a chip, chips form a device, devices form systems. • Being faulty is normal.
Need for a Global Time A valuable lesson from the August 14 blackout is the importance of having time-synchronized system data recorders. The Task Force’s investigators labored over thousands of data items to determine the sequence of events, much like putting together small pieces of a very large puzzle. That process would have been significantly faster and easier if there had been wider use of synchronized data recording devices. From Final Report on US-Canadian August 14, 2003 Power Blackout, p. 164.
Need for Timeliness On November 21, 2012 it was announced that the delivery of the eight trains would not take place as originally planned ahead of the major train schedule change of December 9, 2012. According to [Spi 12], the reason for this delay is the complexity of the electronics. One problem is the time delay of one second between the execution of the control command by the driver and the activation of the brakes, which extends the braking distance by up to 70 meters. [Spi 12] Der Spiegel online [accessed 19 December 2012].
RT Information has limited temporal validity An appropriate model of RT communication must consider timeliness as important as correctness.
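One way to make limited temporal validity concrete is to attach the observation instant and a validity interval to every real-time value. This is a minimal sketch; the `RTImage` type, its field names, and the tick values are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RTImage:
    """A real-time observation with a limited temporal validity (hypothetical type)."""
    name: str
    value: float
    observed_at: int  # global time of the observation, in ticks
    valid_for: int    # length of the validity interval, in ticks

    def is_valid(self, now):
        """True while 'now' lies inside the validity interval."""
        return self.observed_at <= now < self.observed_at + self.valid_for

speed = RTImage("wheel_speed", 27.5, observed_at=100, valid_for=5)
usable_now = speed.is_valid(104)    # still inside the validity interval
usable_later = speed.is_valid(105)  # too old, must not be used for control
```

A consumer that checks `is_valid` before acting treats a late value exactly like a wrong value, which is the point of the slide.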
Communication: The Choices • Competition versus Cooperation • Event Triggered versus Time Triggered • Probabilistic versus Deterministic
Event-Triggered Communication • A new message is sent, whenever a significant event occurs. • Delay in the queues is unpredictable. • What happens under Peak Load?
The Alternative: Time Triggered Messages Event Triggered: • Spontaneous • Uncoordinated • Best Effort • Thrashing under Peak Load Time Triggered: • Time Schedule • Coordinated • Planned • No Thrashing
Timeliness: Event-Triggered vs Time-Triggered Probability Density ET: dmax unbounded TT: dmax = Precision of Clock Synchronization Timeout for Error Detection TT ET ? Real Time
Time-Awareness Characteristic for the TTA is the availability of a global physical time of known precision at every component of the architecture. This global time is used to • Synchronize actions in the physical world and in the cyber world • Limit the validity of real-time information • Control access to shared resources • Strengthen security protocols • Detect failures of fail-silent components
The Central Research Question: The development of a clock synchronization algorithm and an associated communication protocol with the following properties: • Fault tolerance: No single point of failure. • Vigilance: Any violation of an algorithmic assumption must be detected immediately. • Efficiency: Small data overhead and few extra messages. • Correctness: Simple, so that its correctness can be established by formal methods. • Precise Interface Specification to the Components.
Cyclic Representation of Time [Figure: the real-time axis with instants 1 to 6 and intervals A to E of one control cycle, drawn both linearly and as a circle that returns to the ground state.] Legend: Start of Cycle; Observation of Sensor Input; Start of Transmission of Sensor Data; Transmission of Input Data; Start of Processing of Control Algorithm; Termination of Processing; Transmission of Output Data; Start of Output to Actuators; Output Operation at the Actuator; Termination of Output Operation.
Models of Time in a CPS Dense Physics B A Real Time Discrete Central Computer Real Time Sparse Distributed Computer Real Time
Models of Time in a CPS Dense Physics B A Real Time Discrete Central Computer Real Time Sparse Distributed Computer Precision of the Global Time Real Time
Clock Synchronization Condition
One Tick Difference: What Does it Mean? Because of the accumulation of the synchronization error and the digitalization error, it is not possible to reconstruct the temporal order of two events from the knowledge that the global timestamps differ by one.
Reasonableness Condition The global time t is called reasonable if all local implementations of the global time satisfy the following reasonableness condition for the global granularity g of a macrotick: g > P, where P is the precision of the clock synchronization. This reasonableness condition ensures that the synchronization error is bounded to less than one macrogranule, i.e., the duration between two macroticks.
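The condition can be checked mechanically once the precision of the clock ensemble is known. The sketch below is a simplification: it estimates the precision from a single snapshot of local clock readings, whereas the precision is properly the largest such difference over the whole interval of interest. Function names and the nanosecond values are assumptions for illustration.

```python
def precision(clock_readings):
    """Largest difference between any two local clock readings taken at the same instant."""
    return max(clock_readings) - min(clock_readings)

def is_reasonable(granularity, clock_readings):
    """Reasonableness condition: the macrotick granularity must exceed the precision."""
    return granularity > precision(clock_readings)

readings_ns = [1_000_000, 1_000_400, 999_800]  # three local clocks, in nanoseconds
snapshot_precision = precision(readings_ns)    # 600 ns
ok_coarse = is_reasonable(1_000, readings_ns)  # g = 1000 ns is larger than 600 ns
ok_fine = is_reasonable(500, readings_ns)      # g = 500 ns is too fine
```

Choosing the macrotick finer than the precision would allow two correct clocks to disagree by more than one tick about the same event.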
Fundamental Limits to Time Measurement Given a distributed system with a reasonable global time base with granularity g, the following fundamental limits to time measurement must be observed: • If a single event is observed by two nodes, there is always the possibility that the timestamps will differ by one tick. • Let d_obs be the observed duration of an interval. Then the true duration d_true satisfies (d_obs - 2g) < d_true < (d_obs + 2g). • The temporal order of events can only be recovered if the observed time difference d_obs ≥ 2g. • The temporal order of events can always be recovered if the event set is 0/3g precedent.
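The duration bound and the ordering rule can be written down directly. This is a minimal sketch with timestamps counted in macroticks (so g equals one tick in `temporal_order`); the function names are chosen for this example.

```python
def true_duration_bounds(d_obs, g):
    """Open interval (low, high) that contains the true duration."""
    return d_obs - 2 * g, d_obs + 2 * g

def temporal_order(ts_a, ts_b):
    """Order of two events from their global timestamps (in ticks), or None if not recoverable."""
    diff = ts_b - ts_a
    if diff >= 2:
        return "a before b"
    if diff <= -2:
        return "b before a"
    return None  # timestamps differ by at most one tick

bounds = true_duration_bounds(10, 1)  # observed 10 ticks
clear_case = temporal_order(5, 7)     # two ticks apart
unclear_case = temporal_order(5, 6)   # one tick apart
```

The `None` result for a one-tick difference mirrors the earlier slide: a difference of one tick says nothing about the true order.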
Sparse Time Model in the TTA Whenever we use the term time we mean physical time as defined by the international standard of time TAI. If the occurrence of events is restricted to some active intervals on the timeline of duration ε, with an interval of silence of duration Δ between any two active intervals, then we call the time base ε/Δ-sparse, or sparse for short, and events that occur during the active intervals sparse events.
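A sparse time base is easy to model as a repeating pattern of an active interval followed by silence. In the sketch below, `eps` and `delta` are names chosen here for the active and silent durations, and the timeline is assumed to start at tick 0 with an active interval.

```python
def in_active_interval(t, eps, delta):
    """True if instant t falls into an active interval of the sparse time base."""
    return (t % (eps + delta)) < eps

def active_interval_index(t, eps, delta):
    """Index of the active/silent period that contains instant t."""
    return t // (eps + delta)

# Active for 1 tick, silent for 3 ticks: active instants are 0, 4, 8, ...
sparse_event = in_active_interval(4, eps=1, delta=3)
non_sparse_event = in_active_interval(2, eps=1, delta=3)
period_of_t9 = active_interval_index(9, eps=1, delta=3)
```

Events under the control of the system are only generated when `in_active_interval` holds; events from outside have to be mapped onto an active interval by agreement.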
The Intervals ε and Δ in a Sparse Timebase • Depend on the precision P of the clock synchronization. • In reality, the precision is always larger than zero; in a distributed system, clocks can never be fully synchronized. • The precision depends on the stability of the oscillator, the length of the resynchronization interval, and the accuracy of interval measurement. • On a discrete time base, there is always the possibility that the same external event will be observed with a difference of one tick.
Complexity Management in the TTA The architectural style of the TTA deploys the following simplification strategies to reduce the complexity of a design: • Partitioning: The partitioning of a system into nearly autonomous subsystems (components) -- Physical Structure • Abstraction: The introduction of abstraction layers whereby only the relevant properties of a lower layer are exposed to the upper layer -- Structure and Behavior • Segmentation: The temporal decomposition of complex behavior into small parts that can be processed sequentially ("step-by-step") -- determinism helps! • Recursion: Use of the same general concepts and mechanisms at different levels of abstraction
TTA Architecture Services Overview Application Specific Services, Including Middleware Application MW Domain Specific Services Optional Services Core Services DSC e. g. , AUTOSAR DS e. g. , Message Transport Clock Synchronization Core Reconfiguration Different Robustness Implementation Choices Security
What is a TTA Component? • Hardware/software unit that accepts input messages, provides a useful service, maintains internal state, and produces after some elapsed time output messages containing the results. It is aware of the progression of physical time. • Unit of abstraction, the behavior of which is captured in a high-level concept that is used to capture the services of a subsystem. • Fault-Containment Unit (FCU) that maintains the abstraction in case of fault occurrence and contains the immediate effects of a fault (a fault can propagate from a faulty component to a component that has not been affected by the fault only by erroneous messages). • Unit of restart, replication and reconfiguration in order to enable the implementation of robustness and fault tolerance.
Model Driven Design of a Component Domain Specific Application Model (e. g. , expressed in UML) Platform Independent Model (PIM) expressed in a High-Level Language (e. g. , System C). Platform Independent Model(PIM) focuses on functionality and time Platform Specific Model (PSM) (nonfunctional properties)
Performance Trends--Power [Figure: Gops/Watt on a logarithmic axis from 0.01 to 1000 versus year from 1990 to 2010, with curves for ASIC, FPGA, CPU, and Cell. Ref: Lauwereins, Imec, MSOP 2006]
The Interfaces of a TTA Component • Local Interface to the environment (unspecified): connection to local sensors, actuators, man-machine interface, other systems. • Technology Dependent Interface (TDI): view inside the hardware/software unit for the (remote) maintenance expert. • Technology Independent Interface (TII): control of the unit (configuration, restart, reset, power level). • Linking Interface (LIF): provides the service to the user; relevant for the integration of components into a cluster of components.
Software within a Component [Figure: layers within a component: Application SW (handles the LIF), GEM (handles the TII), Mini RTOS (handles the TDI and the local interfaces), Hardware.] • Downloading of the component software, the Job, into the component hardware via the TII interface. • Communicate with other system components to establish ports and dynamic links, to reintegrate components after a transient fault, etc. • Global time management • Provision of API services (e.g., send and receive of a message) • Scheduling of the tasks within a component • Service of the TII interface to reset, start, and terminate the operation of a component • Provision of Generic Middleware Services (GEM)
Operating System in the TTA In the TTA, the functions of a monolithic operating system are partitioned into • a Mini-RTOS in each node, including generic middleware, and • an open-ended set of autonomous OS components that provide operating system services such as: Device Controllers (Gateway Components); Integrated Resource Management; Security; Diagnosis and Robustness; Shared Memory Component.
Look at the Inside of a Component Local Interfaces TII LIF The TII (technology independent interface) gives access to the services of the generic middleware (GEM).
Linking Interface Specification The Linking Interface is a message-based interface. Its specification consists of three parts: • Transport Specification: contains the information needed to transport a message from the sender to the receiver. Temporal properties are part of the transport specification. • Operational Specification: specifies the syntactic structure of the bit stream contained in the message and establishes the message variable names that point to the concepts at the meta level. • Meta-Level Specification: assigns meaning to the message variable names established by the operational specification.
Example of a Cluster of Components I/O Driver Interface Assistant System LIF Gateway Host Body Computer LIF LIF LIF Brake Manager Engine Control Steering Manager Vehicle Comm I/O I/O (( )) Vehicle to Vehicle Communication
Example of a Cluster--Recursion I/O Driver Interface Assistant System LIF Gateway Host Body Computer LIF LIF LIF Brake Manager Engine Control Steering Manager Vehicle Comm LIF I/O I/O (( )) Vehicle to Vehicle Communication
Gateway Components Gateway components can connect two clusters that may be based on different architectural styles. [Figure: a gateway component between a Green LIF carrying T.Luft = 0 and a Red LIF carrying T.Air = 32.] • The representation of the information in the two clusters will be different, but the semantic content of the message variables must be the same.
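The gateway's job in the figure can be sketched as a pure translation function. The slide does not state the units; the sketch assumes that the Green cluster reports the air temperature in degrees Celsius under the name T.Luft and the Red cluster expects degrees Fahrenheit under the name T.Air, which is consistent with the values 0 and 32. The dictionary-based message format is also an assumption.

```python
def green_to_red(message):
    """Translate a Green-cluster message into the Red-cluster representation.

    Assumes T.Luft is in degrees Celsius and T.Air is in degrees Fahrenheit.
    """
    celsius = message["T.Luft"]
    return {"T.Air": celsius * 9 / 5 + 32}

red_message = green_to_red({"T.Luft": 0})
```

The name and the number both change, but the meaning (the freezing point of water) is preserved across the gateway.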
RT-Protocol is at the Center of an Architecture • The properties of the protocol determine timeliness, composability, error containment, etc. • Many of today's RT protocols have been designed bottom up with limited architectural vision (sometimes with the intention to lock a customer to a supplier). • Technology trends force a consolidation.
Unidirectional Deterministic Multi-cast Message • Uni-directionality is required to: decouple communication from computation; decouple the sender behavior from the receiver behavior. • Determinism is required to: establish timeliness; simplify the reasoning about the behavior (modus ponens); simplify testing (repeatable test cases); be able to implement active replication (TMR); support the certification. • Multi-cast is required to support: the independent observation of the component behavior; diagnosis; replication of state at multiple components; Triple Modular Redundancy.
Determinism of a Communication Channel The behavior of a communication channel is called deterministic if (as seen by an omniscient external observer): • A message is delivered before an a priori known instant (timeliness). • The receive order of the messages is the same as the send order. The send order among all messages is established by the temporal order of the send instants of the messages as observed by an omniscient observer. • If the send instants of n (n > 1) messages are the same, then an order of the n messages will be established in an a priori known manner. Two independent deterministic channels will always deliver messages in the same order before an a priori known instant.
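The ordering part of this definition can be illustrated with a sort key: first the send instant, then an a priori known tie-break. In this sketch the tie-break is a fixed rank per sender, which is an assumption; the slides only require that the rule is known a priori. Timeliness is not modeled.

```python
from typing import NamedTuple

class Message(NamedTuple):
    send_instant: int  # global time of the send instant, in ticks
    sender: str
    payload: str

# A priori known tie-break for simultaneous sends (assumed for this sketch).
SENDER_RANK = {"A": 0, "B": 1, "C": 2}

def delivery_order(messages):
    """Order by send instant; resolve equal send instants by the fixed sender rank."""
    return sorted(messages, key=lambda m: (m.send_instant, SENDER_RANK[m.sender]))

# The same three messages enter two independent channels in different arrival orders.
red = delivery_order([Message(5, "B", "b1"), Message(5, "A", "a1"), Message(3, "C", "c1")])
green = delivery_order([Message(3, "C", "c1"), Message(5, "A", "a1"), Message(5, "B", "b1")])
```

Because both channels apply the same rule to the same send instants, `red` and `green` are identical lists, which is what replicated receivers rely on.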
Determinism: Timeliness and Order Timeliness Consistent Temporal Order Probability of Message Arrival 1 TT Tom ET Tom Ann Tom Temporal Delay of ET Time Switch Ann Switch
Temporal Order is Obvious A B Real Time Red Channel Green Channel
Determinism: Simultaneity--Who Wins? [Figure: messages A and B sent at the same instant on the real-time axis over a Red Channel and a Green Channel.] Determinism: If A wins on the red channel, then A must also win on the green channel.
Handling of Simultaneity: A Fundamental Problem In the hardware: meta-stability. In operating systems: mutual exclusion. In communication systems: order of messages. A two-step solution: (i) Provide a consistent view of simultaneity: distinguish between events that are in the sphere of control (SoC) of the system and events that are outside the SoC (difficult). (ii) Order simultaneous events according to some a priori established criterion (easy).
Events outside the SoC: Agreement Protocols If the occurrence of events is restricted to some active intervals of duration ε, with an interval of silence of duration Δ between any two active intervals, then we call the time base ε/Δ-sparse, or sparse for short. [Figure: a non-sparse event that falls into an interval of silence.]
Agreement Protocol to Generate Sparse Events To arrive at a consistent view of the temporal order of nonsparse events within a distributed computer system (which does not necessarily reflect the temporal order of event occurrence), the nodes must execute an agreement protocol. (i) exchange information about the observations among all nodes, such that all nodes have the same data set. (ii) every node executes the same algorithm on this data set to arrive at a consistent value and at a sparse interval of the observation.
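The two steps can be sketched for the fault-free case. Everything specific here is an assumption made for illustration: the exchange is modeled as every node receiving the full sorted data set, the common algorithm takes the median value, and the agreed instant is the start of the first active interval after all observations.

```python
import statistics

def exchange(local_observations):
    """Step (i): after a fault-free exchange every node holds the same data set."""
    full_set = sorted(local_observations.values())
    return {node: list(full_set) for node in local_observations}

def agree(data_set, eps, delta):
    """Step (ii): a deterministic function of the common data set."""
    timestamps = [t for t, _ in data_set]
    values = [v for _, v in data_set]
    period = eps + delta
    # Map the event onto the first active interval that starts after all observations.
    sparse_instant = (max(timestamps) // period + 1) * period
    return statistics.median(values), sparse_instant

# (local timestamp in ticks, observed value) per node
observations = {"n1": (17, 20.1), "n2": (18, 20.3), "n3": (17, 19.9)}
views = exchange(observations)
results = {node: agree(view, eps=1, delta=3) for node, view in views.items()}
```

All nodes end up with the same value and the same sparse instant, at the price of one extra communication round and some delay.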
An Impossibility Result It is impossible to represent the temporal properties of the dynamic analog physical world with true fidelity in digital cyberspace. • The conflict between fidelity and consistency can be reduced, but can never be fully resolved. • The better the precision of the clock synchronization, the smaller the error introduced by digitalization and synchronization.
Core: Time-Triggered Communication • In a time-triggered communication system, the sender and receiver(s) agree a priori on a cyclic, time-controlled, conflict-free communication schedule for the sending of time-triggered messages. • In every period, a message is sent at exactly the same phase. • The control and data flow is unidirectional. • Error detection is in the sphere of control of the receiver, based on its a priori knowledge of the arrival instants of messages.
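A cyclic schedule and receiver-side error detection can be sketched in a few lines. The cycle length, phases, and sender names below are hypothetical, and transmission delay is ignored so that the arrival instant equals the scheduled send instant.

```python
CYCLE = 10  # cycle length in ticks (assumed)

# Phase within the cycle -> sender. One sender per phase keeps the schedule conflict-free.
SCHEDULE = {0: "sensor", 3: "controller", 6: "actuator"}

def expected_sender(t):
    """Sender that is scheduled at global instant t, or None if the slot is empty."""
    return SCHEDULE.get(t % CYCLE)

def missing_messages(arrivals, cycles):
    """Receiver-side error detection: scheduled (instant, sender) pairs that did not arrive."""
    arrived = set(arrivals)
    missing = []
    for cycle in range(cycles):
        for phase, sender in sorted(SCHEDULE.items()):
            instant = cycle * CYCLE + phase
            if (instant, sender) not in arrived:
                missing.append((instant, sender))
    return missing

arrivals = [(0, "sensor"), (3, "controller"), (6, "actuator"), (10, "sensor"), (16, "actuator")]
detected = missing_messages(arrivals, cycles=2)
```

The receiver needs no acknowledgment traffic: the absence of the controller message at instant 13 is detected purely from the schedule.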
Fault Containment vs. Error Containment We do not need an error detector in the value domain if we assume fail-silence. Error Propagation Error Detector must be in a separate FCU
Temporal Error Containment by the CS It is impossible to maintain the communication among the correct components of an RT cluster if the temporal errors caused by a faulty component are not contained. Error containment of an arbitrary temporal node failure requires that the shared communication system is a self-contained FCU that has temporal information about the allowed behavior of the nodes: it must contain application-specific state. [Figure: a babbling idiot node outside the temporal error containment boundary of the shared communication system.] High priority message in a CAN system?
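The application-specific state mentioned above can be as small as the slot assignment. The following sketch models a communication system that forwards a transmission only if the sender owns the current slot; slot length, owners, and the babbling pattern are hypothetical.

```python
SLOT_LENGTH = 2                # ticks per slot (assumed)
SLOT_OWNERS = ["A", "B", "C"]  # cyclic slot assignment (assumed)

def slot_owner(t):
    """Node that is allowed to send at global instant t."""
    return SLOT_OWNERS[(t // SLOT_LENGTH) % len(SLOT_OWNERS)]

def guardian(attempts):
    """Forward only the transmissions whose sender owns the current slot."""
    return [(t, sender) for t, sender in attempts if slot_owner(t) == sender]

# Node B babbles at every tick; A and C follow the schedule.
attempts = [(0, "A"), (0, "B"), (1, "B"), (2, "B"), (3, "B"), (4, "B"), (4, "C"), (5, "B")]
forwarded = guardian(attempts)
```

The babbling node B can only corrupt its own slots, so A and C keep communicating; in a priority-based system such as CAN, a babbling node with a high-priority identifier could instead monopolize the bus.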
TTP and TT-Ethernet TTP (Time-Triggered Protocol) and TTE (Time-Triggered Ethernet) have been developed to link devices of the TTA. • TTP is very data efficient and operates up to a bandwidth of 20 Mbit/s. • TTE is fully compatible with standard Ethernet and operates up to 1 Gbit/s. It has been selected as the standard protocol for NASA. • TTP and TTE have been standardized by the SAE. • Both protocols are time-triggered deterministic protocols that provide temporal error containment.
The Swiss-Cheese Model [Figure: a subsystem failure has to pass through multiple layers of defenses (Normal Operation, On-Chip TMR, Off-Chip TMR, NGU Strategy) before it becomes a catastrophic system event.] From Reason, J., Managing the Risks of Organizational Accidents, 1997.
Levels of Fault Mitigation in the TTA I. Normal Operation II. Swift Component Recovery after a transient fault III. On-Chip TMR: to handle transient and permanent faults within a chip IV. Off-Chip TMR: to handle a transient and permanent fault of a total chip. V. NGU (Never-Give-Up) Strategy: to handle multiple correlated transient faults.
Fault Containment Unit (FCU) A Fault-Containment Unit (FCU) is an independent subsystem with well-defined interfaces such that the immediate consequences of a single fault (e.g., hardware, software) are contained within this subsystem. Examples: • A fail-silent component • A microprocessor, including the software, with a well-defined message interface.
Fault-Tolerant Unit (FTU) A fault-tolerant unit (FTU) is a set of actively redundant FCUs (components) that provide a fault-tolerant service to its environment: • FTUs have to receive identical input messages in the same order. • FTUs have to operate in replica determinism. • The output messages of FTUs should be idempotent. • As long as a defined subset of the components of the FTU is operational, the FTU is considered operational. FTUs provide the continuous service by fault masking.
Fault Masking: How Many FCUs in an FTU? A B C assumption no assumption about FCU fail-silent FCUs Synchronized FCUs (arbitrary) k+1 3 k + 1 2 k + 1 What is the assumption coverage in cases A and B?
What is Needed to Implement TMR? What architectural services are needed to implement Triple Modular Redundancy (TMR) at the architecture level? • Provision of an Independent Fault-Containment Region for each one of the replicated components • Synchronization Infrastructure for the components • Predictable Multicast Communication • Replicated Communication Channels • Support for Voting • Replica Deterministic (which includes timely) Operation • Identical state in the distributed components
Replica Determinism: Airplane on Takeoff Consider an airplane that is taking off from a runway with a flight control system consisting of three independent channels without a global time. Consider the system at the critical instant before takeoff: Channel 1 Channel 2 Take off Accelerate Engine Abort Stop Engine
The Critical Role of Time Speed Timeout Channel 1 Critical Takeoff Speed Timeout Channel 2 Real Time
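The divergence shown in the figure can be reproduced with two timeout values. The numbers are invented for illustration, and the second half of the sketch assumes that, with a global time, all channels use the same timeout instant and the same agreed timestamp for the event.

```python
def decide(speed_reached_at_ms, timeout_at_ms):
    """Take off if the critical speed is reached no later than the timeout, otherwise abort."""
    return "take off" if speed_reached_at_ms <= timeout_at_ms else "abort"

speed_reached_ms = 30_000  # instant at which the critical takeoff speed is reached

# Without a global time, each channel's local timeout expires at a slightly different instant.
local_timeouts_ms = {"channel 1": 30_020, "channel 2": 29_980}
without_global_time = {ch: decide(speed_reached_ms, t) for ch, t in local_timeouts_ms.items()}

# With a global time, every channel evaluates the same timeout instant.
global_timeout_ms = 30_000
with_global_time = {ch: decide(speed_reached_ms, global_timeout_ms) for ch in local_timeouts_ms}
```

Both channels are correct in the first case, yet they disagree; this is why replica determinism, and not only fault-free operation, is required before any voting can work.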
Replica Determinism: Airplane on Takeoff Consider an airplane that is taking off from a runway with a flight control system consisting of three independent channels. Consider the system at the critical instant before takeoff: Channel 1 Channel 2 Channel 3 Take off Accelerate Engine Abort Stop Engine Take off Stop Engine (Fault)
Replica Determinism: Airplane on Takeoff Consider an airplane that is taking off from a runway with a flight control system consisting of three independent channels. Consider the system at the critical instant before takeoff: Channel 1 Channel 2 Channel 3 Majority Take off Accelerate Engine Abort Stop Engine Take off Stop Engine (Fault)
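Triple Modular Redundancy (TMR) is the generally accepted technique for the mitigation of component failures at the system level. [Figure: a simplex chain of components A and B with voters, and its TMR version in which A and B are each replicated three times (A/1, A/2, A/3 and B/1, B/2, B/3), with a voter at the input of every replica.]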
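The voter in a TMR arrangement can be sketched as an exact-match majority function. This is a minimal illustration that assumes replica-deterministic outputs, so that correct replicas produce bit-identical values; the function name is chosen for this example.

```python
from collections import Counter

def vote(replica_outputs):
    """Return the majority value, or None when no value has a strict majority."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    return value if count > len(replica_outputs) // 2 else None

masked = vote([42, 42, 7])  # one faulty replica is outvoted
takeoff = vote(["take off", "abort", "take off"])
no_majority = vote([1, 2, 3])
```

With three replicas, a single arbitrary value error is masked; if the replicas are not replica deterministic, even fault-free operation can end in the `None` case.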
On-Chip TMR (FP7 GENESYS)
Off-Chip TMR Red DAS Voting Actuator TT Ethernet Green DAS TT Ethernet Switch Blue Switch Brown
Example: TMR Configuration Voting Actuator TT Ethernet Switch Blue Voting Actuator Switch Red Standard Ethernet TT Ethernet
Conclusion • The TTA provides a framework for the component-based design of dependable distributed real-time systems. • The framework can be adapted to different application domains by a hierarchical composition of services. • The fault-tolerance mechanisms of the TTA mask transient and permanent failures. • The conceptual simplicity of the TTA supports the implementation of fast recovery actions in cyber-physical systems.
More Information Background information can be found in the second revised edition of my book Real-Time Systems: Design Principles for Distributed Embedded Applications, published by Springer Verlag on April 27, 2011.
TTTech at a Glance Today • TTTech Computertechnik AG, founded in 1998 • Core business: reliable computer systems based on time-triggered technology (TTP, TTEthernet) • TTTech owns about 140 patents in the field of TT technology. • Lighthouse customers: Audi (A8, A6); Boeing (787 Dreamliner), Embraer; NASA (Orion Space Program); Vestas (wind turbines) • About 350 employees and a turnover of about 50 million € • Subsidiaries in DE, US, JP, CZ, RO, IT • Cumulative R&D budget: more than 100 million € over the years.
Anytime Algorithms Help [Figure: on the real-time axis, between the start of processing of the algorithm and the time-triggered deadline, a core segment delivers satisficing results and an optional segment iteratively improves the quality of the results.]
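The core/optional split can be illustrated with a square-root computation. This is a minimal sketch: the deadline is modeled as a fixed number of refinement steps rather than a real clock, and Newton's method is simply a convenient example of an iteratively improving algorithm, not something taken from the slides.

```python
def anytime_sqrt(x, optional_steps):
    """Core segment: a coarse estimate. Optional segment: refinements until the deadline."""
    estimate = x / 2 if x > 4 else 1.0    # core segment: a usable first result
    for _ in range(optional_steps):        # optional segment, cut off at the deadline
        estimate = 0.5 * (estimate + x / estimate)  # one Newton refinement
    return estimate

coarse = anytime_sqrt(2.0, optional_steps=0)   # only the core segment fits before the deadline
better = anytime_sqrt(2.0, optional_steps=2)
refined = anytime_sqrt(2.0, optional_steps=5)
```

Whatever the deadline allows, a result is available when it arrives; more time before the deadline only improves its quality.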