- Number of slides: 39
RTES/CMS Potential Collaboration
RTES Group
7 March 2005
(NSF ITR grant ACI-0121658)
Vanderbilt University
University of Illinois, Urbana-Champaign
University of Pittsburgh
Syracuse University
Fermilab

Outline
• RTES Overview
  - Goals, Team, Deliverables
• Tool Approach
  - Modeling, ARMORs, VLA
  - Demo Description
• Potential Collaborations
  - System Configuration
  - Run-Control
  - Fault Mitigation
  - GUI
April 28, 2005

RTES Team
• The Real Time Embedded System Group
  - A collaboration of five institutions:
    - University of Illinois
    - University of Pittsburgh
    - Syracuse University
    - Vanderbilt University (PI)
    - Fermilab
  - Physicists and Computer Scientists/Electrical Engineers with expertise in:
    - High-performance, real-time system software and hardware
    - Reliability and fault tolerance
    - System specification, generation, and modeling tools
NSF ITR grant ACI-0121658

RTES Goals
• High availability
  - Fault handling infrastructure capable of:
    - Accurately identifying problems (where, what, and why)
    - Compensating for problems (shifting the load, changing thresholds)
    - Automated recovery procedures (restart / reconfiguration)
    - Accurate accounting
  - Extensibility (capturing new detection/recovery procedures)
  - Policy-driven monitoring and control
• Dynamic reconfiguration
  - Adjust to potentially changing resources

RTES Goals (continued)
• Faults must be detected/corrected ASAP
  - Semi-autonomously
    - With as little human intervention as possible
  - Distributed and hierarchical monitoring and control
• Life-cycle maintainability and evolvability
  - To deal with new algorithms, new hardware, and new versions of the OS
• User-defined actions
  - Customized to application/users

The RTES Solution (for BTeV)
[Figure: hierarchical fault management. Design side: Modeling, Synthesis, Analysis (Performance, Diagnosability, Reliability), Design and Analysis, with Fault, Algorithms, and Behavior models. Runtime side: Experiment Control Interface, Global Fault Manager, Global Operations Manager, Region Fault Mgr, Region Operations Mgr, Logical Control Network, Logical Data Net, L2/3, L1. Arrows: Synthesis, Runtime, Feedback, Resource, Reconfigure. Axis: Soft real time / High Level down to Hard real time / Low Level.]

RTES Concepts
• A hierarchical fault management system and toolkit:
  - Model Integrated Computing
    - GME (Generic Modeling Environment) system modeling tools
  - ARMORs (Adaptive, Reconfigurable, and Mobile Objects for Reliability)
    - Robust framework for detection and reaction to faults in processes
  - VLAs (Very Lightweight Agents for limited resource environments)
    - Sensors/actuators to monitor/mitigate at every level

Configuration through Modeling
• Multi-aspect tool, separate views of:
  - Hardware – components and physical connectivity
  - Executables – configuration and logical connectivity
  - Fault handling behavior using hierarchical state machines
• Model interpreters can generate the system
  - At the code fragment level (for fault handling)
  - Download scripts and configurations
• Modeling "languages" are application specific
  - Shapes, properties, associations, constraints
  - Appropriate for application/context
    - System model
    - Messaging
    - Fault mitigation
    - GUI, etc.

Modeling Environment: GME*
• Fault handling
• Process dataflow
• HW configuration
* GME is an open-source, meta-configurable, multi-aspect graphical modeling tool

System Integration Modeling Language – SIML
• Model Component Hierarchy and Interactions
  - Loosely specified model of computation
• Model information relevant for system configuration
• Links to other narrowly focused modeling languages
  - Provides overall picture and access to models in other languages
[Figure: Overall Deployment View]

System Architecture expressed with SIML
• RunControl Manager
• Router Information
• How many regions?
• How many worker nodes inside the region?
• Node identification information

SIML - Generation
Generator
• Configuration files
• Build Scripts
• Deployment Scripts
• Router Configurations

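To make the generation step concrete, here is a minimal sketch of how a generator could expand the two model parameters named on the SIML architecture slide (number of regions, number of worker nodes per region) into node identifiers. The class name `DeploymentSketch`, the method `generate`, and the identifier format are all hypothetical; the actual SIML interpreter output is not shown in the slides.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical generator sketch; not the actual SIML interpreter. */
final class DeploymentSketch {

    /**
     * Expands a (regions, workersPerRegion) model into one identifier per node:
     * a global manager, then each region manager followed by its workers.
     * The naming scheme is an assumption made for illustration.
     */
    static List<String> generate(int regions, int workersPerRegion) {
        if (regions < 0 || workersPerRegion < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        List<String> nodes = new ArrayList<>();
        nodes.add("global-manager");
        for (int r = 0; r < regions; r++) {
            nodes.add("region-" + r + "-manager");
            for (int w = 0; w < workersPerRegion; w++) {
                nodes.add("region-" + r + "-worker-" + w);
            }
        }
        return nodes;
    }

    public static void main(String[] args) {
        // Print the node list for a small two-region system.
        for (String node : generate(2, 3)) {
            System.out.println(node);
        }
    }
}
```

A real generator would emit configuration files, build scripts, and deployment scripts from the same traversal; the sketch only shows the model-to-artifact expansion.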
Data Type Modeling Language – DTML
• Modeling of Data Types and Structures
• Auto-generate marshalling/demarshalling interfaces for communication

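As an illustration of what auto-generated marshalling/demarshalling code could look like, the sketch below hand-writes a marshaller for one invented message type. The type `NodeStatusSketch`, its three fields, and the wire layout (big-endian, length-prefixed UTF-8 string) are assumptions for illustration only; the slides do not show the code DTML actually generates.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Hypothetical message type; fields and wire layout are invented for illustration. */
final class NodeStatusSketch {
    final int nodeId;
    final float load;
    final String label;

    NodeStatusSketch(int nodeId, float load, String label) {
        this.nodeId = nodeId;
        this.load = load;
        this.label = label;
    }

    /** Marshals as: int32 nodeId, float32 load, int32 label length, UTF-8 label bytes. */
    byte[] marshal() {
        byte[] text = label.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(12 + text.length).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(nodeId).putFloat(load).putInt(text.length).put(text);
        return buf.array();
    }

    /** Demarshals the layout written by marshal(), rejecting an inconsistent length field. */
    static NodeStatusSketch demarshal(byte[] wire) {
        ByteBuffer buf = ByteBuffer.wrap(wire).order(ByteOrder.BIG_ENDIAN);
        int nodeId = buf.getInt();
        float load = buf.getFloat();
        int length = buf.getInt();
        if (length < 0 || length > buf.remaining()) {
            throw new IllegalArgumentException("bad label length: " + length);
        }
        byte[] text = new byte[length];
        buf.get(text);
        return new NodeStatusSketch(nodeId, load, new String(text, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        NodeStatusSketch copy = demarshal(new NodeStatusSketch(7, 0.5f, "worker").marshal());
        System.out.println(copy.nodeId + " " + copy.load + " " + copy.label);
    }
}
```

Generating this pair from a model keeps the sender and receiver layouts in agreement by construction, which is the benefit the slide points at.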
Fault Mitigation Modeling Language - FMML
• Specification of Fault Mitigation Behavior using Hierarchical Finite State Machines (A)
• Configuration and instantiation of FM behaviors as ARMORs (B)
• Specification of FM Triggering Communication (C)
[Figure: FMML model with regions labeled A, B, and C]

FMML Generation
[Figure: FMML Model – Behavior Aspect, Translator, ARMOR Microkernel, Fault Tolerant Custom Element, Communication Custom Element]

Generated state-machine fragment:
    switch (cur_state) {
    case NOMINAL:
        if (time < 100) { next_state = FAULT; }
        break;
    case FAULT:
        if () { next_state = NOMINAL; }
        break;
    }

Generated callback fragment (text clipped on the slide):
    class armorcallback 0: public Callback { public: ack 0(Controls. Cection *cc, void *p) : Callback. Fault. Inject. Tererbose>(cc, p) { }
    void invoke(Fault. Injecerbose* msg) { printf("Callback. Recievede dtml_rcver_Local. Armor_ct *Lo; mc_message_ct *pmc = new m_ct; mc_bundle_ct *bundlepmc->ple(); pmc->assign_name(); bundle=pmc->push_bundle(); mc); } };

• Model translator generates fault-tolerant strategies and communication flow strategy from FMML models
• Strategies are plugged into ARMOR infrastructure as ARMOR elements
• ARMOR infrastructure uses these custom elements to provide customized fault-tolerant protection to the application

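The NOMINAL/FAULT fragment on this slide can be read as a two-state machine. The sketch below restates it in Java as a pure transition function. Only the NOMINAL guard (`time < 100`) comes from the slide; the FAULT guard is empty there, so the `recovered` flag is an assumed stand-in, and the class name `FaultFsmSketch` is hypothetical.

```java
/** Hypothetical two-state fault-mitigation machine; names are illustrative only. */
final class FaultFsmSketch {
    enum State { NOMINAL, FAULT }

    /**
     * Computes the next state.
     * NOMINAL -> FAULT when time < 100 (guard taken from the slide).
     * FAULT -> NOMINAL when `recovered` is true (assumed guard; elided on the slide).
     */
    static State step(State current, long time, boolean recovered) {
        switch (current) {
            case NOMINAL:
                return (time < 100) ? State.FAULT : State.NOMINAL;
            case FAULT:
                return recovered ? State.NOMINAL : State.FAULT;
            default:
                throw new IllegalStateException("unknown state " + current);
        }
    }

    public static void main(String[] args) {
        State s = State.NOMINAL;
        s = step(s, 42, false);   // guard time < 100 holds, so the machine enters FAULT
        System.out.println(s);
        s = step(s, 42, true);    // assumed recovery signal returns it to NOMINAL
        System.out.println(s);
    }
}
```

Keeping the transition function free of side effects makes a generated strategy easy to test before it is plugged into a larger framework.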
User Interface Modeling Language
• Enables reconfiguration of user interfaces
• Structural and data flow codes generated from models
• User Interface produced by running the generated code
[Figure: Example User Interface Model]

User Interface Generation
[Figure: Generator]

RTES Demonstration at IEEE RTAS 05
• Used Tools and Models to Generate a Family of Demos
  - 4, 16, 32, and 64 Processor Systems
• Demonstrates Fault Mitigation in an L2/L3 Trigger Prototype for BTeV
• GUI: Matlab-based
  - GUI design specified by GME models (GUIML)
• Network/Messaging: Elvin publish/subscribe
  - Messages defined by GME models (DTML)
• RunControl (RC) state machines
  - Defined by GME models (SIML)
• Infrastructure: ARMORs
  - Custom Fault Mitigation elements defined by models (FMML)
• Application: L2/3 FilterApp, DataSource
  - Actual physics trigger code
  - File-reader supplies physics/simulation data to the FilterApp

Potential RTES contributions to CMS
• Graphical Modeling Tools for:
  - Specifying FunctionManager StateMachines with FMML
  - Specifying Communication Messages at a higher level of abstraction with DTML
    - Can synthesize serialization/deserialization code for the specific implementation technology such as SOAP
  - Designing GUIs independent of the implementation technology with GUIML
    - Can synthesize Java applet code for rendering and communication over SOAP
  - Designing System Configurations with SIML – a la DuckCAD
    - Can synthesize artifacts in addition to XML configuration files
• Fault Tolerance Approach and Concepts
  - Hierarchical Fault Mitigation via collaborating/coordinating FM Managers
  - Custom fault-mitigation behavior specification as hierarchical finite state machines
  - ARMORs and VLAs

Potential RTES/RCMS Mapping
[Figure: mapping of RTES modeling languages onto RCMS. Labels: SIML (Configures), DTML, FMML, GML, and others…]

Modeling Configurations with SIML
XDAQ ptMAZE example
Callouts: defining a ‘partition’ or ‘region’; host or application attributes

<?xml version='1.0'?>
<Partition>
  <Definitions>
    <ClassDef id="15">ptMAZE</ClassDef>
    <ClassDef id="11">RoundTrip</ClassDef>
  </Definitions>
  <Host id="0" url="http://host1:40000">
    <Address type="ptMAZE" port="56" boardId="0" service="maze_service_immediate" switch="MAZE_SWITCH_M3E128"/>
    <Application class="RoundTrip" targetAddr="auto" instance="0" network="ptMAZE">
      <DefaultParameters>
        <Parameter name="samples" type="unsigned long">1000000</Parameter>
        <Parameter name="startSize" type="unsigned long">0</Parameter>
        ...
      </DefaultParameters>
    </Application>
    <urlApplication>~/[…]/linux/x86/libRoundTrip.so</urlApplication>

Modeling Configurations with SIML
XDAQ ptMAZE example
Callouts: defining communications; protocol attributes; defining an application

    <Transport class="ptMAZE" targetAddr="auto" instance="0">
      <DefaultParameters>
        <Parameter name="pollingMode" type="bool">false</Parameter>
        <Parameter name="mtuSize" type="int">4096</Parameter>
        ...
      </DefaultParameters>
    </Transport>
    <urlTransport>~/[…]/linux/x86/libptMAZE.so</urlTransport>
  </Host>
  <Host id="1" url="http://host2:40000">
    <Address type="ptMAZE" port="57" boardId="0" service="maze_service_immediate" switch="MAZE_SWITCH_M3E128"/>
    <Application class="RoundTrip" targetAddr="auto" instance="1" network="ptMAZE">
    </Application>
    <urlApplication>~/[…]/linux/x86/libRoundTrip.so</urlApplication>

Function Manager StateMachine Artifacts
1. StateMachine.java
   • Set up the state machine.
2. States.java
   • Set of possible states.
3. Inputs.java
   • Set of possible triggers.
4. TransitionActions.java
   • During state transition.
5. TransitionFailedAction.java
   • Transition action failed.
6. StateChangedAction.java
   • When state has changed.
7. FailureAction.java
   • Transition failed.

StateMachine.java (as expressed in FMML models)

public class HelloStateMachine extends UserStateMachine {
    ...
    StateMachineDefinition fsmdef = new StateMachineDefinition();

    // Inputs (Commands)
    fsmdef.addInput(HelloInputs.GOTOHELLO);
    fsmdef.addInput(HelloInputs.GOTOINIT);

    // Initial state
    fsmdef.setInitialState(HelloStates.INITIAL);

    // States
    fsmdef.addState(HelloStates.INITIAL);
    fsmdef.addState(HelloStates.HELLO);

    fsmdef.addTransition(
        HelloInputs.GOTOHELLO,
        HelloStates.INITIAL,
        HelloStates.HELLO,
        new Callback(evbTransitionActions, helloAction));
}

States.java

public final class HelloStates {
    public static final State INITIAL = new State("Initial");
    public static final State HELLO = new State("Hello");
    public static final State ERROR = new State("Error");
}

Inputs.java

public class HelloInputs {
    public static final Input GOTOHELLO = new Input("GoToHello");
    public static final Input GOTOINIT = new Input("GoToInit");
}

TransitionActions.java

public class HelloTransitionActions extends UserTransitionActions {
    …
    public void helloAction() throws UserActionException {
        System.out.println("helloAction Executed");
        logger.info("helloAction Executed");
    }
}

TransitionFailedAction.java (this requires extensions to the modeling language)

public class HelloTransitionFailedActions extends UserTransitionFailedActions {
    …
    public void helloFailedAction() throws UserActionException {
        logger.info("Executing helloFailedAction");
        getUserStateMachine().setState(DaqkitStates.ERROR);
        logger.info("helloFailedAction Executed");
    }
}

SOAP Messages - Client-Server Example
Serializing:

SOAPName commandName = envelope.createName("increment");
SOAPName originator = envelope.createName("originator");
SOAPName targetAddr = envelope.createName("targetAddr");
SOAPBody body = envelope.getBody();
SOAPElement command = body.addBodyElement(commandName);
...

We can provide an abstract API call for creating messages, so that user code needs no knowledge of the underlying SOAP calls.

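One way to picture such an abstract API call is a single helper that takes the command name and its parameters and returns the finished envelope. The sketch below is a plain-string illustration under stated assumptions: the class `CommandMessageSketch`, the `build` signature, and the choice to carry `originator` and `targetAddr` as attributes of the command element are all hypothetical, since the slide elides how those names are attached. Only the SOAP 1.1 envelope namespace is standard.

```java
/** Hypothetical message-building facade; not part of XDAQ or RTES. */
final class CommandMessageSketch {
    private static final String SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/";

    /** Escapes the characters that are unsafe inside an XML attribute value. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    /**
     * Builds a SOAP envelope whose body holds one command element.
     * The attribute layout is an assumption made for this sketch.
     */
    static String build(String command, String originator, String targetAddr) {
        if (!command.matches("[A-Za-z_][A-Za-z0-9_.-]*")) {
            throw new IllegalArgumentException("not a valid element name: " + command);
        }
        return "<soap:Envelope xmlns:soap=\"" + SOAP_NS + "\">"
             + "<soap:Body>"
             + "<" + command
             + " originator=\"" + escape(originator) + "\""
             + " targetAddr=\"" + escape(targetAddr) + "\"/>"
             + "</soap:Body>"
             + "</soap:Envelope>";
    }

    public static void main(String[] args) {
        // The caller names the command and its addressing only; no SOAP types appear here.
        System.out.println(build("increment", "client-0", "1"));
    }
}
```

In a generated version, the helper's parameters would come from the message model, so a change to the message definition would not touch user code.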
Deserializing SOAP Messages
Reply to the message – deserializing

SOAPBody body = reply.getSOAPPart().getEnvelope().getBody();
if (body.hasFault()) {
    SOAPFault fault = body.getFault();
    string msg = "Server error: ";
    msg += fault.getFaultString();
    XDAQ_RAISE(xdaqException, msg);
} else {
    SOAPName counterTag("Counter", "");
    vector<SOAPElement> content = body.getChildElements();
    for (int i = 0; i < content.size(); i++) {
        vector<SOAPElement> c = content[i].getChildElements(counterTag);
        for (int j = 0; j < c.size(); j++) {
            // Check each matching child, not only the first one
            if (c[j].getElementName() == counterTag) {
                cout << "The server replied with counter: ";
                cout << c[j].getValue() << endl;
            }
        }
    }
}

I2O Message – RU Builder example
• DTML language allows both simple and composite types
• Floats, integers, signed, unsigned can be specified
• Corresponding marshall/demarshall code can be generated from models

Discussions
• Relevancy/Interest?
• Tuning the concepts
• Next Steps?
  - More documentation CMS/XDAQ
  - XDAQ Examples
    - GUI, SOAP messages, I2O Messages, State Machines, Data monitor, …
    - Full, larger-scale applications
• How can we contribute?
• Visit? ~June 4th
  - Goals, Preparations?

Backup Slides 7 March 2005
ARMOR: Adaptive Reconfigurable Mobile Objects of Reliability
• Execution ARMOR: oversees application process (e.g. the various Trigger Supervisor/Monitors)
• Heartbeat ARMOR: detects and recovers FTM failures
• Daemons: detect ARMOR crash and hang failures
• Fault Tolerant Manager (FTM): highest ranking manager in the system
ARMOR processes provide a hierarchy of error detection and recovery. ARMORs are protected through checkpointing and internal self-checking.
[Figure labels: Exec ARMOR, App Process, Daemon, network, Heartbeat ARMOR, Daemon, Fault Tolerant Manager]

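The crash and hang detection described on this slide is commonly built on heartbeat timeouts. The sketch below shows that general technique only: a monitor records the last heartbeat time per process and reports the ones that have been silent for longer than a timeout. It is an assumption that ARMOR works this way internally; the class `HeartbeatMonitorSketch` and its methods are hypothetical and are not the ARMOR API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical heartbeat monitor; illustrates timeout-based failure detection. */
final class HeartbeatMonitorSketch {
    private final long timeoutMillis;
    // TreeMap keeps process ids sorted so that reports are deterministic.
    private final Map<String, Long> lastBeat = new TreeMap<>();

    HeartbeatMonitorSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Records a heartbeat from the given process at the given time. */
    void beat(String processId, long nowMillis) {
        lastBeat.put(processId, nowMillis);
    }

    /** Returns the processes whose last heartbeat is older than the timeout. */
    List<String> overdue(long nowMillis) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastBeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                suspects.add(entry.getKey());
            }
        }
        return suspects;
    }

    public static void main(String[] args) {
        HeartbeatMonitorSketch monitor = new HeartbeatMonitorSketch(1000);
        monitor.beat("ftm", 0);
        monitor.beat("exec-armor", 900);
        // At t=1500 only "ftm" has been silent for more than 1000 ms.
        System.out.println(monitor.overdue(1500));
    }
}
```

Passing timestamps in explicitly, rather than reading the system clock, keeps the detection logic testable; a recovery action (restart, reconfiguration) would be triggered from the returned list.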
Very Lightweight Agents
• Minimal footprint
• Platform independence
  - Employable everywhere in the system!
• Monitors hardware and software
• Handles fault detection & communications with higher level entities
[Figure labels: Level 2/3 Farm Nodes (Linux): Hardware, VLA, Network API, OS Kernel (Linux), Physics Application; L2/L3 Manager Nodes (Linux)]

L2/3 Prototype Farm Setup
[Figure: prototype farm network diagram. Legend: 100BT ethernet, 1000BT ethernet]

The Demonstration System Architecture
[Figure: system architecture diagram. Labels: laptop, Iron, public, laptop, Matlab, Elvin, Ganglia, private, Boulder, Elvin, Global RC, ARMOR, Regional RC, ARMOR, Worker, Worker RC, VLA, ARMOR, FilterApp, …, DataSource (file reader)]

Matlab GUI
• Monitoring/display of node and region health and performance
• Command interface for starting/stopping system
• Debug interface for injecting faults