be42ec375d9f40eb62fd8fc70c89de77.ppt
- Number of slides: 38
FPGA Self-Repair using an Organic Embedded System Architecture Kening Zhang, Jaafar Alghazo and Ronald F. DeMara University of Central Florida 06 December 2007
Organic Computing (OC): biologically-inspired computing with “self-x” properties
Technical Objective: support long-lifetime missions with multiple failure occurrences
Research Focus: Reliability, Availability, Sustainability
OC Approach: addresses system controllability with increasing complexity
System Property and associated Self-x Characteristics:
• Composed of large collection of autonomous systems: Self-organization, Self-configuration, Self-optimization
• Autonomous systems own sensors and actuators: Self-healing, Self-protection, Self-explaining
• Communication networks among autonomous systems: Context-awareness, Self-synchronization
Example Relevance: How to achieve sustainable presence in NASA’s Moon, Mars & Beyond objective? Reconfigurable Hardware with Self-Healing based on SRAM FPGA platform
Sponsors: NASA (FPGA platform and Genetic Algorithm research); DARPA (OC approach and SOAR Longevity Platform)
Goal: Autonomous FPGA Refurbishment
increase availability without carrying pre-configured spares …
Comparison of Redundancy and Refurbishment:
• Overhead from Unutilized Spares (weight, size, power): Redundancy: increases with amount of spare capacity; Refurbishment: weakly related to recovery capacity
• Granularity of Fault Coverage (resolution where fault handled): Redundancy: restricted at design-time; Refurbishment: variable at recovery-time
• Fault-Resolution Latency (availability via downtime required to handle fault): Redundancy: based on time required to select spare resource; Refurbishment: based on time required to find suitable recovery
• Quality of Repair (likelihood and completeness): Redundancy: determined by adequacy of spares available (?); Refurbishment: affected by multiple characteristics (+ or -)
• Autonomous Operation (fix without outside intervention): Redundancy: yes; Refurbishment: yes
Fault-Handling Techniques for SRAM-based FPGAs Device Failure Characteristics Duration: Transient: SEU Permanent: SEL, Oxide Breakdown, Electron Migration, LPD Target: Approach: Device Processing Configuration Datapath BIST Scrubbing Evolutionary STARS Bitwise Comparison Majority Vote CED Vigander OC Supplementary Testbench Detection: Duplex Output Comparison (not addressed) Duplex/Triplex Output Comparison (not addressed) Autonomous Element (AE) unnecessary Autonomous Supervisor (AS) Population-based GA using Extrinsic Fitness Evaluation Evolutionary Algorithm using Intrinsic Fitness Evaluation Cartesian Intersection Worst-case Clock Period Dilation Diagnosis: Recovery: Processing Datapath TMR Methods Isolation: Device Configuration Reload Bitstream / Invert Bit Value Ignore Discrepancy Replicate in Spare Resource Fast Run-time Location Select Spare Resource
Autonomous System-on-a-Chip (ASoC) Architecture
Dual-layer ASoC proposed by Lipsa et al. [Lipsa 05]
• Functional Layer: Functional Elements (FEs), e.g. CPU, RAM, Network interface
• Autonomic Layer: Autonomic Elements (AEs), each with Monitor, Actuator and Communication interface; Autonomic Supervisor (AS)
UCF Approach for fault coverage of Functional Layer & Autonomic Layer
• achieved by assessing consensus among elements
1. first to realize failure detection
2. consensus provides an organic method for fitness evaluation of competing alternatives during evolution, providing a self-regulating approach to fault resolution
EHW Environments
• Evolvable Hardware (EHW) Environments enable experimental methods to research soft computing intelligent search techniques
• EHW operates by repetitive reprogramming of real-world physical devices using an iterative refinement process
Two modes of Evolvable Hardware:
• Extrinsic Evolution: Genetic Algorithm with simulation in the loop (software model); device “design-time” refinement
• Intrinsic Evolution: Genetic Algorithm with hardware in the loop (device); device “run-time” refinement
[Figure: evolution loop with decision “Done?” leading to “Build it”]
Application: Deep Space Satellite
• >100 FPGAs onboard
• hostile environment: radiation, thermal stress
• How to achieve reliability to avoid mission failure? New approach to Autonomous Repair of failed devices
Genetic Algorithms (GAs)
Mechanism coarsely modeled after neo-Darwinism (natural selection + genetics)
[Figure: GA cycle: start, population of candidate solutions, fitness function evaluates fitness of individuals, selection of parents, crossover and mutation of parents, offspring, replacement; the loop ends when the goal is reached]
Genetic Mechanisms
• Guided trial-and-error search techniques using principles of Darwinian evolution
  - iterative selection, “survival of the fittest”
  - genetic operators: mutation, crossover, …
  - implementor must define fitness function
• GAs frequently use strings of 1s and 0s to represent candidate solutions
  - Genotype chromosomes of GA operation: if 100101 is better than 010001 it will have more chance to breed and influence the future population
• Genotype changes during evolution must adhere to the Xilinx-defined bitstream format
  - To prevent undesirable conditions that may damage the FPGA, such as a mutation that ties two logic outputs together, a logical genotype is used for evolution and mapped to a physical phenotype
  - Logic # = functional logic index number for LUT; Row/Column = physical location of LUT in FPGA
• Can invoke Elitism Operator (E=1, E=2, …)
  - guarantees monotonically non-decreasing fitness of the best individual over all generations
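The GA cycle and the elitism operator described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `evolve`, the tournament selection, the single-point crossover, the parameter defaults and the use of `sum` (OneMax) as fitness are all hypothetical choices, and the sketch works on plain bit strings without the logical-genotype-to-bitstream mapping.

```python
import random

def evolve(fitness, length=6, pop_size=20, generations=50,
           crossover_rate=0.7, mutation_rate=0.05, elite=1, seed=0):
    """Toy bit-string GA; returns (best individual, best fitness per generation)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Elitism: copy the E best individuals unchanged into the next generation.
        next_pop = [ind[:] for ind in ranked[:elite]]
        while len(next_pop) < pop_size:
            # Selection of parents: binary tournament (an assumed scheme).
            p1 = max(rng.sample(pop, 2), key=fitness)
            p2 = max(rng.sample(pop, 2), key=fitness)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, length)      # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit independently with a small probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop                              # replacement
    return max(pop, key=fitness), history

best, history = evolve(sum)   # OneMax: fitness is the number of 1 bits
```

With `elite >= 1` the best individual always survives, so the recorded best fitness can stay flat but never drops from one generation to the next.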
Loosely Coupled Solution on Xilinx Virtex II Pro & Virtex 4
The Virtex II Pro or Virtex 4 is mounted on a development board, which is interfaced with a workstation running Xilinx EDK and ISE. The entire system operates on a 32-bit basis.
Organic Embedded System (OES) Architecture
One-dimensional, column-oriented OES based on Xilinx Virtex II Pro FPGA platform
• FEs and AEs reside on two distinct layers with an interconnection structure between them
• AEs and FEs can be realized in hardware, software, or co-design
• AE layer supervises functionality of FE elements while requiring no application-specific algorithms on the AE layer
• Observer/Controller architecture includes an AS element, which has no counterpart to evaluate whether the AS itself is fault-free; the proposed approach addresses this by minimizing AS complexity
• utilizes Xilinx partial reconfiguration technology to manipulate relocatable bitstreams
OES AE Component Design
AEs decentralize Observer/Controller functionality:
• Concurrent Error Detection (CED) unit collects 2 FE outputs for discrepancy identification
• A Checksum for AE fault detection, checked against Stored Checksum values
• Evaluator of outputs from 2 FEs against checksum, and Actuator which initiates recovery phase
An important architectural property is that all AE components are identical in structure even though they monitor different types of FEs. This homogeneity delivers a uniform-behavior property that is leveraged by the consensus-based evaluation fault-handling methodology.
OC Concept: although AE components add complexity to the design, they ease the fault-handling integration difficulties inherent in current commercial IP cores
Consensus-Based Evaluation (CBE)
• Uses a Relative Fitness Measure
  - Pairwise discrepancy checking yields relative fitness measure
  - Broad temporal consensus in the population is used to determine fitness metric
  - Transition between Fitness States occurs in the population
  - Provides graceful degradation in the presence of changing environments, applications and inputs, since this is a moving measure
• Test Inputs = Normal Inputs for Data Throughput
  - CBE uses neither additional functional test vectors nor resource test vectors
  - Potential for higher availability as regeneration is integrated with normal operation
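The pairwise discrepancy idea can be illustrated with a small simulation. This is a sketch under assumptions rather than the deck's actual procedure: configurations are modeled as Python functions, pairs and inputs are drawn uniformly at random, the fitness is the fraction of agreeing comparisons (in the spirit of f(Ci) = EC/EW from the terminology slide), and the names `consensus_fitness`, `healthy` and `faulty` as well as the specific fault are hypothetical.

```python
import random

def consensus_fitness(configs, inputs, window, seed=0):
    """Relative fitness from pairwise output comparison on ordinary inputs."""
    rng = random.Random(seed)
    agree_count = [0] * len(configs)
    eval_count = [0] * len(configs)
    for _ in range(window):
        i, j = rng.sample(range(len(configs)), 2)   # pick two competing configurations
        x = rng.choice(inputs)                      # normal throughput input, no test vector
        agree = configs[i](x) == configs[j](x)      # discrepancy check
        for k in (i, j):
            eval_count[k] += 1
            agree_count[k] += int(agree)
    # No oracle is used: fitness is agreement with peers only.
    return [a / e if e else 1.0 for a, e in zip(agree_count, eval_count)]

def healthy(x):
    a, b = x
    return a * b                      # 3-bit x 3-bit multiplier behavior

def faulty(x):
    a, b = x
    return 0 if a == 7 else a * b     # hypothetical fault affecting a few inputs

inputs = [(a, b) for a in range(8) for b in range(8)]
scores = consensus_fitness([healthy, healthy, faulty, healthy], inputs, window=6000)
```

Healthy configurations also lose a little fitness whenever they are paired with the faulty one, which is why the measure is relative: the faulty configuration stands out only because it disagrees more often than its peers.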
Genetic Operators: Mutation
Typical Approach: bit inversion of LUT functionality
Selected Approach: input interconnection of LUTs mutated
• Rearrange input interconnection to search unused LUT resources which occlude faulty resource
Mutation: Genotype chromosomes
• original functionality is F = F1·(F3+F4), with input F2 unassigned by synthesis tool
• mutation operator will change input F4 to the unused input, giving F = F1·(F3+F2)
• shadow shows changed input and LUT contents
Mutation: Phenotype chromosomes
• some opportunity for input stuck-at fault or LUT content stuck-at fault
• functionalities of LUTs remain undistorted while search space explored
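The effect of remapping a LUT input can be sketched with a toy 4-input LUT model. Everything here is an illustrative assumption: pins F1..F4 are numbered 0..3, the LUT is a Python dictionary, the fault is a stuck-at-0 on pin F4, and the helper names (`make_lut`, `evaluate`, `survives`) are hypothetical. The sketch shows that moving a signal from F4 to the unused F2 and rewriting the LUT contents accordingly preserves the logic function while avoiding the faulty pin.

```python
from itertools import product

def logic(a, b, c):
    """Intended function: a AND (b OR c), the form F1.(F3+F4) on the slide."""
    return a & (b | c)

def make_lut(pins):
    """LUT contents when the three logic signals are wired to the given pins."""
    return {bits: logic(*(bits[p] for p in pins))
            for bits in product((0, 1), repeat=4)}

def evaluate(table, pins, signals, stuck=None):
    """Drive the LUT pins with signal values; stuck=(pin, value) models a stuck-at fault."""
    bits = [0, 0, 0, 0]
    for pin, value in zip(pins, signals):
        bits[pin] = value
    if stuck is not None:
        pin, value = stuck
        bits[pin] = value
    return table[tuple(bits)]

def survives(pins, stuck):
    """True if the wired LUT still matches the intended logic for every input."""
    table = make_lut(pins)
    return all(evaluate(table, pins, s, stuck) == logic(*s)
               for s in product((0, 1), repeat=3))

ORIGINAL = (0, 2, 3)   # signals on F1, F3, F4; F2 unused
MUTATED = (0, 2, 1)    # signal moved from F4 to F2
FAULT = (3, 0)         # pin F4 stuck at 0
```

In this model `survives(ORIGINAL, FAULT)` is false while `survives(MUTATED, FAULT)` is true, since the mutated LUT contents no longer depend on the faulty pin.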
Genetic Operators: Cell Swapping
Cell-Swap operation on Genotype chromosomes interchanges two distinct LUT blocks while maintaining correct logic order and functionalities in genotype
• exchange all LUT input interconnections, LUT content and physical 2-tuple (Col#, Row#) as well as the logic sequence
Cell-Swap operation on Phenotype chromosomes
Genetic Operators: PMX Operator
Partial Match Crossover (PMX) maintains crossover information as well as order information
• two genotype configuration streams are aligned at LUT boundary
• crossover site selected at random along LUT boundary
• this crossover point defines a left/right partition used to effect crossover through LUT-by-LUT exchange
• suppose crossover point at position 4 of the LUT vector:
• first step is to map configuration B to configuration A by exchanging the following aligned LUTs: {(4, 7), (5, 2), (6, 1), (7, 5)}
• applying PMX results in two new configurations A’ and B’
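A swap-based PMX on plain permutations can be sketched as follows. This is a hedged illustration, not the deck's operator: LUT identifiers are modeled as integers, the crossover region is given by 0-based `start`/`end` indices, and the parent vectors below are hypothetical rather than the A and B from the slide. Each aligned pair in the region is exchanged inside the child, which keeps every LUT present exactly once.

```python
def pmx_swap(a, b, start, end):
    """Partial Match Crossover via aligned exchanges over positions [start, end)."""
    def apply(target, donor):
        child = list(target)
        pos = {value: index for index, value in enumerate(child)}
        for i in range(start, end):
            want, have = donor[i], child[i]
            if want == have:
                continue
            j = pos[want]                       # where the donor's value currently sits
            child[i], child[j] = want, have     # exchange the aligned pair
            pos[want], pos[have] = i, j
        return child
    return apply(a, b), apply(b, a)

parent_a = [1, 2, 3, 4, 5, 6, 7, 8]             # hypothetical LUT ordering
parent_b = [3, 7, 5, 1, 6, 8, 2, 4]             # hypothetical LUT ordering
child_a, child_b = pmx_swap(parent_a, parent_b, 3, 8)
```

Inside the crossover region each child carries the other parent's ordering, while outside it the values are only moved as far as needed to keep the result a valid permutation.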
Illustrative Example: Gate Level Design of OES
• Experiment circuit: 1-bit Full-adder
• Fault-free model: Duplex
• Fault-impact model: TMR
• Fault-detect model: CBE
• Fault recovery strategy: GA operation
• Experimental setup:
  - Hardware prototype implemented in Xilinx Virtex-II Pro FPGA
  - VHDL implementation
  - Using the GNAT library along with the MRRA framework and JTAG reconfiguration interface
MCNC-91 Benchmark Case Studies
Circuit Name    Circuit Function    Inputs    Outputs    Approximate Gates
z4ml            2-bit Add           7         4          20
cm85a           Logic               11        3          38
cm138a          Logic               6         8          17
System Availability under Multiple Faults
• Fc = number of correct behaviors of FE observed during evolutionary recovery phase
• Fe = number of errant or discrepant behaviors
• 1 = exactly one output required to detect the fault during the original CED configuration
• 2 = number of reconfigurations required, i.e. one from CED to TMR and one back from TMR to CED
• Fc1 & Fe1 = numbers of correct and faulty outputs of the FE during the AE repair period
• Fc2 & Fe2 = numbers of correct and faulty outputs during the FE repair period
• n = number of reconfigurations of the FE
Experimental Results
• Fault-free arrangement: CED FEs with cold standby FE
• Inject a stuck-at-zero or stuck-at-one fault at one of the FE’s LUT input pins
• CED -> TMR to identify faulty FE or AE
• CBE used to resolve faulty AE
Redundancy for both FE (RFE) and AE (RAE) = ratio of unused LUT inputs to total number of LUT inputs
Fc = number of correct behaviors of FE observed during evolutionary recovery phase
Fe = number of errant or discrepant behaviors
n = number of reconfigurations of the FE
β = ratio of reconfiguration time to computation time
Conclusion
• A self-adapting and self-healing OES architecture was developed for autonomic operation without human intervention.
• The OES architecture is capable of handling many single-fault scenarios and several multiple-fault scenarios for small digital logic designs.
• Experimental results support our design objectives: availability during the repair phase averaged 75.05%, 82.21%, and 65.21% for the z4ml, cm85a, and cm138a circuits respectively under stated conditions.
• The reconfiguration time ratio (β) is the key factor limiting availability during AE repair.
• Future work: evaluate extensions of the OES architecture addressing scalability in terms of pipelined stages.
Backup Slides • On following pages …
Isolation of a single faulty individual with 1-out-of-64 impact
[Figure: instantaneous DV (point values) for a sample individual in population and population oracles (solid lines), over a Sliding Window]
• Outliers are identified after EW iterations have elapsed
• Expected DV = (1/64)*600 = 9.375 from individual impacted by fault
• Isolated faulty individual’s DV differs from the average DV by 3 after 1 or more observation intervals of length EW
Future Work: Development Board to Self-Contained FPGA
Qualitative Analysis of CRR model
• Number of iterations and completeness of regeneration repair
• Percentage of time the device remains online despite physical resource fault (availability)
Hardware Resource Management
• Optimization of hardware profile for Xilinx Virtex II Pro
Field Testing on SRAM-based FPGA in a CubeSat mission
OES Integrated FE and AE Failure Detection Procedure
• System Initialization
  - FE Initialization step
  - Compute Checksum step
• FE Fault Detection/Recovery
  - AE-CED fault detection
  - FE fault-recovery
• AE fault detection Phase
  - A fault may exist in the CED, Actuator, or Evaluator,
  - A fault may exist in the Checksum component, or
  - A fault may exist in the Stored Checksum-LUT.
Runtime inputs to the FE are applied to both active instances under a CED strategy. After allowing for FE input propagation time through the AE, the expected output is supplied to the AE-CED for fault detection. The output of the FE is then compared in the AE-CED module and any
Previous Work Detection Characteristics of FPGA Fault-Handling Schemes … Strategy #1) Evolve redundancy into design before the anticipated failure or …
Previous Work Fault Recovery Characteristics of Selected Approaches … Strategy #2) Evolve recovery from specific failure after (and if) it occurs or …
CRR Arrangement in SRAM FPGA
Configurations in Population
• C = CL ∪ CR
• CL = subset of left-half configurations
• CR = subset of right-half configurations
• |CL| = |CR| = |C|/2
Discrepancy Operator
• Baseline Discrepancy Operator is dyadic operator with binary output:
• Z(Ci) is FPGA data throughput output of configuration Ci
• Each half-configuration evaluates using embedded checker (XNOR gate) within each individual
• Any fault in checker lowers that individual’s fitness so that individual is no longer preferred and eventually undergoes repair
WTA: = … (Equivalence); RS: = … (Hamming Distance)
Terminology and Characteristics
• Pristine Pool CP: any Ci ∈ C is a member of CP at generation G if and only if
• Suspect Pool CS: any Ci ∈ C is a member of CS at generation G if and only if at least one of
• Under Repair Pool CU: any Ci ∈ C is a member of CU at generation G if and only if
• Refurbished Pool CR: after a Genetic Operator is applied, the newly generated individual is a member of CR at generation G if and only if
ED is the Discrepancy Count of Ci and EC is the Correctness Count of Ci
Length of Evaluation Fitness Window: W = ED + EC
Fitness Metric: f(Ci) = EC / EW
Sketch of CRR Approach
Premise: Recovery Complexity << Design Complexity
1. Initialization
  - Population P of functionally-identical yet physically-distinct configurations
  - Partition P into sub-populations that use supersets of physically-distinct resources, e.g. size |P|/2 to designate physical FPGA left-half or right-half resource utilization
2. Fitness Assessment
  - Discrepancy Operator is some function of bitwise agreement between each half’s output
  - fitness assessment via pairwise discrepancy (temporal voting vs. spatial voting)
  - Four Fitness States defined for Configurations as {CP, CS, CU, CR} with transitions, respectively: Pristine, Suspect, Under Repair, Refurbished
  - Fitness Evaluation Window W determines comparison interval
3. Regeneration
  - Genetic Operators used to recover from fault based on Reintroduction Rate
  - Operators only applied once, then offspring returned to “service” without concern about increasing fitness
Configuration Health States Transitions during lifetime of ith Half-Configuration
Procedural Flow under Competitive Runtime Reconfiguration
Integrates all fault handling stages using EC strategy
- Detects faults by the occurrence of discrepancy
- Isolates faults by accumulation of discrepancies
- Failure-specific refurbishment using Genetic Operators: Intra-Module-Crossover, Inter-Module-Crossover, Intra-Module-Mutation
Realize online device refurbishment
- Refurbished online without additional function or resource test vectors
- Repair during the normal data throughput process
Fitness Evaluation Window
• Fitness Evaluation Window W: denotes number of iterations used to evaluate fitness before the state of an individual is determined
• Determination of W for 3 x 3 multiplier
  - 6 input pins articulating $2^6 = 64$ possible inputs
  - W should be selected so that all possible inputs appear
  - More formally:
    - Let rand(X) return some $x_i \in X$ at random
    - Seek W such that $\bigcup_{i=1}^{W} \mathrm{rand}(X) = X$ with high probability
• $x_K$ = distinct orderings of K inputs showing in D trials
• if D constant, can calculate $P_K$ for $K > 1$ successively
• probability $P_K$ of K inputs showing after D trials is the ratio $x_K / K^D$
W Determination When K=64:
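The coverage probability behind the choice of W can be computed directly. The sketch below assumes that inputs are drawn independently and uniformly from the K possible values, which the slides do not state explicitly, and uses the standard inclusion-exclusion formula for the probability that all K values appear in D trials; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def prob_all_inputs_seen(k, d):
    """P(all k equally likely inputs appear at least once in d uniform trials)."""
    # Inclusion-exclusion over the number j of inputs that are never drawn.
    return sum((-1) ** j * comb(k, j) * Fraction(k - j, k) ** d
               for j in range(k + 1))

def smallest_window(k, target):
    """Smallest number of trials whose coverage probability reaches `target`."""
    d = k                      # fewer than k trials can never show all k inputs
    while prob_all_inputs_seen(k, d) < target:
        d += 1
    return d

coverage_600 = float(prob_all_inputs_seen(64, 600))   # K = 64 inputs, W = 600
```

Exact rational arithmetic is used because the alternating sum cancels heavily in floating point. Under the uniformity assumption, a window of 600 iterations shows all 64 inputs with probability above 0.99.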
Integer Multiplier Case Study
• 3-bit x 3-bit unsigned multiplier automated design:
  - Building blocks
    - Half-Adder: 18 templates created
    - Full-Adder: 24 templates
    - Parallel-And: 1 template created
  - Randomly select templates for instantiation in modules
GA parameters
• Population size: 20 individuals
• Crossover rate: 5%
• Mutation rate: up to 80% per bit
GA operators
• External-Module-Crossover
• Internal-Module-Mutation
Experimental Evaluation
• Xilinx Virtex II Pro on Avnet PCI board
Experiments Demonstrate …
• Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
• Elimination of additional test vectors
• Temporal Assessment process
Template Fault Coverage
[Figure: Half-Adder Template A and Half-Adder Template B]
Template A
- Gate 3 is an AND gate
- Will lose correctness if a Stuck-At-Zero fault occurs in the second input line of Gate 3
Template B
- Gate 3 is a NOT gate and only uses the first input line
- Will work correctly even if the second input line is stuck at Zero or One
Regeneration Performance
Parameters:
• Difference (vs. Hamming Distance)
• Evaluation Window: Ew = 600
• Suspect Threshold: S = 1 - 6/600 = 99%
• Repair Threshold: R = 1 - 4/600 = 99.3%
• Re-introduction rate: r = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
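The threshold arithmetic on this slide can be checked with a few lines of Python. The helper names are hypothetical, and reading the 6 and 4 in the threshold formulas as discrepancy counts within the window is an assumption drawn from the form 1 - n/600. The expected fitness of an individual with a 1-out-of-64 fault is derived from f = EC/EW with EC = EW - ED.

```python
def expected_discrepancies(window, faulty_inputs, total_inputs):
    """Expected discrepancy count when faulty inputs occur uniformly at random."""
    return window * faulty_inputs / total_inputs

def threshold(discrepancies, window):
    """Fitness value corresponding to a given discrepancy count in the window."""
    return 1 - discrepancies / window

EW = 600
suspect_threshold = threshold(6, EW)                # S, about 0.99
repair_threshold = threshold(4, EW)                 # R, about 0.993
expected_dv = expected_discrepancies(EW, 1, 64)     # (1/64) * 600
expected_fitness = 1 - expected_dv / EW             # f = EC / EW with EC = EW - ED
```

Numerically, a fault that affects only 1 of the 64 inputs yields an expected fitness of about 0.984 over one window, which lies below S = 0.99.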
Isolation of a single faulty individual with 1-out-of-64 impact
• Outliers are identified after W iterations elapsed
• E.V. = (1/64)*600 = 9.375 from minimum impact faulty individual
• Isolated individual’s f differs from the average DV by 3 after 1 or more observation intervals of length W