EXPLODE: a Lightweight, General System for Finding Serious Storage System Errors
Junfeng Yang, Can Sar, Dawson Engler
Stanford University

Why check storage systems?
- Storage system errors are among the worst
  - Kernel panic, data loss and corruption
- Complicated code, hard to get right
  - Simultaneously worry about speed, failures and crashes
- Hard to comprehensively test for failures, crashes
Goal: comprehensively check many storage systems with little work

EXPLODE summary
- Comprehensive: uses ideas from model checking
- Fast, easy
  - Check a storage system: 200 lines of C++ code
  - Main requirement: 1 device driver. Runs on Linux, FreeBSD
- General, real: check live systems
  - Can run, can check, even without source code
- Effective
  - Checked 10 Linux FS, 3 version control software, Berkeley DB, Linux RAID, NFS, VMware GSX 3.2/Linux
  - Bugs in all, 36 in total, mostly data loss
- Subsumes our old work FiSC [OSDI 2004]

Checking complicated stacks
- All real
- Stack of storage systems
  - subversion: an open-source version control software
- User-written checker on top
- Recovery tools run after EXPLODE-simulated crashes
[Figure: checking stack, top to bottom: subversion checker ("ok?"), subversion, NFS client, loopback, NFS server, JFS, software RAID 1, checking disk and crash disk. Recovery tools shown beside the stack: %svnadm.recover, %fsck.jfs, %mdadm --assemble --run --force --update=resync, %mdadm -a]

Outline
- Core idea
- Checking interface
- Implementation
- Results
- Related work, conclusion and future work

Core idea: explore all choices
- Bugs are often triggered by corner cases
- How to find: drive execution down to these tricky corner cases
When execution reaches a point in the program that can do one of N different actions, fork execution and in the first child do the first action, in the second do the second, etc.

External choices
- Fork and do every possible operation
[Figure: from a file system state /root (a, b, c), execution forks on creat, link, unlink, mkdir, rmdir, …]
- Explore generated states as well
- Users write code to check. EXPLODE “amplifies” the checks
- Speed hack: hash states, discard if seen
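The "hash states, discard if seen" speed hack can be pictured with a small visited set. This is a minimal sketch under assumptions, not EXPLODE's code: `State` is a hypothetical stand-in (a serialized string) for a snapshot of the checked system, and `VisitedSet` is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for a snapshot of the checked system:
// here just a serialized description of the file system tree.
using State = std::string;

class VisitedSet {
public:
    // Returns true if the state's hash has not been seen before (explore it),
    // false if the same hash was recorded earlier (discard it).
    bool insert_if_new(const State& s) {
        return seen_.insert(std::hash<State>{}(s)).second;
    }

    std::size_t size() const { return seen_.size(); }

private:
    // Only hashes are stored, so two different states that collide
    // would be treated as one; that is the price of the speed hack.
    std::unordered_set<std::size_t> seen_;
};
```

Storing only the hash keeps memory small; a search loop would call `insert_if_new` on every generated state and skip the ones that return false.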

Internal choices
- Fork and explore internal choices
[Figure: during creat on /root (a, b, c), execution forks on internal choices such as "kmalloc returns 0" and "Buffer cache misses"]

How to expose choices
- To explore an N-choice point, users instrument code using choose(N)
- choose(N): N-way fork, return K in K’th kid

    void* kmalloc(size s) {
        if (choose(2) == 0)
            return NULL;
        … // normal memory allocation
    }

- We instrumented 7 kernel functions in Linux
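One user-level way to picture choose(N) without a real fork is to re-run the instrumented code once per sequence of choices. The sketch below is an illustration under that assumption, not EXPLODE's implementation; `Explorer`, `explore` and `advance` are hypothetical names, and the body is assumed to be deterministic apart from its choose() calls.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

class Explorer {
public:
    // Conceptual N-way fork: replays recorded choices first, then takes
    // choice 0 at any choice point reached for the first time.
    int choose(int n) {
        assert(n > 0);
        if (pos_ == trace_.size()) {
            trace_.push_back(0);
            limits_.push_back(n);
        }
        return trace_[pos_++];
    }

    // Runs `body` once per distinct choice sequence; returns the run count.
    int explore(const std::function<void(Explorer&)>& body) {
        trace_.clear();
        limits_.clear();
        int runs = 0;
        do {
            pos_ = 0;
            body(*this);
            ++runs;
        } while (advance());
        return runs;
    }

private:
    // Steps to the next sequence: bump the last choice, and backtrack
    // past choice points whose alternatives are exhausted.
    bool advance() {
        while (!trace_.empty()) {
            if (++trace_.back() < limits_.back()) return true;
            trace_.pop_back();
            limits_.pop_back();
        }
        return false;
    }

    std::vector<int> trace_;   // choice taken at each choice point
    std::vector<int> limits_;  // N for each choice point
    std::size_t pos_ = 0;      // next choice point in the current run
};
```

For a body that calls choose(2) and, only when that returned 0, calls choose(3), `explore` runs the body four times: three runs under the first branch and one under the second.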

Crashes
- Dirty blocks can be written in any order, crash at any point
- Write all subsets
- Users write code to check recovered FS
[Figure: after creat on /root (a, b, c), each subset of dirty buffer cache blocks is written to a disk; every resulting disk is run through fsck and then checked]
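"Write all subsets" can be sketched as a loop over bitmasks. This is a toy model under assumptions: `Disk` is a hypothetical map from block number to contents, `crash_disks` is an illustrative name, and each block number is assumed to appear at most once in the dirty list.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical disk model: block number -> block contents.
using Disk = std::map<int, std::string>;
using DirtyBlock = std::pair<int, std::string>;

// Returns every disk image a crash could leave behind when any subset
// of the dirty blocks has reached the disk.
std::vector<Disk> crash_disks(const Disk& disk,
                              const std::vector<DirtyBlock>& dirty) {
    assert(dirty.size() < 20);  // 2^n images; keep the sketch small
    std::vector<Disk> out;
    const std::size_t total = std::size_t{1} << dirty.size();
    for (std::size_t mask = 0; mask < total; ++mask) {
        Disk d = disk;  // start from the on-disk state before the crash
        for (std::size_t i = 0; i < dirty.size(); ++i) {
            if (mask & (std::size_t{1} << i)) {
                d[dirty[i].first] = dirty[i].second;  // this block was written
            }
        }
        out.push_back(std::move(d));
    }
    return out;
}
```

With two dirty blocks this yields four images: nothing written, only the first, only the second, and both. Each image would then go through the recovery tool and the user's check.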

Outline
- Core idea: explore all choices
- Checking interface
  - What EXPLODE provides
  - What users do to check their storage system
- Implementation
- Results
- Related work, conclusion and future work

What EXPLODE provides
- choose(N): conceptual N-way fork, return K in K’th child execution
- check_crashes_now(): check all crashes that can happen at the current moment
  - Paper has more methods for checking crashes
  - Users embed non-crash checks in their code. EXPLODE amplifies them
- err(): record trace for deterministic replay

What users do
Example: ext3 on RAID
- checker: drive ext3 to do something: mutate(), then verify what ext3 did was correct: check()
- storage component: set up, repair and tear down ext3, RAID. Write once per system
- assemble a checking stack

FS Checker: mutate
[Diagram labels: FS Checker, ext3 Component, Stack]
- choose(4): creat file, rm file, mkdir, rmdir
- sync, fsync
[Figure: two copies of the directory tree …/0 1 2 3 4]

FS Checker: check
[Diagram labels: FS Checker, ext3 Component, Stack]
- Check file exists
- Check file contents match
- Found JFS fsync bug, caused by reusing directory inode as file inode
- Checkers can be simple (50 lines) or very complex (5,000 lines)
- Whatever you can express in C++, you can check
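The mutate()/check() split can be illustrated with a toy checker. Everything here is an assumption made for illustration: `ToyFs` is an in-memory stand-in for the file system under test, `FsChecker` and its signatures are hypothetical rather than EXPLODE's real interface, and the choice is passed in as an argument where a real checker would obtain it from choose(4).

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Toy stand-in for the file system under test (not ext3).
struct ToyFs {
    std::map<std::string, std::string> files;  // path -> contents
    std::set<std::string> dirs;
};

class FsChecker {
public:
    explicit FsChecker(ToyFs& fs) : fs_(fs) {}

    // mutate(): drive the file system with one of four operations and
    // remember what the result should look like.
    void mutate(int choice) {
        assert(choice >= 0 && choice < 4);
        switch (choice) {
        case 0: fs_.files["/0"] = "data"; want_files_["/0"] = "data"; break;  // creat file
        case 1: fs_.files.erase("/0");    want_files_.erase("/0");    break;  // rm file
        case 2: fs_.dirs.insert("/1");    want_dirs_.insert("/1");    break;  // mkdir
        case 3: fs_.dirs.erase("/1");     want_dirs_.erase("/1");     break;  // rmdir
        }
    }

    // check(): files and directories exist with the expected contents.
    bool check() const {
        return fs_.files == want_files_ && fs_.dirs == want_dirs_;
    }

private:
    ToyFs& fs_;
    std::map<std::string, std::string> want_files_;
    std::set<std::string> want_dirs_;
};
```

If the toy file system loses a file after a mutate() call, check() returns false; in the talk's setting that is the point where an error trace would be recorded.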

ext3 Component
[Diagram labels: FS Checker, ext3 Component, Stack]
- storage component: initialize, repair, set up, and tear down your system
  - Wrappers to existing utilities
- threads(): returns list of kernel thread IDs for deterministic error replay
- Write once per system
- Real code on next slide

[Slide with code listing; only the diagram labels survive: FS Checker, ext3 Component, Stack]

Stack
[Diagram labels: FS Checker, ext3 Component, Stack]
- assemble a checking stack
- Let EXPLODE know how subsystems are connected together, so it can initialize, set up, tear down, repair the entire stack
- Real code on next slide

[Slide with code listing; only the diagram labels survive: FS Checker, ext3 Component, Stack]

Outline
- Core idea: explore all choices
- Checking interface: 200 lines of C++ to check a system
- Implementation
  - Checkpoint and restore states
  - Deterministic replay
  - Checking process
  - Checking crashes
  - Checking “soft” application crashes
- Results
- Related work, conclusion and future work

Recall: core idea
- “Fork” at decision point to explore all choices
- state: a snapshot of the checked system

How to checkpoint live system?
- Hard to checkpoint live kernel memory
  - VM cloning is heavy-weight
- checkpoint: record all choose() returns from S0
- restore: umount, restore S0, re-run code, make K’th choose() return K’th recorded value
- Key to EXPLODE approach
[Figure: S0, then choices 2 and 3, lead to S; S = S0 + redo (2, 3)]
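Checkpoint-as-trace can be shown on a toy deterministic system. This is a sketch under assumptions: `ToySystem` and `restore` are hypothetical names, and the "system" is a single integer that each choice updates in a fixed way, standing in for re-running the checked code from S0.

```cpp
#include <cassert>
#include <vector>

// Toy deterministic system: the state is one integer and each choice K
// applies a fixed update. Stands in for the checked storage stack.
struct ToySystem {
    long value = 1;  // the initial state S0
    void apply(int choice) { value = value * 10 + choice; }
};

// A checkpoint is just the list of choose() return values since S0.
using Checkpoint = std::vector<int>;

// restore: start again from S0 and redo the recorded choices in order.
ToySystem restore(const Checkpoint& cp) {
    ToySystem s;  // fresh S0
    for (int c : cp) {
        s.apply(c);  // K'th step replays the K'th recorded value
    }
    return s;
}
```

Because the system is deterministic, replaying the trace (2, 3) from S0 reproduces the same state that the live run reached, which is the slide's S = S0 + redo (2, 3). The checkpoint costs a few integers instead of a copy of kernel memory.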

Deterministic replay
- Need it to recreate states, diagnose bugs
Sources of non-determinism
- Kernel choose() can be called by other code
  - Fix: filter by thread IDs. No choose() in interrupt
- Kernel threads
  - Opportunistic hack: setting priorities. Worked well
  - Can’t use lock: deadlock. A holds lock, then yields to B
- Other requirements in paper
- Worst case: non-repeatable error. Automatically detect and ignore

EXPLODE: put it all together
[Figure: architecture diagram with boxes labeled EXPLODE, User code and EKM]
EKM = EXPLODE device driver

Outline
- Core idea: explore all choices
- Checking interface: 200 lines of C++ to check a system
- Implementation
- Results
  - Lines of code
  - Errors found
- Related work, conclusion and future work

EXPLODE core lines of code

                   Linux                        FreeBSD
Kernel patch       1,915 (+ 2,194 generated)    1,210
User-level code    6,323

3 kernels: Linux 2.6.11, 2.6.15, FreeBSD 6.0. FreeBSD patch doesn’t have all functionality yet

Checkers lines of code, errors found

Storage System Checked      Component    Checker      Bugs
10 file systems             744          5,477        18
Storage applications
  CVS                       27           68           1
  Subversion                Not ported yet            1
  “EXPENSIVE”               30           124          3
  Berkeley DB               82           202          6
Transparent subsystems
  RAID                      144          FS + 137     2
  NFS                       34           FS           4
  VMware GSX/Linux          54           FS           1
Total                       1,115        6,008        36

Outline
- Core idea: explore all choices
- Checking interface: 200 lines of C++ to check a system
- Implementation
- Results
  - Lines of code
  - Errors found
- Related work, conclusion and future work

FS Sync checking results
[Table of per-file-system sync checking results; the cells and the marker that indicates a failed check are not preserved]
- Apps rely on sync operations, yet they are broken

ext2 fsync bug
Events to trigger bug:
- truncate A
- creat B
- write B
- fsync B
- crash!
- fsck.ext2
[Figure: memory (Mem) and disk (Disk) views of files A and B and an indirect block]
Bug is fundamental due to ext2 asynchrony

Classic app mistake: “atomic” rename
- Atomically update file A to avoid corruption

    fd = creat(A_tmp, …);
    write(fd, …);
    fsync(fd);
    close(fd);
    rename(A_tmp, A);

- Problem: rename guarantees nothing about data
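The update sequence on this slide can be written out with error checking. The following is a minimal sketch using POSIX calls, not code from the talk: `atomic_update` and the ".tmp" suffix are hypothetical choices, and interrupted writes (EINTR) are treated as plain failures to keep it short.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Replaces the contents of `path` with `data` by writing a temporary
// file, flushing it, and renaming it over the original.
bool atomic_update(const std::string& path, const std::string& data) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";  // hypothetical temp-file name
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;

    std::size_t off = 0;
    while (off < data.size()) {  // write() may write less than requested
        ssize_t n = ::write(fd, data.data() + off, data.size() - off);
        if (n < 0) {
            ::close(fd);
            ::unlink(tmp.c_str());
            return false;
        }
        off += static_cast<std::size_t>(n);
    }

    // rename says nothing about file data, so flush the data first.
    if (::fsync(fd) != 0) {
        ::close(fd);
        ::unlink(tmp.c_str());
        return false;
    }
    if (::close(fd) != 0) {
        ::unlink(tmp.c_str());
        return false;
    }
    if (std::rename(tmp.c_str(), path.c_str()) != 0) {
        ::unlink(tmp.c_str());
        return false;
    }
    return true;
}
```

Dropping the fsync call is the variant that leaves the new name pointing at data that may never have reached the disk. Some systems also need an fsync on the containing directory before the rename itself is durable; that step is outside this sketch.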

Outline
- Core idea: explore all choices
- Checking interface: 200 lines of C++ to check a system
- Implementation
- Results: checked many systems, found many bugs
- Related work, conclusion and future work

Related work
- FS testing
  - IRON
- Static analysis
- Traditional software model checking
- Theorem proving
- Other techniques

Conclusion and future work
- EXPLODE
  - Easy: need 1 device driver. Simple user interface
  - General: can run, can check, without source
  - Effective: checked many systems, 36 bugs
- Future work:
  - Work closely with storage system implementers to check more systems and more properties
  - Smart search
  - Automatic diagnosis
  - Automatically inferring “choice points”
- Approach is general, applicable to distributed systems, secure systems, …