a96c8e414e7c1b5eaa2d0e3b3e917c69.ppt
- Number of slides: 33
Coda Server Internals Peter J Braam
Contents
- Data structure overview
- Volumes
- Vnodes
- Inodes
Data Structure Overview (object: purpose; resides where)
- Inodes: file contents; /vicep* partitions
- Volumes, vnodes, directory contents, ACLs, reslogs: metadata & dir contents; RVM
- Volinfo records: volume location; VLDB, VRDB (RW db files)
- VSGDB, Pdb records, tokens: security; VSGDB, .pdb, .tk files (dynamic RO db files)
- Servers/SCM, partitions, startup flags, skipvolumes, LOG & DATA & DB locators: configuration data; static data
RVM layout (coda_globals.h)
- Already_initialized (int)
- struct VolHead[MAXVOLS]
- struct VnodeDiskObject *SmallVnodeFreeLists[SM_FREESIZE]
- short SmallVnodeIndex
- … same for large …
- MaxVolId (unsigned long)
- Remainder is dynamically allocated
Volume zoo
- RVM structures:
  - VolumeData
  - VolHead
  - VolumeHeader
  - VolumeDiskData (volume.h, camprivate.h)
- VM structures:
  - VolumeInfo …
A volume in RVM
- VolHead
  - VolumeHeader: stamp, id, parentid, type
  - VolumeData: *volumeDiskData, *smallVnodeLists, nsmallVnodes, nsmallLists (same for big)
- VolumeData contains pointers to RVM-malloced data
VolumeDiskData (rvm)
- Lots of stuff:
  - Identity & location: partition, name
  - Runtime info: use, inService, blessed, salvaged
  - Vnode related: next uniquefier
  - Version vector
  - Resolution flags, pointer to recov_vol_log
  - Quota
  - Resource usage: filecount, diskused, etc.
Volumes in VM
- struct Volumes sit in VolHash with copies of RVM data structures
- Salvage before “attaching” to VolHash
- Model of operation (FS):
  - GetVolume copies out from RVM
  - Do your mods in VM
  - PutVolume does RVM transaction
- Model of operation (Volutil):
  - operate on RVM
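The FS model of operation above (copy out, modify in VM, commit in one transaction) can be sketched roughly as follows. This is an illustration only: `rvm_store`, `vol_hash`, `rvm_transaction`, `get_volume` and `put_volume` are hypothetical Python stand-ins, not the server's actual functions, and a plain dict plays the role of recoverable memory.

```python
import copy
from contextlib import contextmanager

# Hypothetical stand-in for recoverable memory: volume id -> volume record.
rvm_store = {7: {"name": "u.demo", "filecount": 3}}
vol_hash = {}  # VM-side table of attached volumes (plays the role of VolHash)


@contextmanager
def rvm_transaction(store):
    """Collect staged writes; apply them only if the body finishes normally."""
    staged = {}
    yield staged
    store.update(staged)  # skipped when the body raised an exception


def get_volume(vol_id):
    """Copy the record out of RVM so callers modify a VM copy only."""
    if vol_id not in vol_hash:
        vol_hash[vol_id] = copy.deepcopy(rvm_store[vol_id])
    return vol_hash[vol_id]


def put_volume(vol_id):
    """Write the VM copy back to RVM inside one transaction."""
    with rvm_transaction(rvm_store) as staged:
        staged[vol_id] = copy.deepcopy(vol_hash[vol_id])


vol = get_volume(7)
vol["filecount"] += 1                       # the modification happens in VM
seen_in_rvm_before_put = rvm_store[7]["filecount"]
put_volume(7)                               # commit to RVM
```

Until `put_volume` runs, the RVM record is untouched, which is the point of the copy-out step.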
Volumes in Venus RPCs
- One RPC: GetVolInfo
  - used for mount point traversal
- Only relates to:
  - volume location database
  - volume replication database
  - VSGDB
- Could sit in a separate Volume Location Server
Vnodes (cvnode.h)
- Small & large: large for directories
  - difference is ACL at back of large vnodes
- Inode field:
  - small vnodes: points to diskfile inode number
  - large vnodes: is RVM address of dir inode
- Contain important small structure: vv_t
- Pointers to reslog entries
- VM: cvnodes with hash table, freelists, etc.
Vnodes in RVM
- RVM: VnodeDiskinfo (rvm_malloced)
- vnodes sit on rec_smolists
  - each link points to a DiskVnode
  - lists link vnodes with identical vnodenumbers but different uniquefiers
  - new vnodes grabbed from FreeLists (index.cc, recov{a,b,c}.cc)
  - volumes have arrays of rec_smolists which grow when they are full
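A minimal sketch of the list-per-vnode-number idea, to make the lookup concrete. Everything here is illustrative: the class names are hypothetical, the slot index is assumed to equal the vnode number, and doubling is an assumed growth policy; the real mapping and growth rule live in the files named above.

```python
from dataclasses import dataclass, field


@dataclass
class DiskVnode:
    vnode: int
    unique: int
    data: str = ""


@dataclass
class VolumeVnodeIndex:
    """Array of lists; each list holds vnodes that share a vnode number."""
    slots: list = field(default_factory=lambda: [[] for _ in range(4)])

    def insert(self, v):
        # Assumption: slot index == vnode number; the array doubles when full.
        while v.vnode >= len(self.slots):
            self.slots.extend([[] for _ in range(len(self.slots))])
        self.slots[v.vnode].append(v)

    def lookup(self, vnode, unique):
        """Walk the list for this vnode number until the uniquefier matches."""
        if vnode >= len(self.slots):
            return None
        for v in self.slots[vnode]:
            if v.unique == unique:
                return v
        return None


index = VolumeVnodeIndex()
index.insert(DiskVnode(2, 10, "old"))
index.insert(DiskVnode(2, 11, "new"))   # same vnode number, new uniquefier
index.insert(DiskVnode(9, 12, "far"))   # forces the array to grow
```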
Vnodes in action
- Model:
  - GetFSObj calls GetVnode
  - work is done
  - PutFS Objects calls
    - rvm_begin_transaction
    - ReplaceVnode: copies data from VM to RVM
    - rvm_end_transaction
- Getting a vnode takes 3 pointer derefs, possibly 3 page faults, vs. 1 for local file systems.
- Is this necessary? Probably not. Cure it: yes!
Directories (rvm)
- DirInode
  - page table and “copy on write” refcount
- DirPages, 2048 bytes each
  - build up the directory
  - divided into 64 32-byte blobs
  - Hash table for fast name lookups
  - Blob freelist
  - Array of free blobs per page
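The page arithmetic (2048 / 32 = 64 blobs) and a per-page free count can be sketched as below. The allocator is a guess at the general shape, not the server's code: contiguous allocation is an assumption, and `header_bytes` is a hypothetical per-entry overhead chosen only for illustration.

```python
PAGE_SIZE = 2048   # bytes per directory page (from the slide)
BLOB_SIZE = 32     # bytes per blob (from the slide)
BLOBS_PER_PAGE = PAGE_SIZE // BLOB_SIZE


class DirPage:
    def __init__(self):
        self.used = [False] * BLOBS_PER_PAGE

    def free_blobs(self):
        """The value a per-page free-blob array would record for this page."""
        return self.used.count(False)

    def allocate(self, nblobs):
        """Claim the first run of nblobs contiguous free blobs, else None."""
        run = 0
        for i, taken in enumerate(self.used):
            run = 0 if taken else run + 1
            if run == nblobs:
                start = i - nblobs + 1
                self.used[start:i + 1] = [True] * nblobs
                return start
        return None


def blobs_needed(name, header_bytes=12):
    # header_bytes is a hypothetical overhead, not a value taken from Coda.
    total = header_bytes + len(name.encode()) + 1   # +1 for a terminator
    return -(-total // BLOB_SIZE)                   # ceiling division


page = DirPage()
first = page.allocate(blobs_needed("README"))
second = page.allocate(3)
```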
Directories
- More than one vnode can point to a directory (copy on write)
- VM: hash table of DirHandles
  - point to VM contiguous copy of dir
  - point to DirInode
  - have a lock, etc.
- Model: as for volumes & vnodes
- Critique: too baroque
Files
- Vnode references file by InodeNumber
- Files are copy on write
- There are “FileInodes” like dir inodes, but they are held in external DB or in inode itself
- Server always reads/writes whole files (could be exploited)
Volinit and salvage
- Set up volume hash table, serverlist, DiskPartitionList
- Cycle through partitions, check each for:
  - list of inodes
  - every inode has a vnode
  - every vnode has a directory name
  - every directory name has a vnode
- Put volume in a VM hash table
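The three cross-checks can be pictured as set comparisons. The sketch below flattens a volume into three simple collections, which is a simplifying assumption; `salvage_check` and its argument shapes are hypothetical and do not mirror the salvager's real interfaces.

```python
def salvage_check(inodes, vnode_to_inode, dir_entries):
    """Return the three kinds of inconsistency named on the slide.

    inodes:         set of inode numbers found on the partition
    vnode_to_inode: dict, vnode number -> inode number it references
    dir_entries:    dict, name -> vnode number, gathered from directories
    """
    referenced = set(vnode_to_inode.values())
    named = set(dir_entries.values())
    return {
        "inodes_without_vnode": sorted(inodes - referenced),
        "vnodes_without_name": sorted(set(vnode_to_inode) - named),
        "names_without_vnode": sorted(
            n for n, v in dir_entries.items() if v not in vnode_to_inode
        ),
    }


report = salvage_check(
    inodes={100, 101, 102},
    vnode_to_inode={1: 100, 2: 101},
    dir_entries={"a": 1, "ghost": 5},
)
```

A clean volume yields three empty lists; anything else is work for the salvager before the volume is attached.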
Server connection info
- Array of HostEntry (a “venus”)
  - Contains a linked list of connections
  - Contains a callback connection id
- Connection setup
  - first binding creates a host & callback conn
  - new binding creates a new connection and verifies callback
  - in RPC2_NewBinding & ViceNewConnectFS
Callbacks
- Hashtable of FileEntries:
  - each contains Fid
  - number of users
  - linked list of callbacks
- Callbacks: point to HostEntry
- Ops:
  - RPC: BreakCallBack
  - Local: placing, deleteVenus
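A rough sketch of the table and its three operations. It uses a Python set per Fid instead of a linked list with a user count, and it assumes the writer keeps its own callback when others are broken; `CallbackTable` and its method names are hypothetical.

```python
from collections import defaultdict


class CallbackTable:
    def __init__(self):
        # fid -> set of hosts holding a callback promise on that object
        self.file_entries = defaultdict(set)

    def add_callback(self, fid, host):
        """Local op: place a callback for host on fid."""
        self.file_entries[fid].add(host)

    def break_callbacks(self, fid, writer):
        """Return the hosts to notify; only the writer's promise survives."""
        holders = self.file_entries.pop(fid, set())
        if writer in holders:
            self.file_entries[fid] = {writer}   # assumption: writer keeps it
        return sorted(holders - {writer})

    def delete_venus(self, host):
        """Local op: forget every callback held by a departing client."""
        for fid in list(self.file_entries):
            self.file_entries[fid].discard(host)
            if not self.file_entries[fid]:
                del self.file_entries[fid]


table = CallbackTable()
fid = (1, 2, 3)                      # (volume, vnode, uniquefier)
for host in ("a", "b", "c"):
    table.add_callback(fid, host)
notified = table.break_callbacks(fid, writer="b")
```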
Callbacks
- Connection is non-authenticated. Should be fixed. Session key for CB connection should not expire.
- Side effect of callback connection is used for BackFetch bulk transfer of files during reintegration.
RPC processing
- Venus RPCs:
  - srvproc.cc - standard file ops
  - srvproc2.cc - standard volume ops
  - codaproc.cc - repair stuff
  - codaproc2.cc - reintegration stuff
- Volutil RPCs:
  - vol-your-rpc.cc (in coda-src/volutil)
- Resolution: below
RPC processing
- RPC structure:
  - ValidateParms: validate, hand off COP2, cid
  - GetObject: VM copy, lock objects
  - CheckSemantics:
    - Concurrency, Integrity, Permissions
  - Perform operations:
    - BulkTransfer, UpdateObjects, OutParms
  - PutObject: RVM transactions, inode deletions
vlists
- GetFSObjects: instantiate a vlist
  - RPC needs list of objects copied from RVM
  - Modification status is held there (did CopyOnWrite kick in, etc.)
- PutObjects
  - rvm_begin_transaction
  - walk through the list, copy, rvm_set_range, unlock
  - rvm_end_transaction
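The walk in PutObjects can be sketched as a single pass that commits only the modified copies and unlocks everything. This is a simplified illustration: `VlistEntry` and `put_objects` are hypothetical names, a dict stands in for RVM, and one `dict.update` stands in for the begin/set_range/end sequence.

```python
class VlistEntry:
    def __init__(self, fid, rvm):
        self.fid = fid
        self.vm_copy = dict(rvm[fid])   # copied out of RVM when the RPC starts
        self.dirty = False              # modification status lives on the entry
        self.locked = True


def put_objects(vlist, rvm):
    """Walk the vlist once and commit all dirty copies together."""
    staged = {}                                      # "begin transaction"
    for entry in vlist:
        if entry.dirty:
            staged[entry.fid] = dict(entry.vm_copy)  # copy + declare the range
        entry.locked = False                         # unlock as we go
    rvm.update(staged)                               # "end transaction"
    return sorted(staged)


rvm = {"f1": {"len": 1}, "f2": {"len": 2}}
vlist = [VlistEntry("f1", rvm), VlistEntry("f2", rvm)]
vlist[0].vm_copy["len"] = 10
vlist[0].dirty = True            # f1 was modified by the RPC
vlist[1].vm_copy["len"] = 99     # f2 was touched but never marked dirty
committed = put_objects(vlist, rvm)
```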
COP2 handling
- In COP2, Venus gives the final VV to the server
- COP2s are sent out by Venus (with some delay), often piggybacked in bulk
- server knows about pending COP2 entries in hash table (coppend.cc)
- Manager thread CopPendingManager
  - Runs every minute.
  - Removes entries more than 900 secs old
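The pending table and the manager's periodic sweep might look roughly like this. The 900-second limit and the one-minute interval come from the slide; keying entries by store id and the names `CopPendingTable`, `add`, `dequeue` and `sweep` are assumptions made for the sketch.

```python
COP2_MAX_AGE = 900      # seconds (from the slide)
SWEEP_INTERVAL = 60     # seconds ("runs every minute")


class CopPendingTable:
    def __init__(self):
        self.pending = {}            # store id -> time the entry was added

    def add(self, storeid, now):
        self.pending[storeid] = now

    def dequeue(self, storeid):
        """Called when the COP2 for storeid arrives; True if it was pending."""
        return self.pending.pop(storeid, None) is not None

    def sweep(self, now):
        """One manager pass: drop entries more than COP2_MAX_AGE seconds old."""
        expired = [s for s, t in self.pending.items() if now - t > COP2_MAX_AGE]
        for s in expired:
            del self.pending[s]
        return expired


table = CopPendingTable()
table.add("store-a", now=0)
table.add("store-b", now=500)
at_limit = table.sweep(now=900)                    # exactly 900 s old: kept
after_limit = table.sweep(now=900 + SWEEP_INTERVAL)
```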
COP2 to RVM
- Data can be
  - piggybacked on another RPC
  - sent in ViceCop2 RPC
- Both cases call InternalCop2 (srvproc.cc)
- InternalCop2 (codaproc.cc)
  - notifies the manager to dequeue
  - gets the FS objects listed for the COP2
  - installs final VVs into RVM (rvm transaction!)
COP2 Problems
- Easy cause of conflicts in replicated volumes when clients access objects in rapid succession. (Can be fixed easily during the writeback caching operation)
- Not optimized for singly replicated volume.
Resolution
- Initiated by client with RPC to coordinator
  - ViceResolve (codaproc.cc)
- coordinator
  - sets up connections in VSG (unauthenticated)
  - LockAndFetch (res/reslock, resutil):
    - lock volumes,
    - collect “closure”
Resolution - special cases
- RegResDirRequired (rvmres/rvmrescoord.cc)
- check for
  - unresolved ancestors
  - already inconsistent
  - runts (missing objects)
  - weak equality (identical storeid)
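To make the weak-equality case concrete, here is a small sketch of version-vector comparison. It assumes version vectors are equal-length tuples of per-server counters and that each replica also records the store id of its last update; `compare_vv` and `weakly_equal` are hypothetical helpers, not functions from the server.

```python
def compare_vv(a, b):
    """Compare two version vectors given as equal-length tuples of counters."""
    a_ge = all(x >= y for x, y in zip(a, b))
    b_ge = all(y >= x for x, y in zip(a, b))
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "dominates"
    if b_ge:
        return "dominated"
    return "inconsistent"


def weakly_equal(replicas):
    """replicas: list of (version_vector, storeid) pairs, one per server.

    Weak equality here means the vectors differ but every replica carries
    the same store id, i.e. the same last update reached all of them.
    """
    vvs = {vv for vv, _ in replicas}
    storeids = {sid for _, sid in replicas}
    return len(vvs) > 1 and len(storeids) == 1


diverged = compare_vv((2, 1), (1, 2))
same_update = weakly_equal([((2, 1), "s9"), ((1, 2), "s9")])
```

In the weakly equal case the data already agrees, so only the vectors need to be brought back in line.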
RecovDirResolve
- Phase II: (rvmres/{rescoord, subphase?}.cc)
  - coordinator requests logs from other servers
  - subordinates lock affected dirs, marshall logs
  - coordinator merges logs
- Phase III:
  - ship merged log to subordinates
  - perform operations on VM copies
  - return results to coordinator
Resolution
- Phase IV: (is old Phase 3 …)
  - collect results, compute new VVs, ship to subordinates
  - commit results
Comments on resolution
- Old versions of resolution:
  - OldDirResolve: resolve only runts and weak equality
  - DirResolve: resolve only in VM
  - Remove these
- resolve directory has nothing to do with resolution: should be called librepair. Srv uses merely one function in it; repair uses the rest
Volume Log
- During FS operations, log entries are created for use during resolution
- Different format per operation (rvmres/recov_vollog.cc)
- Added to the vlist by SpoolVMLogRecord
- Put in RVM at commit time
Repair
- Venus makes ViceRepair RPC.
  - File and symlink repair: BulkTransfer the object
  - Directory repair: BulkTransfer the repair file and replay operations
  - Venus follows this with a COP2 multi RPC
  - For directory repair Venus invokes asynchronous resolve
Future
- Good:
  - Design is simple and efficient
  - There is little C++: should eliminate
  - easy to multi-thread
- Bad:
  - Scalability ~8 GB in practice, ~40 GB in theory
  - Data handling is bad: tricky to fix
  - Volume code was & is worst: rewrite