- Number of slides: 47
Scaling Up Data Intensive Science to Campus Grids Douglas Thain Clemson University 25 September 2009 1
http://www.cse.nd.edu/~ccl 2
The Cooperative Computing Lab We collaborate with people who have large scale computing problems. We build new software and systems to help them achieve meaningful goals. We run a production computing system used by people at ND and elsewhere. We conduct computer science research, informed by real world experience, with an impact upon problems that matter. 3
What is a Campus Grid? A campus grid is an aggregation of all available computing power found in an institution: – Idle cycles from desktop machines. – Unused cycles from dedicated clusters. Examples of campus grids: – 700 CPUs at the University of Notre Dame – 9,000-11,000 CPUs at Clemson University – 20,000 CPUs at Purdue University 4
Provides robust batch queueing on a complex distributed system. Resource owners control consumption: – “Only run jobs on this machine at night.” – “Prefer biology jobs over physics jobs.” End users express needs: – “Only run this job where RAM>2 GB” – “Prefer to run on machines …” http://www.cs.wisc.edu/condor 5
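The two-sided policy on this slide (owners restrict, users require and prefer) can be pictured with a toy matchmaker. This is only a sketch: it is plain Python, not the syntax or API of the system at the URL above, and the machine names, the `hour_ok` field, and the "more RAM is better" preference are all invented for illustration.

```python
# Toy matchmaking sketch; NOT the real system's policy language or API.
# Machine names, fields, and the rank function are hypothetical.
machines = [
    {"name": "desktop1", "ram_gb": 1, "hour_ok": lambda hour: hour >= 20 or hour < 6},
    {"name": "cluster7", "ram_gb": 4, "hour_ok": lambda hour: True},
    {"name": "cluster9", "ram_gb": 8, "hour_ok": lambda hour: True},
]

def match(job, machines, hour):
    """Return the machines acceptable to both sides, best-ranked first."""
    candidates = [
        m for m in machines
        if m["hour_ok"](hour)        # owner policy: "only run jobs at night"
        and job["requirements"](m)   # user need: a hard constraint
    ]
    return sorted(candidates, key=job["rank"], reverse=True)

job = {
    "requirements": lambda m: m["ram_gb"] > 2,  # "Only run this job where RAM>2 GB"
    "rank": lambda m: m["ram_gb"],              # hypothetical preference: more RAM first
}

print([m["name"] for m in match(job, machines, hour=14)])
```

The point of the sketch is that requirements filter and preferences only order; neither side needs to know the other's policy in advance.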
6
7
8
9
10
Clusters, clouds, and grids give us access to unlimited CPUs. How do we write programs that can run effectively in large systems? 11
Example: Biometrics Research Goal: Design robust face comparison function. [Figure: F applied to two image pairs, producing scores of 0.97 and 0.05.] 12
Similarity Matrix Construction [Figure: similarity matrix with 1.0 on the diagonal and pairwise scores such as 0.8, 0.3, 0.1, and 0.0 off the diagonal.] Challenge Workload: 60,000 iris images, 1 MB each, 0.02 s per F, 833 CPU-days, 600 TB of I/O 13
I have 60,000 iris images acquired in my research lab. I want to reduce each one to a feature space, and then compare all of them to each other. I want to spend my time doing science, not struggling with computers. I own a few machines. I can buy time from Amazon or TeraGrid. I have a laptop. Now What? 14
Non-Expert User Using 500 CPUs Try 1: Each F is a batch job. Failure: Dispatch latency >> F runtime. Try 2: Each row is a batch job. Failure: Too many small ops on FS. Try 3: Bundle all files into one package. Failure: Everyone loads 1 GB at once. Try 4: User gives up and attempts to solve an easier or smaller problem. 15
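The 833 CPU-day figure for the challenge workload can be checked with a few lines of arithmetic. The sketch below assumes the full 60,000 x 60,000 matrix is computed (not just one triangle) and takes the per-comparison cost as 0.02 s; both are readings of the slide, not statements from the authors.

```python
# Back-of-the-envelope check of the challenge workload.
# Assumption: every cell of the full square matrix is computed.
images = 60_000
seconds_per_f = 0.02                      # cost of one comparison F

comparisons = images * images             # 3.6 billion calls to F
cpu_seconds = comparisons * seconds_per_f
cpu_days = cpu_seconds / 86_400           # seconds in a day

print(f"{comparisons:,} comparisons, about {cpu_days:.0f} CPU-days")
```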
Observation In a given field of study, many people repeat the same pattern of work many times, making slight changes to the data and algorithms. If the system knows the overall pattern in advance, then it can do a better job of executing it reliably and efficiently. If the user knows in advance what patterns are allowed, then they have a better idea of how to construct their workloads. 16
Abstractions for Distributed Computing Abstraction: a declarative specification of the computation and data of a workload. A restricted pattern, not meant to be a general purpose programming language. Uses data structures instead of files. Provide users with a bright path. Regular structure makes it tractable to model and predict performance. 17
Working with Abstractions [Figure: sets A1 … An and B1 … Bn plus function F → AllPairs( A, B, F ) → Compact Data Structure → Custom Workflow Engine → Cloud or Grid] 18
All-Pairs Abstraction AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i, j. Invoked as: allpairs A B F.exe 19
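As a reference for what the abstraction computes, here is a minimal sequential sketch in Python. It only mirrors the definition M[i][j] = F( A[i], B[j] ); it says nothing about how the production engine distributes the work, and the comparison function `same` is a hypothetical stand-in.

```python
def all_pairs(A, B, F):
    """Return matrix M with M[i][j] = F(A[i], B[j]) for every i, j."""
    return [[F(a, b) for b in B] for a in A]

# Hypothetical stand-in for a real comparison function:
# 1.0 when the two items are equal, 0.0 otherwise.
def same(x, y):
    return 1.0 if x == y else 0.0

M = all_pairs(["a", "b", "c"], ["a", "c"], same)
print(M)
```

Because every cell depends only on its own pair of inputs, the cells can be computed in any order, which is what makes the pattern easy to spread over many machines.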
How Does the Abstraction Help? The custom workflow engine: – Chooses the right data transfer strategy. – Chooses the right number of resources. – Chooses blocking of functions into jobs. – Recovers from a larger number of failures. – Predicts overall runtime accurately. All of these tasks are nearly impossible for arbitrary workloads, but are tractable (not trivial) to solve for a specific abstraction. 20
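One of the choices listed here, blocking of functions into jobs, can be illustrated with a simple heuristic: make each job a square tile of the matrix large enough that its runtime dwarfs the dispatch latency. The slides do not describe the engine's actual strategy, so the sketch below is an assumption throughout; the 600-second target and the function names are hypothetical.

```python
import math

def block_side(seconds_per_f, target_job_seconds):
    """Smallest square tile side whose F calls fill the target job runtime."""
    return max(1, math.ceil(math.sqrt(target_job_seconds / seconds_per_f)))

def blocks(n_rows, n_cols, side):
    """Yield (row_range, col_range) tiles that together cover the matrix."""
    for r in range(0, n_rows, side):
        for c in range(0, n_cols, side):
            yield range(r, min(r + side, n_rows)), range(c, min(c + side, n_cols))

# Hypothetical target: jobs of roughly ten minutes at 0.02 s per comparison.
side = block_side(seconds_per_f=0.02, target_job_seconds=600)
jobs = list(blocks(60_000, 60_000, side))
print(side, len(jobs))
```

A square tile of side s needs only 2s input files for s² comparisons, so larger tiles also reduce data transfer per unit of work.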
21
Choose the Right # of CPUs 22
Resources Consumed 23
All-Pairs in Production Our All-Pairs implementation has provided over 57 CPU-years of computation to the ND biometrics research group over the last year. Largest run so far: 58,396 irises from the Face Recognition Grand Challenge. The largest experiment ever run on publicly available data. Competing biometric research relies on samples of 100-1000 images, which can miss important population effects. Reduced computation time from 833 days to 10 days, making it feasible to repeat multiple times for a graduate thesis. (We can go faster yet.) 24
25
26
Are there other abstractions? 27
All-Pairs Abstraction AllPairs( set A, set B, function F ) returns matrix M where M[i][j] = F( A[i], B[j] ) for all i, j. Invoked as: allpairs A B F.exe 28
Wavefront( matrix M, function F(x, y, d) ) returns matrix M such that M[i, j] = F( M[i-1, j], M[i, j-1], M[i-1, j-1] ) [Figure: 5×5 example in which the boundary cells M[0,0] through M[4,0] and M[0,1] through M[0,4] are given, and each F combines its neighbors x and y and the diagonal d, sweeping toward M[4,4].] 29
Some-Pairs Abstraction SomePairs( set A, list L of (i, j), function F(x, y) ) returns list of F( A[i], A[j] ). Example pair list: (1, 2) (2, 1) (2, 3) (3, 3) 30
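The recurrence on this slide can be written as a short sequential sketch. It assumes a square matrix whose first row and first column are already filled in, as the figure suggests; the function `total` is a hypothetical example of F, and the loop order is just one valid order, not the engine's schedule.

```python
def wavefront(M, F):
    """Fill M in place from its first row and column, then return it."""
    n = len(M)
    for i in range(1, n):
        for j in range(1, n):
            # x = M[i-1][j], y = M[i][j-1], d = M[i-1][j-1], as on the slide
            M[i][j] = F(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return M

# Hypothetical F: the sum of the three neighbors.
def total(x, y, d):
    return x + y + d

grid = [[1, 1, 1],
        [1, 0, 0],
        [1, 0, 0]]
print(wavefront(grid, total))
```

Cells on the same anti-diagonal do not depend on each other, so a parallel engine can compute each anti-diagonal as a batch once the previous ones are done.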
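Some-Pairs is the sparse counterpart of All-Pairs: F runs only on the listed index pairs. A minimal sketch follows; note that it uses Python's 0-based indexes, whereas the pairs drawn on the slide appear to be 1-based, and the function `diff` is a hypothetical F.

```python
def some_pairs(A, L, F):
    """Return [F(A[i], A[j]) for each (i, j) in L], in list order."""
    return [F(A[i], A[j]) for i, j in L]

# Hypothetical F: signed difference of two values.
def diff(x, y):
    return x - y

A = [10, 20, 30]
L = [(0, 1), (1, 0), (1, 2), (2, 2)]   # the slide's pairs, shifted to 0-based
print(some_pairs(A, L, diff))
```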
What if your application doesn’t fit a regular pattern? 31
Makeflow
part1 part2 part3: input.data split.py
    ./split.py input.data
out1: part1 mysim.exe
    ./mysim.exe part1 > out1
out2: part2 mysim.exe
    ./mysim.exe part2 > out2
out3: part3 mysim.exe
    ./mysim.exe part3 > out3
result: out1 out2 out3 join.py
    ./join.py out1 out2 out3 > result
32
Makeflow Implementation Example rule: bfile: afile prog, with command prog afile > bfile. The makeflow master queues tasks to a work queue and collects completed tasks from 100s of workers dispatched to the cloud. Detail of a single worker: put prog, put afile, exec prog afile > bfile, get bfile. Two optimizations: Cache inputs and outputs. Dispatch tasks to nodes with data. 33
Experience with Makeflow Reusing a good old idea in a new way. Easy to test and debug on a desktop machine or a multicore server. The workload says nothing about the distributed system. (This is good.) Graduate students in bioinformatics running codes at production speeds on hundreds of nodes in less than a week. A student from Clemson got a complex biometrics workload running in a few weeks. 34
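The rules on this slide form a dependency graph: the split must finish before any simulation, and all three simulations must finish before the join. The sketch below shows how such rules can be grouped into waves of independent commands using Python's standard `graphlib` module (Python 3.9+). It is an illustration of the idea only, not Makeflow's implementation; the `waves` function and the tuple layout are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# (targets, sources, command), transcribed from the slide's rules.
rules = [
    (["part1", "part2", "part3"], ["input.data", "split.py"], "./split.py input.data"),
    (["out1"], ["part1", "mysim.exe"], "./mysim.exe part1 > out1"),
    (["out2"], ["part2", "mysim.exe"], "./mysim.exe part2 > out2"),
    (["out3"], ["part3", "mysim.exe"], "./mysim.exe part3 > out3"),
    (["result"], ["out1", "out2", "out3", "join.py"], "./join.py out1 out2 out3 > result"),
]

def waves(rules):
    """Group commands into waves; commands within a wave are independent."""
    # Map each produced file to the index of the rule that creates it.
    producer = {t: idx for idx, (targets, _, _) in enumerate(rules) for t in targets}
    # A rule depends on the rules that produce its sources (if any rule does).
    graph = {
        idx: {producer[s] for s in sources if s in producer}
        for idx, (_, sources, _) in enumerate(rules)
    }
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append([rules[i][2] for i in ready])
        sorter.done(*ready)
    return result

for number, wave in enumerate(waves(rules), 1):
    print(number, wave)
```

Nothing in the rules mentions machines or queues, so the same description can run on a laptop or be farmed out to many workers.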
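The two optimizations named on this slide can be illustrated with a small simulation. This is a toy model under stated assumptions, not the real master/worker protocol: the `Worker` class, `pick_worker`, and the second task producing `cfile` are all invented for the example, and no real file transfer or execution takes place.

```python
class Worker:
    """Simulated worker that remembers which files it already holds."""

    def __init__(self):
        self.cache = set()
        self.transfers = 0

    def put(self, name):
        # Optimization 1: skip the transfer when the file is already cached.
        if name not in self.cache:
            self.cache.add(name)
            self.transfers += 1

    def run(self, inputs, output):
        for name in inputs:
            self.put(name)
        self.cache.add(output)   # the output stays cached for later tasks
        return output            # stands in for "get bfile"

def pick_worker(workers, inputs):
    """Optimization 2: prefer the worker already holding the most inputs."""
    return max(workers, key=lambda w: len(w.cache & set(inputs)))

workers = [Worker(), Worker()]
# First task is the slide's example; the second (cfile) is hypothetical.
tasks = [(["prog", "afile"], "bfile"), (["prog", "bfile"], "cfile")]
for inputs, output in tasks:
    pick_worker(workers, inputs).run(inputs, output)

print([w.transfers for w in workers])
```

In this run the second task lands on the worker that already holds both of its inputs, so it costs no transfers at all.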
Putting it All Together [Figure: Web Portal, Data Repository, Abstraction (F applied to X, Y, Z), and Campus Grid.] 35
BXGrid Schema Scientific Metadata: table with columns Type, Subject, Eye, Color, FileID (for example Iris, S203, Left, Brown, 24305). General Metadata: fileid = 24305, size = 300K, type = jpg, sum = abc123… Immutable Replicas: replicaid=423 state=ok; replicaid=105 state=ok; replicaid=293 state=creating; replicaid=102 state=deleting. 36
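The three layers on this slide (scientific metadata, general file metadata, replicas) can be pictured as linked records. The sketch below covers the last two using the field names visible on the slide; it is not the actual BXGrid schema. In particular, the link from each replica to a `fileid`, and the assignment of all four replicas to file 24305, are assumptions made only so the example runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileRecord:            # "General Metadata" on the slide
    fileid: int
    size: str
    type: str
    sum: str                 # checksum, left truncated as on the slide

@dataclass(frozen=True)
class Replica:               # "Immutable Replicas" on the slide
    replicaid: int
    fileid: int              # assumption: each replica refers to one file
    state: str               # states seen on the slide: ok, creating, deleting

def usable(replicas, fileid):
    """Return ids of replicas of the given file that are in state 'ok'."""
    return [r.replicaid for r in replicas if r.fileid == fileid and r.state == "ok"]

record = FileRecord(fileid=24305, size="300K", type="jpg", sum="abc123…")
# Hypothetical assignment of the slide's four replicas to this one file.
replicas = [
    Replica(423, 24305, "ok"),
    Replica(105, 24305, "ok"),
    Replica(293, 24305, "creating"),
    Replica(102, 24305, "deleting"),
]
print(usable(replicas, record.fileid))
```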
37
38
39
Results from Campus Grid 40
Biocompute 41
42
Parallel BLAST Makeflow 43
44
Abstractions as a Social Tool Collaboration with outside groups is how we encounter the most interesting, challenging, and important problems in computer science. However, often neither side understands which details are essential or non-essential: – Can you deal with files that have upper case letters? – Oh, by the way, we have 10 TB of input, is that ok? – (A little bit of an exaggeration.) An abstraction is an excellent chalkboard tool: – Accessible to anyone with a little bit of mathematics. – Makes it easy to see what must be plugged in. – Forces out essential details: data size, execution time. 45
Conclusion Grids, clouds, and clusters provide enormous computing power, but are very challenging to use effectively. An abstraction provides a robust, scalable solution to a narrow category of problems; each requires different kinds of optimizations. Limiting expressive power results in systems that are usable, predictable, and reliable. Portal + Repository + Abstraction + Grid = New Science Capabilities 46
Acknowledgments Cooperative Computing Lab – http://www.cse.nd.edu/~ccl Faculty: – Patrick Flynn – Nitesh Chawla – Kenneth Judd – Scott Emrich Grad Students: – Chris Moretti – Hoang Bui – Li Yu – Mike Olson – Michael Albrecht Undergrads: – Mike Kelly – Rory Carmichael – Mark Pasquier – Christopher Lyon – Jared Bulosan – Kameron Srimoungach – Rachel Witty – Ryan Jansen – Joey Rich NSF Grants CCF-0621434 and CNS-0643229 47


