- Number of slides: 40
PADS: A Domain-Specific Language for Processing Ad Hoc Data
Kathleen Fisher, AT&T Labs Research
Robert Gruber, Google

Data, data, everywhere!
Incredible amounts of data stored in well-behaved formats:
• Databases
• XML
Tools:
• Schema
• Browsers
• Query languages
• Standards
• Libraries
• Books, documentation
• Conversion tools
• Vendor support
• Consultants…
Summer School 2005

… but not all data is well-behaved!
Vast amounts of chaotic ad hoc data:
Tools:
• Perl?
• Awk?
• C?

Ad hoc data from www.geneontology.org
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673

Ad hoc in biology: Newick format
((raccoon:19.19959, bear:6.80041):0.84600, ((sea_lion:11.99700, seal:12.00300):7.52973, ((monkey:100.85930, cat:47.14069):20.59201, weasel:18.87953):2.09460):3.87382, dog:25.46154);
(Bovine:0.69395, (Gibbon:0.36079, (Orang:0.33636, (Gorilla:0.17147, (Chimp:0.19268, Human:0.11927):0.08386):0.06124):0.15057):0.54939, Mouse:1.21460):0.10;
(Bovine:0.69395, (Hylobates:0.36079, (Pongo:0.33636, (G._Gorilla:0.17147, (P._paniscus:0.19268, H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939, Rodent:1.21460);

Ad hoc data in chemistry
O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@](CO4)(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H](OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)C5=CC=CC=C5)=O)C1

Ad hoc data from www.investors.com
Date: 3/21/2005 1:00 PM PACIFIC
Investor's Business Daily ® Stock List
Name: DAVE

Stock Symbol  Company Name            Price  Price Change  % Change  Volume % Change  EPS Rating  RS Rating
AET           Aetna Inc               73.68  -0.22         0%        31%              64          93
GE            General Electric Co     36.01  0.13          0%        -8%              59          56
HD            Home Depot Inc          37.99  -0.89         -2%       63%              84          38
IBM           Intl Business Machines  89.51  0.23          0%        -13%             66          35
INTC          Intel Corp              23.50  0.09          0%        -47%             39          33

Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.
Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.
Reproduction or redistribution other than for personal use is prohibited.
All prices are delayed at least 20 minutes.

Ad hoc binary data: DNS packets
00000000: 9192 d8fb 8480 0001 05d8 0000 0872
00000010: 6573 6561 7263 6803 6174 7403 636f 6d00
00000020: 00fc 0001 c 0006 0001 0000 0e10 0027
00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465
00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400
00000050: 36ee 8000 000e 10 c 00 0f00 0100 000e
00000060: 1000 0a 05 6c 69 6e 75 78 c0 0c 00
00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c
00000080: 6d61 6e c0 0c 00 0100 000e 1000
00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000
000000a0: 0603 6e73 30 c 00 0200 0100 000e
000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100
000000c0: 0002 5800 1d00 0000 640c c404 7068 7973
000000d0: 0872 6573 6561 7263 6803 6174 7403 636f
ASCII column: research.att.com … ns1 … hostmaster … wd.I … linux … mailman … ns0 … _gc … phys.research.att.co

Ad hoc data from AT&T

Name & Use                                   Representation                                        Size
Web server logs (CLF): Measure web workloads  Fixed-column ASCII records                            12 GB/week
Sirius data: Monitor service activation       Variable-width ASCII records                          2.2 GB/week
Call detail: Detect fraud                     Fixed-width binary records                            ~7 GB/day
Altair data: Track billing process            Various Cobol data formats                            ~4000 files/day
Regulus data: Monitor IP network              ASCII                                                 15 sources, ~15 GB/day
Netflow: Monitor IP network                   Data-dependent number of fixed-width binary records   >1 Gigabit/second

Technical challenges
• Data arrives “as is.”
• Documentation is often out-of-date or nonexistent.
– Hijacked fields.
– Undocumented “missing value” representations.
• Data is buggy.
– Missing data, human error, malfunctioning machines, race conditions on log entries, “extra” data, …
– Processing must detect relevant errors and respond in application-specific ways.
– Errors are sometimes the most interesting portion of the data.
• Data sources often have high volume.
– Data may not fit into main memory.

Your Turn: Alarm Data for Network
• Alarm type: RTE or CPU
• Severity: an integer between 1 and 4, inclusive
• Optional timestamp: DD/MM/YYYY:HH:MM:SS TZONE
• IP address of node exhibiting problem
• Payload
– if RTE alarm:
  • number of nodes in path
  • list of IP addresses in path
– if CPU, an integer identifying the relevant CPU
• Operator comment
RTE 2|12/Dec/2003:23:10:29 -0700|123.45.65.233__2:240.230.155.13;240.120.11.13|intermittent
CPU 4||200.42.110__1346575|unusual workload

Existing approaches
• Lex/Yacc
– No one we have encountered uses them for ad hoc data.
• Perl/C
– Code brittle with respect to changes in input format.
– Analysis often ends up interwoven with parsing, precluding reuse.
– Error code, if written, swamps main-line computation. If not written, errors can corrupt “good” data.
– Everything has to be coded by hand.
• Data description languages (PacketTypes, DataScript)
– Binary data
– Focus on correct data.

Our approach: PADS
Data expert writes declarative description of data source:
– Physical format information
– Semantic constraints
Many data consumers use description and generated parser.
• Description serves as living documentation.
• Parser exhaustively detects errors without cluttering user code.
• From declarative specification, we generate auxiliary tools.

PADS architecture

PADS language
Type-based model: types indicate how to process associated data.
• Provides rich and extensible set of base types.
– Pint8, Puint8, …  // -123, 44
– Pstring(:'|':)  // hello |
  Pstring_FW(:3:)  // catdog
  Pstring_ME(:"/a*/":)  // aaaaaab
– Pdate, Ptime, Pip, …
• Provides type constructors to describe data source structure:
– Pstruct, Parray, Punion, Ptypedef, Penum
• Allows arbitrary predicates to describe expected properties.

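As a rough illustration of what the three string base types consume, the Python sketch below mimics their matching behavior on the example inputs from the slide. The function names are hypothetical and are not part of PADS (whose base types are C library routines), and the error handling shown here is an assumption.

```python
import re

def read_terminated(data: str, pos: int, term: str):
    """Mimic Pstring(:term:): read up to, but not including, the terminator."""
    end = data.find(term, pos)
    if end < 0:
        return None, pos              # assumed behavior: no terminator is an error
    return data[pos:end], end

def read_fixed_width(data: str, pos: int, width: int):
    """Mimic Pstring_FW(:width:): read exactly `width` characters."""
    if pos + width > len(data):
        return None, pos              # not enough input left
    return data[pos:pos + width], pos + width

def read_regex(data: str, pos: int, pattern: str):
    """Mimic Pstring_ME(:"/pattern/":): read the prefix matched by the regex."""
    match = re.compile(pattern).match(data, pos)
    if match is None:
        return None, pos
    return match.group(0), match.end()

print(read_terminated("hello|", 0, "|"))
print(read_fixed_width("catdog", 0, 3))
print(read_regex("aaaaaab", 0, "a*"))
```

Each helper returns the value read together with the new input position, so calls can be chained the way fields are sequenced in a description.
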
Running example: CLF web log
• Common Log Format from Web Protocols and Practice.
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Fields:
– IP address of remote host
– Remote identity (usually ‘-’ to indicate name not collected)
– Authenticated user (usually ‘-’ to indicate name not collected)
– Time associated with request
– Request (request method, request-uri, and protocol version)
– Response code
– Content length

Example: Pstruct
Precord Pstruct http_weblog {
  host client;                      /- Client requesting service
  ' '; auth_id remoteID;            /- Remote identity
  ' '; auth_id auth;                /- Name of authenticated user
  " ["; Pdate(:']':) date;          /- Timestamp of request
  "] "; http_request request;       /- Request
  ' '; Puint16_FW(:3:) response;    /- 3-digit response code
  ' '; Puint32 contentLength;       /- Bytes in response
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

User constraints
int chkVersion(http_v version, method_t meth) { … };

Pstruct http_request {
  '"'; method_t meth;
  ' '; Pstring(:' ':) req_uri;
  ' '; http_v version : chkVersion(version, meth);
  '"';
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

Example: Punion
Punion id {
  Pchar unavailable : unavailable == '-';
  Pstring(:' ':) id;
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Union declarations allow the user to describe variations.
• Implementation tries branches in order.
• Stops when it finds a branch whose constraints are all true.

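The ordered-choice behavior described above can be sketched in Python as follows. This is a minimal illustration, not the generated PADS parser: `parse_auth_id` and the branch lambdas are hypothetical, and the field is assumed to have already been split out of the record.

```python
def parse_auth_id(field: str):
    """Try union branches in order; return (branch_name, value) for the first success."""
    branches = [
        # Pchar constrained to be '-'
        ("unavailable", lambda s: s if s == "-" else None),
        # Non-empty string containing no space (stands in for a space-terminated Pstring)
        ("id", lambda s: s if s and " " not in s else None),
    ]
    for name, attempt in branches:
        value = attempt(field)
        if value is not None:
            return name, value        # stop at the first branch that succeeds
    return "error", None              # no branch matched

print(parse_auth_id("-"))
print(parse_auth_id("kfisher"))
```

Because branches are tried in order, placing the more specific branch first matters: a lone `-` would also satisfy the string branch if the order were reversed.
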
Example: Switched Punion
Punion info_t (:code_t code:) {
  Pswitch (code) {
    Pcase RTE : rte_t rte;
    Pcase CPU : Puint32 cpu;
  }
};
RTE 2|12/Dec/2003:23:10:29 -0700|123.45.65.233__2:240.230.155.13;240.120.11.13|intermittent
CPU 4||200.42.110__1346575|unusual workload
• Switched union uses expression to choose correct branch.

Example: Popt
Popt Puint32 Puint32_opt;

Pstruct alarm_t {
  …
  Puint32_opt length;
  Popt Puint32 length2;
  …
};
• Popt expresses optional data
• Can be used in-line as length2 field shows
• Encodable as a Punion

Example: Parray
Parray host {
  Puint8[4] : Psep('.');
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Array declarations allow the user to specify:
• Size (fixed, lower-bounded, upper-bounded, unbounded)
• Psep, Pterm, and termination predicates
• Constraints over sequence of array elements
Array terminates upon reaching EOF, reaching terminator, reaching maximum size, or satisfying termination predicate.

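To make the fixed-size, separator-delimited array concrete, here is a small Python sketch of reading four unsigned 8-bit values separated by `.`. The `parse_host` function and its error messages are hypothetical; the sketch collects errors instead of raising, loosely in the spirit of a parser that reports problems without aborting.

```python
def parse_host(text: str, sep: str = ".", size: int = 4):
    """Parse `size` unsigned 8-bit integers separated by `sep`.

    Returns (values, errors); a bad element is stored as None.
    """
    parts = text.split(sep)
    errors = []
    if len(parts) != size:
        errors.append(f"expected {size} elements, found {len(parts)}")
    values = []
    for part in parts:
        if part.isascii() and part.isdigit() and int(part) <= 255:
            values.append(int(part))
        else:
            errors.append(f"bad element {part!r}")
            values.append(None)
    return values, errors

print(parse_host("207.136.97.50"))
```
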
Example: Penum
Penum method_t {
  GET, PUT, POST, HEAD, DELETE,
  LINK,     /- Unused after HTTP 1.0
  UNLINK    /- Unused after HTTP 1.0
};
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Enumerations are strings on disk, 32-bit integers in memory.

Example: Ptypedef
Ptypedef Puint16_FW(:3:) response_t :
  response_t x => { 100 <= x && x < 600 };
207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Typedefs allow the user to add constraints to existing types.

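A constrained typedef layers a semantic check on top of a syntactic one. The Python sketch below separates the two kinds of failure for the 3-digit response code; `read_response` and the error strings are hypothetical names chosen for illustration, not PADS identifiers.

```python
def read_response(field: str):
    """Read a 3-digit unsigned integer, then apply the constraint 100 <= x < 600.

    Returns (value, error) where error is None on success.
    """
    # Syntactic check: exactly three ASCII digits (the fixed-width base type).
    if len(field) != 3 or not (field.isascii() and field.isdigit()):
        return None, "syntax error"
    code = int(field)
    # Semantic check: the predicate attached by the typedef.
    if not (100 <= code < 600):
        return code, "constraint violation"
    return code, None

print(read_response("200"))
print(read_response("999"))
```

Keeping the parsed value even when the predicate fails lets an application decide for itself how to treat out-of-range codes.
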
Example: Pcompute and Pomit
Pstruct log_t {
  Puint32 pn;
  ' '; Pcompute Puint8 isIntl = isIntl(pn);
  ' '; Pomit Pstring(:'|':);
  …
};
• Pcompute allows the user to set part of the internal representation from a computed value, which can depend upon “earlier” data.
• Pomit excludes the annotated data from the representation.

CLF in PADS
Parray Phostname {
  Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
};

Punion client_t {
  Pip ip;            /- 135.207.23.32
  Phostname host;    /- www.research.att.com
};

Punion auth_id_t {
  Pchar unauthorized : unauthorized == '-';
  Pstring(:' ':) id;
};

Pstruct version_t {
  "HTTP/";
  Puint8 major; '.';
  Puint8 minor;
};

Penum method_t {
  GET, PUT, POST, HEAD, DELETE, LINK, UNLINK
};

int chkVersion(version_t v, method_t m) {
  if ((v.major == 1) && (v.minor == 1)) return 1;
  if ((m == LINK) || (m == UNLINK)) return 0;
  return 1;
};

Pstruct request_t {
  '"'; method_t meth;
  ' '; Pstring(:' ':) req_uri;
  ' '; version_t version : chkVersion(version, meth);
  '"';
};

Ptypedef Puint16_FW(:3:) response_t :
  response_t x => { 100 <= x && x < 600 };

Punion length_t {
  Pchar unavailable : unavailable == '-';
  Puint32 len;
};

Precord Pstruct entry_t {
        client_t      client;
  ' ';  auth_id_t     remoteID;
  ' ';  auth_id_t     auth;
  " ["; Pdate(:']':)  date;
  "] "; request_t     request;
  ' ';  response_t    response;
  ' ';  length_t      length;
};

Psource Parray clt_t {
  entry_t [];
}

Your Turn (Part 2)
• Alarm type: RTE or CPU
• Severity: an integer between 1 and 4, inclusive
• Optional timestamp: DD/MM/YYYY:HH:MM:SS TZONE
• IP address of node exhibiting problem
• Payload
– if RTE alarm:
  • number of nodes in path
  • list of IP addresses in path
– if CPU, an integer identifying the relevant CPU
• Operator comment
RTE 2|12/Dec/2003:23:10:29 -0700|123.45.65.233__2:240.230.155.13;240.120.11.13|intermittent
CPU 4||200.42.110__1346575|unusual workload

Your turn! (One answer)
Penum code_t { RTE, CPU };

Pstruct rte_t {
  Puint32 num; ':';
  Pip[num] route : Psep(';');
};

Punion info_t (:code_t code:) {
  Pswitch (code) {
    Pcase RTE : rte_t rte;
    Pcase CPU : Puint32 cpu;
  }
};

Precord Pstruct alarms_t {
  code_t code; ' ';
  Puint32 severity : 1 <= severity && severity <= 4; '|';
  Popt Pdate(:'|':) timestamp; '|';
  Pip addr; "__";
  info_t(:code:) info; '|';
  Pstring(:'\n':) comment;
};

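For comparison, the same alarm format can be handled by a hand-written parser. The Python sketch below follows the field list from the exercise and is tested only against the two sample records; `parse_alarm` and the dictionary keys are hypothetical, and the IP addresses are deliberately left as unvalidated strings.

```python
def parse_alarm(line: str) -> dict:
    """Parse one alarm record of the form
    CODE SEVERITY|[TIMESTAMP]|ADDR__PAYLOAD|COMMENT
    """
    head, timestamp, rest, comment = line.rstrip("\n").split("|", 3)
    code, severity_text = head.split(" ")
    if code not in ("RTE", "CPU"):
        raise ValueError(f"unknown alarm type {code!r}")
    severity = int(severity_text)
    if not 1 <= severity <= 4:
        raise ValueError(f"severity {severity} out of range")
    addr, payload = rest.split("__", 1)
    if code == "RTE":
        # Payload is COUNT:IP;IP;... and the count selects the array length.
        count_text, route_text = payload.split(":", 1)
        info = route_text.split(";")
        if int(count_text) != len(info):
            raise ValueError("route length does not match declared count")
    else:
        # Payload is an integer identifying the CPU.
        info = int(payload)
    return {
        "code": code,
        "severity": severity,
        "timestamp": timestamp or None,   # empty field means no timestamp
        "addr": addr,
        "info": info,
        "comment": comment,
    }

rte = parse_alarm("RTE 2|12/Dec/2003:23:10:29 -0700|123.45.65.233__2:240.230.155.13;240.120.11.13|intermittent")
cpu = parse_alarm("CPU 4||200.42.110__1346575|unusual workload")
print(rte["info"], cpu["info"])
```

Note how the error checks are interleaved with the parsing logic, which is the style of code the declarative description is meant to replace.
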
PADS parsing
Perror_t entry_t_read(P_t *pdc, entry_t_m* mask, entry_t_pd* pd, entry_t* rep);
Invariant: If mask is “check and set” and parse descriptor reports no errors, then the in-memory representation is correct.

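The relationship between mask, parse descriptor, and representation can be sketched in Python for a single field. This is an illustrative model under stated assumptions, not the generated C API: `Mask`, `ParseDescriptor`, and `read_severity` are hypothetical names, and the choice to fill in the representation even when a constraint fails is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from enum import Flag

class Mask(Flag):
    SET = 1             # fill in the in-memory representation
    CHECK = 2           # evaluate semantic constraints
    CHECK_AND_SET = 3   # both

@dataclass
class ParseDescriptor:
    nerr: int = 0
    errors: list = field(default_factory=list)

def read_severity(text: str, mask: Mask):
    """Read an integer severity constrained to 1..4; return (pd, rep)."""
    pd = ParseDescriptor()
    try:
        value = int(text)
    except ValueError:
        pd.nerr += 1
        pd.errors.append("syntax error")
        return pd, None
    if Mask.CHECK in mask and not 1 <= value <= 4:
        pd.nerr += 1
        pd.errors.append("constraint violation")
    rep = value if Mask.SET in mask else None
    return pd, rep

pd, rep = read_severity("3", Mask.CHECK_AND_SET)
print(pd.nerr, rep)
```

With `CHECK_AND_SET`, a descriptor reporting zero errors means the returned value both parsed and satisfied its constraint, which mirrors the invariant stated above.
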
Leverage!
• Convert PADS description into a collection of tools:
– Accumulators
– Histograms
– Clustering tool
– Formatters
– Translator into XML, with corresponding XML Schema.
– XQueries using Galax’s data interface
– …
• Long term goal: Provide a compelling suite of tools to overcome inertia of a new language and system.

Accumulators
• Statistical profile of “leaves” in a data source:

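As a sketch of what a statistical profile for one integer leaf might track, the Python class below keeps good/bad counts, minimum, maximum, average, and the most frequent values. The class name and the particular statistics are assumptions made for illustration; the slide does not list what the PADS accumulators actually report.

```python
from collections import Counter

class IntAccumulator:
    """Running profile for one integer leaf (hypothetical helper)."""

    def __init__(self):
        self.good = 0            # values that parsed without error
        self.bad = 0             # values flagged as erroneous
        self.total = 0
        self.minimum = None
        self.maximum = None
        self.counts = Counter()

    def add(self, value, ok=True):
        if not ok:
            self.bad += 1
            return
        self.good += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self.counts[value] += 1

    def report(self, top=3):
        return {
            "good": self.good,
            "bad": self.bad,
            "min": self.minimum,
            "max": self.maximum,
            "avg": self.total / self.good if self.good else None,
            "top": self.counts.most_common(top),
        }

acc = IntAccumulator()
for length in (3013, 30, 941, 30):
    acc.add(length)
acc.add(None, ok=False)          # one record whose length field failed to parse
print(acc.report())
```
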
Pretty printer
• Customizable program to reformat data:
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
– Normalize time zones
– Normalize delimiters
– Drop unnecessary values
– Filter/repair errors
207.136.97.49|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|30
tj62.aol.com|-|-|10/16/97:21:32:22|POST|/scpt/dd@grp.org/confirm|1|0|200|941
• Users can override pretty printing on a per type basis.
• Used by AT&T’s Regulus project to normalize monitoring data before loading into a relational database.

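The transformation shown above (time zone normalized to UTC, delimiters replaced with `|`, literal punctuation dropped) can be approximated with a short Python script. This is a hand-written sketch for the two sample records, not the PADS-generated formatter; the regular expression and the `normalize` function are assumptions of the example, and month-name parsing relies on an English locale.

```python
import re
from datetime import datetime, timezone

# One CLF record: client, remote id, auth id, [date], "METHOD uri HTTP/x.y", code, length
CLF = re.compile(
    r'(?P<client>\S+) (?P<remote>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<meth>\S+) (?P<uri>\S+) HTTP/(?P<major>\d+)\.(?P<minor>\d+)" '
    r'(?P<response>\d{3}) (?P<length>\d+|-)'
)

def normalize(line: str) -> str:
    """Rewrite one CLF record as pipe-delimited fields with a UTC timestamp."""
    m = CLF.fullmatch(line.strip())
    if m is None:
        raise ValueError("not a well-formed CLF record")
    when = datetime.strptime(m["date"], "%d/%b/%Y:%H:%M:%S %z")
    when = when.astimezone(timezone.utc)
    fields = [
        m["client"], m["remote"], m["auth"],
        when.strftime("%m/%d/%y:%H:%M:%S"),
        m["meth"], m["uri"], m["major"], m["minor"],
        m["response"], m["length"],
    ]
    return "|".join(fields)

print(normalize('207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30'))
```

A malformed record raises `ValueError` here; a filter/repair step would catch that and decide whether to drop or patch the record.
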
Implementation

Why a domain specific language?
• Dramatically shorter code (68 lines versus ~7.9K lines).
• Description is short enough to serve as documentation.
• Safer: error code inserted automatically and completely (we just have to get the compiler right…).
• We can leverage the declarative specification to produce value-added tools.
• Vision: on-line repositories of data use PADS to describe data. Users can download data and generate tool suite as desired.

Observations on the language
• Types are a very useful way of thinking about semi-structured data.
– Recursive types, pointers, and type functions are all in the works.
• Data description needs dependencies rather than nondeterminism:
– Length of an array, branch of a union, semantic checks.
• Record boundaries and literals provide good error recovery.

Future research directions
• Design
– How can we specify error-aware data transformations?
– Can we infer a data transformation between two descriptions?
– How can we express application-specific information?
• Implementation
– How can we generate template programs for arbitrary data source?
– How can we specialize generated libraries to incorporate application-specific information?
– How can we optimize streaming XQueries?
• Theory
– How do we precisely specify the semantics of PADS? (…tomorrow…)
– What is the expressiveness of PADS vs. context free grammars?
• Engineering
– How do we build the system to make it easy to add new base types? New libraries and tools? New language bindings?

The (growing!) PADS team
• Kathleen Fisher (AT&T)
• Robert Gruber (Google)
• Mary Fernandez (AT&T)
• Joel Gottlieb (AT&T)
• Yitzhak Mandelbaum (Princeton)
• Martin Strauss (University of Michigan)
• David Walker (Princeton)
• Xuan Zheng (University of Michigan)

Try it!
• Available for download with a non-commercial use source code license.
• Demo of accumulators, format program, and XML conversion.
• Send us feedback!
www.padsproj.org