- Number of slides: 36
PADS: A System for Managing Ad Hoc Data Kathleen Fisher AT&T Labs Research
Data, data, everywhere! Incredible amounts of data are stored in well-behaved formats: databases and XML. Tools: • Schema • Browsers • Query languages • Standards • Libraries • Books, documentation • Conversion tools • Vendor support • Consultants…
… but not all data is well-behaved! Vast amounts of chaotic ad hoc data. Tools: • Perl? • Awk? • C?
Ad Hoc Data in Genetics (source: www.geneontology.org)
  format-version: 1.0
  date: 11:2005 14:24
  auto-generated-by: DAG-Edit 1.419 rev 3
  default-namespace: gene_ontology
  subsetdef: goslim_goa "GOA and proteome slim"
  [Term]
  id: GO:0000001
  name: mitochondrion inheritance
  namespace: biological_process
  def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
  is_a: GO:0048308 ! organelle inheritance
  is_a: GO:0048311 ! mitochondrion distribution
Ad Hoc Data in Biology: Newick Format
  ((raccoon:19.19959,bear:6.80041):0.84600,((sea_lion:11.99700,seal:12.00300):7.52973,((monkey:100.85930,cat:47.14069):20.59201,weasel:18.87953):2.09460):3.87382,dog:25.46154);
  (Bovine:0.69395,(Gibbon:0.36079,(Orang:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939,Mouse:1.21460):0.10;
  (Bovine:0.69395,(Hylobates:0.36079,(Pongo:0.33636,(G._Gorilla:0.17147,(P._paniscus:0.19268,H._sapiens:0.11927):0.08386):0.06124):0.15057):0.54939,Rodent:1.21460);
Ad Hoc Data in Business (source: www.opradata.com) HA 0000 START OF TEST CYCLE a. A 00000001 BXYZ U 1 AB 0000040000100 B 0000004200 HL 00000002 START OF OPEN INTEREST d 00000003 FZYX G 1 AB 00000300000 HM 00000004 END OF OPEN INTEREST HE 00000005 START OF SUMMARY f 00000006 NYZX B 1 QB 00052000120000070000 B 000050000000520000 00490000005100+00000100 B 00000005300000052500000535000 HF 00000007 END OF SUMMARY k 00000008 LYXW B 1 KB 0000065 G 0000009900100000001000020000 HB 00000009 END OF TEST CYCLE
Ad Hoc Data in Finance (source: www.investors.com)
  Date: 3/21/2005 1:00 PM PACIFIC
  Investor's Business Daily ® Stock List Name: DAVE
  Stock Symbol | Company Name           | Price | Price Change | Price % Change | Volume % Change | EPS Rating | RS Rating
  AET          | Aetna Inc              | 73.68 | -0.22        | 0%             | 31%             | 64         | 93
  GE           | General Electric Co    | 36.01 | 0.13         | 0%             | -8%             | 59         | 56
  HD           | Home Depot Inc         | 37.99 | -0.89        | -2%            | 63%             | 84         | 38
  IBM          | Intl Business Machines | 89.51 | 0.23         | 0%             | -13%            | 66         | 35
  INTC         | Intel Corp             | 23.50 | 0.09         | 0%             | -47%            | 39         | 33
  Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved. Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc. Reproduction or redistribution other than for personal use is prohibited. All prices are delayed at least 20 minutes.
Ad Hoc Binary Data: DNS packets 0000: 9192 d 8 fb 8480 0001 05 d 8 0000 0872 00000010: 6573 6561 7263 6803 6174 7403 636 f 6 d 00 00000020: 00 fc 0001 c 0006 0001 0000 0 e 10 0027 00000030: 036 e 7331 c 00 c 0 a 68 6 f 73 746 d 6173 7465 00000040: 72 c 0 0 c 77 64 e 5 4900 000 e 1000 0003 8400 00000050: 36 ee 8000 000 e 10 c 00 0 f 00 0100 000 e 00000060: 1000 0 a 05 6 c 69 6 e 75 78 c 0 0 c 00 00000070: 0 f 00 0100 000 e 1000 0 c 00 0 a 07 6 d 61 696 c 00000080: 6 d 61 6 ec 0 0 c 00 0100 000 e 1000 00000090: 0487 cf 1 a 16 c 0 0 c 00 0200 0100 000 e 1000 000000 a 0: 0603 6 e 73 30 c 00 0200 0100 000 e 000000 b 0: 1000 02 c 0 2 e 03 5 f 67 63 c 0 0 c 00 2100 000000 c 0: 0002 5800 1 d 00 0000 640 c c 404 7068 7973 000000 d 0: 0872 6573 6561 7263 6803 6174 7403 636 f AT&T . . . . r esearch. att. com. . . . '. ns 1. . . hostmaste r. . wd. I. . 6. . . . . linux. . . . mail man. . . . ns 0. . . . _gc. . . !. . . X. . . d. . . phys. research. att. co
Ad Hoc Data from AT&T
  Name & Use                                   | Representation                                      | Size
  Web server logs (CLF): Measure web workloads | Fixed-column ASCII records                          | 12 GB/week
  Sirius data: Monitor service activation      | Variable-width ASCII records                        | 2.2 GB/week
  Call detail: Detect fraud                    | Fixed-width binary records                          | ~7 GB/day
  Altair data: Track billing process           | Various Cobol data formats                          | ~4000 files/day
  Regulus data: Monitor IP network             | ASCII                                               | 15 sources, ~15 GB/day
  Netflow: Monitor IP network                  | Data-dependent number of fixed-width binary records | >1 Gigabit/second
Technical Challenges of Ad Hoc Data • Data arrives “as is.” • Documentation is often out-of-date or nonexistent. – Hijacked fields. – Undocumented “missing value” representations. • Data is buggy. – Missing data, human error, malfunctioning machines, race conditions on log entries, “extra” data, … – Processing must detect relevant errors and respond in application-specific ways. – Errors are sometimes the most interesting portion of the data. • Data sources often have high volume. – Data may not fit into main memory.
Existing Approaches • Lex/Yacc – Both over- and under-kill. • Perl/C – Code is brittle with respect to changes in input format, e.g. qr/^(\d+)\|(?:[^|]*\|){12}(?:[^|]*\|)*$STATE\|/; – If written, error-detection code swamps main-line computation. If not written, errors can corrupt “good” data. – Everything has to be coded by hand. – Analysis often ends up interwoven with parsing. • Data description languages (PacketTypes, DataScript) – Binary data only. – Focus on correct data.
Our Approach: PADS (PLDI 2005). A data expert writes a declarative description of the data source: – Physical format information – Semantic constraints. Many data consumers then use the description and the generated parser. • Description serves as living documentation. • Parser exhaustively detects errors without cluttering user code. • From the description, we generate auxiliary tools.
PADS Architecture [architecture diagram]
PADS Language. Type-based model: types indicate how to process associated data.
• Provides a rich and extensible set of base types.
  – Pint8, Puint8, …            /- -123, 44
  – Pstring(:'|':)             /- hello |
    Pstring_FW(:3:)            /- catdog
    Pstring_ME(:"/a*/":)       /- aaaaaab
  – Pdate, Ptime, Pip, …
• Provides type constructors to describe data source structure: Pstruct, Parray, Punion, Ptypedef, Penum.
• Allows arbitrary predicates to describe expected properties.
Running Example: Web Server Logs • Common Log Format, from Web Protocols and Practice.
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Fields: – IP address of remote host – Remote identity (usually ‘-’ to indicate name not collected) – Authenticated user (usually ‘-’ to indicate name not collected) – Time associated with request – Request (request method, request-uri, and protocol version) – Response code – Content length
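To make the seven fields concrete, here is a minimal hand-written sketch in Python that splits the sample line into them. It is not PADS and not code from the talk: the pattern, the function name `parse_clf_line`, and the field names are assumptions made for illustration. It is also exactly the kind of hand-coded, format-sensitive parser that the talk contrasts with a declarative description.

```python
import re

# Hand-written pattern for one Common Log Format line (illustrative only;
# names and structure are assumptions, not part of PADS).
CLF_PATTERN = re.compile(
    r'(?P<client>\S+) (?P<remote_id>\S+) (?P<auth>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<uri>\S+) HTTP/(?P<major>\d+)\.(?P<minor>\d+)" '
    r'(?P<response>\d{3}) (?P<length>\d+|-)'
)

def parse_clf_line(line):
    """Return a dict of the CLF fields, or None if the line does not match."""
    m = CLF_PATTERN.fullmatch(line.strip())
    if m is None:
        return None
    fields = m.groupdict()
    fields["response"] = int(fields["response"])
    # '-' means the content length was not available.
    fields["length"] = None if fields["length"] == "-" else int(fields["length"])
    return fields

sample = ('207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] '
          '"GET /turkey/amnty1.gif HTTP/1.0" 200 3013')
record = parse_clf_line(sample)
```

Note that a malformed line simply yields `None` here, with no indication of which field was wrong; that loss of error detail is one of the problems the parse descriptors in PADS are meant to address.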
Example: Pstruct
  Precord Pstruct http_weblog {
          host client;                  /- Client requesting service
    ' ';  auth_id remoteID;             /- Remote identity
    ' ';  auth_id auth;                 /- Name of authenticated user
    " ["; Pdate(:']':) date;            /- Timestamp of request
    "] "; http_request request;         /- Request
    ' ';  Puint16_FW(:3:) response;     /- 3-digit response code
    ' ';  Puint32 contentLength;        /- Bytes in response
  };
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Example: Parray
  Parray Phostname {
    Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
  };
  www.cnn.com - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Array declarations allow the user to specify: • Size (fixed, lower-bounded, upper-bounded, unbounded) • Psep, Pterm, and termination predicates • Constraints over the sequence of array elements. An array terminates upon reaching EOF, reaching its terminator, reaching its maximum size, or satisfying its termination predicate.
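The termination rules for arrays can be sketched as a small Python loop. This is an illustration of the semantics described on the slide, not the code PADS generates; `read_array`, `read_label`, and all parameter names are hypothetical. Stopping when no separator follows an element plays the role of Pterm(Pnosep).

```python
def read_array(data, pos, read_elem, sep=None, term=None, max_size=None, done=None):
    """Read elements from data starting at pos; return (elements, new_pos)."""
    elems = []
    while pos < len(data):                        # stop at end of input
        if term is not None and data.startswith(term, pos):
            break                                 # stop at the terminator
        if max_size is not None and len(elems) >= max_size:
            break                                 # stop at the maximum size
        elem, new_pos = read_elem(data, pos)
        if new_pos == pos:
            break                                 # element consumed nothing
        elems.append(elem)
        pos = new_pos
        if done is not None and done(elems):
            break                                 # termination predicate holds
        if sep is not None:
            if not data.startswith(sep, pos):
                break                             # no separator: array ends
            pos += len(sep)                       # consume the separator
    return elems, pos

def read_label(data, pos):
    """Read one hostname label: characters up to the next '.' or ' '."""
    end = pos
    while end < len(data) and data[end] not in ". ":
        end += 1
    return data[pos:end], end

labels, end = read_array("www.cnn.com - -", 0, read_label, sep=".")
```

On the sample input the loop reads three labels and stops at the space after `com`, leaving the rest of the record for the enclosing struct.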
Example: Punion
  Punion auth_id {
    Pchar unavailable : unavailable == '-';
    Pstring(:' ':) id;
  };
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
• Union declarations allow the user to describe variations. • Implementation tries branches in order. • Stops when it finds a branch whose constraints are all true. • Switched unions jump to a particular branch based on a selector.
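The "try branches in order, stop at the first one whose constraints hold" rule can be sketched as follows. This is a hedged Python illustration of ordered choice, not PADS output; `read_union`, `read_dash`, and `read_id` are hypothetical names, and each branch returns `None` to signal failure.

```python
def read_dash(data, pos):
    """Branch 1: a single character constrained to be '-'."""
    if pos < len(data) and data[pos] == "-":
        return ("unavailable", "-"), pos + 1
    return None

def read_id(data, pos):
    """Branch 2: a nonempty string terminated by a space (or end of input)."""
    end = data.find(" ", pos)
    if end == -1:
        end = len(data)
    if end == pos:
        return None
    return ("id", data[pos:end]), end

def read_union(data, pos, branches):
    """Try each branch in order; return the first success, else None."""
    for branch in branches:
        result = branch(data, pos)
        if result is not None:
            return result
    return None

AUTH_ID = [read_dash, read_id]
missing = read_union("- rest", 0, AUTH_ID)
present = read_union("kfisher rest", 0, AUTH_ID)
```

Because the branches are ordered, the result carries a tag saying which alternative matched, which is what lets later code distinguish "not collected" from a real identifier.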
Example: Ptypedef
  Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Typedefs allow the user to add constraints to existing types.
Example: Penum
  Penum method {
    GET, PUT, POST, HEAD, DELETE,
    LINK,      /- Unused after HTTP 1.0
    UNLINK     /- Unused after HTTP 1.0
  };
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Enumerations are strings on disk, 32-bit integers in memory.
Example: User Constraints
  int chkVersion(http_v version, method meth) { … };
  Pstruct http_request {
    '"'; method meth;
    ' '; Pstring(:' ':) req_uri;
    ' '; http_v version : chkVersion(version, meth);
    '"';
  };
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
Dependencies • “Early” data often affects parsing of later data: – Lengths of sequences – Branches of switched unions • To accommodate this usage, we allow PADS types to be parameterized:
  Punion packets_t (:Puint8 which, Puint8 length:) {
    Pswitch (which) {
      Pcase 1: header_t header;
      Pcase 2: body_t body;
      Pcase 3: trailer_t trailer;
      Pdefault: Pstring_FW(:length:) unknown;
    }
  };
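The idea that earlier values steer the parsing of later data can be sketched in Python. The slides do not define header_t, body_t, or trailer_t, so this sketch makes a simplifying assumption that differs from the PADS description: every branch is treated as a payload of `length` bytes, and the two selector bytes are read from the input itself. `read_fixed` and `read_packet` are hypothetical names.

```python
def read_fixed(data, pos, n):
    """Read exactly n bytes; raise ValueError if the input is too short."""
    if pos + n > len(data):
        raise ValueError("input too short")
    return data[pos:pos + n], pos + n

def read_packet(data, pos):
    """Read a selector byte, a length byte, then a payload chosen by the selector."""
    if pos + 2 > len(data):
        raise ValueError("input too short")
    which, length = data[pos], data[pos + 1]     # "early" data
    pos += 2
    # Assumption: all branches are fixed-width payloads of `length` bytes.
    tags = {1: "header", 2: "body", 3: "trailer"}
    tag = tags.get(which, "unknown")             # default branch
    payload, pos = read_fixed(data, pos, length) # "later" data depends on length
    return (tag, payload), pos

body = read_packet(bytes([2, 3]) + b"abcXYZ", 0)
other = read_packet(bytes([9, 2]) + b"hi", 0)
```

In PADS the same dependency is expressed declaratively through the type parameters `which` and `length` rather than through hand-written control flow.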
Common Log Format in PADS
  Parray Phostname {
    Pstring_SE(:"/[. ]/":) [] : Psep('.') && Pterm(Pnosep);
  };
  Punion host {
    Pip ip;                /- 135.207.23.32
    Phostname host;        /- www.research.att.com
  };
  Punion auth_id {
    Pchar unauthorized : unauthorized == '-';
    Pstring(:' ':) id;
  };
  Punion length {
    Pchar unavailable : unavailable == '-';
    Puint32 len;
  };
  Penum method { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK };
  Pstruct version {
    "HTTP/"; Puint8 major; '.'; Puint8 minor;
  };
  int chkVersion(version v, method m) {
    if ((v.major == 1) && (v.minor == 1)) return 1;
    if ((m == LINK) || (m == UNLINK)) return 0;
    return 1;
  };
  Pstruct request {
    '"'; method meth;
    ' '; Pstring(:' ':) req_uri;
    ' '; version version : chkVersion(version, meth);
    '"';
  };
  Ptypedef Puint16_FW(:3:) response : response x => { 100 <= x && x < 600 };
  Precord Pstruct entry {
          host client;
    ' ';  auth_id remoteID;
    ' ';  auth_id auth;
    " ["; Pdate(:']':) date;
    "] "; request request;
    ' ';  response response;
    ' ';  length length;
  };
  Psource Parray clf { entry []; }
  207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
PADS Parsing
  Perror_t entry_read(P_t *pdc, entry_m *mask, entry_pd *pd, entry *rep);
Invariant: If the mask is “check and set” and the parse descriptor reports no errors, then the in-memory representation is correct.
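The relationship between mask, parse descriptor, and representation can be sketched for a single field. This Python sketch is an assumption-laden illustration, not the generated C library: `ParseDescriptor` and `read_response` are hypothetical, and the boolean `check` stands in for a mask that enables or disables constraint checking. The constraint is the one from the response typedef, 100 <= x < 600.

```python
from dataclasses import dataclass, field

@dataclass
class ParseDescriptor:
    """Records what went wrong during a read (hypothetical stand-in for a pd)."""
    nerr: int = 0
    errors: list = field(default_factory=list)

def read_response(text, check=True):
    """Parse a 3-digit response code; return (pd, rep)."""
    pd = ParseDescriptor()
    if len(text) != 3 or not (text.isascii() and text.isdigit()):
        pd.nerr += 1
        pd.errors.append("syntax: expected 3 digits")
        return pd, None
    rep = int(text)
    # The semantic constraint is checked only when the mask asks for it.
    if check and not (100 <= rep < 600):
        pd.nerr += 1
        pd.errors.append("constraint: 100 <= x < 600 violated")
    return pd, rep

ok_pd, ok_rep = read_response("200")
bad_pd, bad_rep = read_response("999")
unchecked_pd, unchecked_rep = read_response("999", check=False)
```

With checking on and `nerr == 0`, the caller can trust `rep`; with checking off, a clean descriptor says only that the syntax was fine.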
Using the Generated Code
  P_t      *p;
  entry    rep;
  entry_pd pd;
  entry_m  mask;

  P_open(&p, 0, 0);
  P_io_fopen(p, "clf/data/2004.11");
  entry_m_init(p, &mask, P_CheckAndSet);
  ...
  while (!P_io_at_eof(p)) {
    entry_read(p, &mask, &pd, &rep);
    if (pd.nerr > 0) {
      /- Route records with errors to a separate file.
      entry_write2io(p, ERR_FILE, &pd, &rep);
    } else {
      cnvIPAddress(&rep);
      if (entry_verify(&rep)) {
        entry_write2io(p, CLEAN_FILE, &pd, &rep);
      } else {
        error(2, "Data transform failed.");
      }
    }
  }
Leverage! • Convert PADS description into a collection of tools: – Accumulators – Histograms – Clustering tool – Formatters – Translator into XML, with corresponding XML Schema – XQueries using Galax’s data interface – Ad hoc data management console – … • Long-term goal: Provide a compelling suite of tools to overcome the inertia of adopting a new language and system.
Accumulators • Statistical profile of data:
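As a rough idea of what a statistical profile of one field involves, here is a minimal Python sketch. It is not the accumulator tool that PADS generates and does not reproduce its report format; the class `IntAccumulator` and its attributes are hypothetical, and the sample values are the two content lengths shown elsewhere in the deck plus one invented bad record.

```python
from collections import Counter

class IntAccumulator:
    """Running profile of one integer field (hypothetical, not the PADS API)."""

    def __init__(self):
        self.good = 0            # values that parsed cleanly
        self.bad = 0             # values whose parse reported errors
        self.total = 0
        self.minimum = None
        self.maximum = None
        self.values = Counter()  # frequency of each distinct value

    def add(self, value, nerr=0):
        """Fold in one value; nerr > 0 marks a value that failed to parse."""
        if nerr > 0:
            self.bad += 1
            return
        self.good += 1
        self.total += value
        self.minimum = value if self.minimum is None else min(self.minimum, value)
        self.maximum = value if self.maximum is None else max(self.maximum, value)
        self.values[value] += 1

    def mean(self):
        """Average of the good values, or None if there were none."""
        return self.total / self.good if self.good else None

acc = IntAccumulator()
for length, nerr in [(30, 0), (941, 0), (30, 0), (None, 1)]:
    acc.add(length, nerr)
```

Counting bad values separately matters here because, as the earlier slides note, errors are sometimes the most interesting part of the data.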
Pretty Printer • Customizable program to reformat data. Input:
  207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
  tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
Transformations: normalize time zones, normalize delimiters, drop unnecessary values, filter/repair errors. Output:
  207.136.97.49|-|-|10/16/97:01:46:51|GET|/tk/p.txt|1|0|200|30
  tj62.aol.com|-|-|10/16/97:21:32:22|POST|/scpt/dd@grp.org/confirm|1|0|200|941
• Users can override pretty printing on a per-type basis. • Used to normalize monitoring data before loading into a relational database.
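The time-zone and delimiter normalization shown in the sample output can be reproduced with a short Python sketch. This is an illustration of the transformation only, not the generated formatter; `normalize_date` and `format_entry` are hypothetical names, and the sketch assumes an English locale so that `%b` matches month names such as "Oct".

```python
from datetime import datetime, timezone

def normalize_date(clf_date):
    """Convert a CLF timestamp with a UTC offset to MM/DD/YY:HH:MM:SS in UTC."""
    dt = datetime.strptime(clf_date, "%d/%b/%Y:%H:%M:%S %z")
    return dt.astimezone(timezone.utc).strftime("%m/%d/%y:%H:%M:%S")

def format_entry(client, remote_id, auth, date, method, uri,
                 major, minor, response, length):
    """Emit one pipe-delimited record with the timestamp normalized."""
    fields = [client, remote_id, auth, normalize_date(date), method, uri,
              str(major), str(minor), str(response), str(length)]
    return "|".join(fields)

line = format_entry("207.136.97.49", "-", "-", "15/Oct/1997:18:46:51 -0700",
                    "GET", "/tk/p.txt", 1, 0, 200, 30)
```

Shifting 18:46:51 at -0700 to UTC moves the first record to 01:46:51 on the following day, which is why its output date reads 10/16/97.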
XQuery Integration • XQueries can run over ad hoc data with PADS descriptions without converting the data to XML. • The PADS compiler generates a description-specific instance of the Galax data API, conforming to the generated XML Schema.
  $pads/Psource/elt[date/rep >= xs:dateTime("2004-10-01:00:00") and date/rep < xs:dateTime("2004-11-01:00:00")]
LaunchPADS: GUI for ad hoc data. SIGMOD 2006 Demo.
Formal Theory (POPL 2006) • A core data description calculus (DDC) – Based on dependent type theory – Simple, orthogonal, composable types – Types transduce external data source to internal representation. • Encodings of high-level DDLs (PacketTypes, PADS, DataScript) in the low-level DDC.
Future Research Directions • Design – How can we specify error-aware data transformations? – Can we infer a data transformation between two descriptions? – How can we express application-specific information? – Can we automatically generate conforming data? – Can we integrate with a data visualization system? • Implementation – How can we specialize generated libraries to incorporate application-specific information? – How can we optimize XQueries over PADS data sources? • Theory – What is the expressiveness of PADS vs. context-free grammars? – How do we add parametric polymorphism to PADS? • Engineering – How do we build the system to make it easy to add new base types? – New libraries and tools? New language bindings?
Try it! • Available for download with a CPL license. • Demo of accumulators, format program, and XML conversion. • Send us feedback! www.padsproj.org
PADS Team (growing!): Kathleen Fisher (AT&T), Mary Fernandez (AT&T), Joel Gottlieb (AT&T), David Walker (Princeton), Yitzhak Mandelbaum (Princeton), Mark Daly (Princeton), Robert Gruber (Google), Martin Strauss (Michigan), Xuan Zheng (Michigan)


