- Number of slides: 54
Pads/ML: A Functional Data Description Language
David Walker, Princeton University
with: Yitzhak Mandelbaum (Princeton), Kathleen Fisher and Mary Fernandez (AT&T)
Data, data everywhere!
Incredible amounts of data stored in well-behaved formats:
• Databases
• XML
Tools:
• Schema
• Browsers
• Query languages
• Standards
• Libraries
• Books, documentation
• Conversion tools
• Vendor support
• Consultants…
Ad hoc Data
• Vast amounts of data in ad hoc formats.
• Ad hoc data is semi-structured:
  – Not free text.
  – Not as rigid as data in relational databases.
• Examples from many different areas:
  – Physics
  – Computer system maintenance and administration
  – Biology
  – Finance
  – Government
  – Healthcare
  – More!
Ad Hoc Data in Biology
format-version: 1.0
date: 11:2005 14:24
auto-generated-by: DAG-Edit 1.419 rev 3
default-namespace: gene_ontology
subsetdef: goslim_goa "GOA and proteome slim"
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [PMID:10873824, PMID:11389764, SGD:mcc]
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
Source: www.geneontology.org
Ad Hoc Data in Chemistry
O=C([C@@H]2 OC(C)=O)[C@@]3(C)[C@](CO 4) (OC(C)=O)[C@H]4 C[C@@H]3 O)([H])[C@H] (OC(C 7=CC=CC=C 7)=O)[C@@]1(O)[C@@](C)(C)C 2=C(C) [C@@H](OC([C@H](O)[C@@H](NC(C 6=CC=CC=C 6)=O) C 5=CC=CC=C 5)=O)C 1
Ad Hoc Data in Finance
HA 0000 START OF TEST CYCLE a. A 00000001 BXYZ U 1 AB 0000040000100 B 0000004200 HL 00000002 START OF OPEN INTEREST d 00000003 FZYX G 1 AB 00000300000 HM 00000004 END OF OPEN INTEREST HE 00000005 START OF SUMMARY f 00000006 NYZX B 1 QB 00052000120000070000 B 000050000000520000 00490000005100+00000100 B 00000005300000052500000535000 HF 00000007 END OF SUMMARY
Source: www.opradata.com
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
234.200.68.71 - - [15/Oct/1997:18:53:33 -0700] "GET /tr/img/gift.gif HTTP/1.0" 200 409
240.142.174.15 - - [15/Oct/1997:18:39:25 -0700] "GET /tr/img/wool.gif HTTP/1.0" 404 178
188.168.121.58 - - [16/Oct/1997:12:59:35 -0700] "GET / HTTP/1.0" 200 3082
214.201.210.19 ekf - [17/Oct/1997:10:08:23 -0700] "GET /img/new.gif HTTP/1.0" 304 -
Ad Hoc Data: DNS packets
0000: 9192 d 8 fb 8480 0001 05 d 8 0000 0872 00000010: 6573 6561 7263 6803 6174 7403 636 f 6 d 00 00000020: 00 fc 0001 c 0006 0001 0000 0 e 10 0027 00000030: 036 e 7331 c 00 c 0 a 68 6 f 73 746 d 6173 7465 00000040: 72 c 0 0 c 77 64 e 5 4900 000 e 1000 0003 8400 00000050: 36 ee 8000 000 e 10 c 00 0 f 00 0100 000 e 00000060: 1000 0 a 05 6 c 69 6 e 75 78 c 0 0 c 00 00000070: 0 f 00 0100 000 e 1000 0 c 00 0 a 07 6 d 61 696 c 00000080: 6 d 61 6 ec 0 0 c 00 0100 000 e 1000 00000090: 0487 cf 1 a 16 c 0 0 c 00 0200 0100 000 e 1000 000000 a 0: 0603 6 e 73 30 c 00 0200 0100 000 e 000000 b 0: 1000 02 c 0 2 e 03 5 f 67 63 c 0 0 c 00 2100 000000 c 0: 0002 5800 1 d 00 0000 640 c c 404 7068 7973 000000 d 0: 0872 6573 6561 7263 6803 6174 7403 636 f
ASCII column: . . . . r esearch. att. com. . . . '. ns 1. . . hostmaste r. . wd. I. . 6. . . . . linux. . . . mail man. . . . ns 0. . . . _gc. . . !. . . X. . . d. . . phys. research. att. co
Properties of Ad hoc Data
• Data arrives “as is” -- you don’t choose the format.
• Documentation is often out-of-date or nonexistent.
  – Hijacked fields.
  – Undocumented “missing value” representations.
• Data is buggy.
  – Missing data, “extra” data, …
  – Human error, malfunctioning machines, software bugs (e.g. race conditions on log entries), …
  – Errors are sometimes the most interesting portion of the data.
• Data sources often have high volume.
  – Data might not fit into main memory.
• Data can be created by malicious sources attempting to exploit software vulnerabilities.
  – cf. Ethereal network monitoring system
The Goal(s)
• What can we do about ad hoc data?
  – how do we read it into programs?
  – how do we detect errors?
  – how do we correct errors?
  – how do we query it?
  – how do we discover its structure and properties?
  – how do we view it?
  – how do we transform it into standard formats like CSV, XML?
  – how do we merge multiple data sources?
• In short: how do we do all the things we take for granted when dealing with standard formats in a fault-tolerant and efficient, yet nearly effortless way?
Enter Pads
• Pads: a system for Processing Ad hoc Data Sources
• Three main components:
  – a data description language
    • for concise and precise specifications of ad hoc data formats and semantic properties
  – a compiler that automatically generates a suite of programming libraries & end-to-end applications
  – a visual interface to support both novice and expert users
One Description, Many Tools
[Figure: a data description (type T) is fed to the compiler, which generates an XML translator, query engine, parser, printer, visual data browser, ...; the outputs are labeled "programming library" and "complete application".]
Some Advantages Over Ad Hoc Methods
• Big bang for buck: 1 description, many tools
• Descriptions document data sources
  – the documentation IS the tool generator, so documentation is automatically kept up-to-date with implementation
• Descriptions are easy to write, easy to understand.
  – descriptions are high-level & declarative
  – description syntax exploits programmer intuition concerning types
• Tools are robust
  – Error handling code generated automatically; doesn’t clutter documentation.
• Descriptions & generated tools can be analyzed and reasoned about
  – e.g. data size, tool termination & safety properties, coherence of generated parsers & printers
The PADS Project
• PADS/C [PLDI 05; POPL 06]
  – Based on C type structure.
  – Generates C libraries.
    • too bad C doesn’t actually support libraries...
  – LaunchPADS visual interface [Daly et al., SIGMOD 06]
• PADS/ML (Mandelbaum’s thesis)
  – Based on the ML type structure.
    • polymorphic, dependent datatypes
  – Generates ML modules.
    • better reuse & library structure
    • functional data processing = far greater programmer productivity
  – New framework for tool development.
    • Format-independent algorithms architected using functors vs macros
  – Implementation status.
    • Version 1.0 up and running
    • Many more exciting things to do
• Describe real formats:
  – Newick tree-structured data
  – Reglens galaxy catalogues
  – Palm PDA databases
  – AT&T call-detail data
  – AT&T billing data
  – Web server logs
  – Gene ontologies
  – DNS packets
  – OPRA data
  – More …
Outline
• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions
Base Types and Records
• Base types: C(e).
  – Describe atomic portions of data.
  – Parameterized by a host-language expression.
  – Examples: Pint, Pchar, Pstring_FW(n), Pstring(c).
• Tuples and Records: t * t' and {x: t; y: t'}.
  – Record fields are dependent: the types of later fields can refer to the names of earlier fields.
  – Example to follow.
Base Types and Records
Movie-director Bowling Score (MBS) Format
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
Pint * Pstring('|') * Pchar
Base Types and Records
Bookshelf Listing (BL) Format
13C Programming
31Types and Programming Languages
20Twenty Years of PLDI
36Modern Compiler Implementation in ML
27Elements o f ML Programming
{ width: Pint; title: Pstring_FW(width) }
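The dependency between the two fields of the Bookshelf Listing record can be sketched by hand in OCaml. This is only an illustration of what a parser for { width: Pint; title: Pstring_FW(width) } has to do, not code produced by PADS/ML; the names bl_entry, parse_int, parse_fixed_width and parse_bl_entry are hypothetical, and the sketch assumes the title follows the width with no separator.

```ocaml
(* Hand-written sketch; none of these names come from PADS/ML. *)
type bl_entry = { width : int; title : string }

(* Read a run of decimal digits starting at [pos];
   return the integer and the position just after it. *)
let parse_int (s : string) (pos : int) : (int * int) option =
  let n = String.length s in
  let rec scan i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then scan (i + 1) else i
  in
  let stop = scan pos in
  if stop = pos then None
  else Some (int_of_string (String.sub s pos (stop - pos)), stop)

(* Read exactly [len] characters starting at [pos]. *)
let parse_fixed_width (len : int) (s : string) (pos : int)
  : (string * int) option =
  if len >= 0 && pos + len <= String.length s
  then Some (String.sub s pos len, pos + len)
  else None

(* The dependency: the width parsed first decides how the title is read. *)
let parse_bl_entry (s : string) : bl_entry option =
  match parse_int s 0 with
  | None -> None
  | Some (width, pos) ->
    (match parse_fixed_width width s pos with
     | Some (title, _) -> Some { width; title }
     | None -> None)
```

A title that itself starts with a digit would be misread by this sketch, which is one reason a real description language needs more careful base types than a greedy digit scan.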
Constraints
• Constrained types: [x: t | e].
  – Enforce the constraint e on the underlying type t.
125Tim|Burton|30|82|71
[c: Pchar | c = '|']    Pchar    '|'
ptype Scores = {
  min: Pint; '|';
  max: [m: Pint | min ≤ m]; '|';
  avg: [a: Pint | min ≤ a & a ≤ max]
}
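A rough OCaml reading of the Scores constraints, assuming the three scores have already been isolated as a string such as 30|82|71. The violation type and the function names are hypothetical; the point is only that syntax errors and constraint violations are reported separately and that data violating a constraint is still returned.

```ocaml
type scores = { min : int; max : int; avg : int }

(* Which constraint of the Scores description failed (hypothetical names). *)
type violation = Max_below_min | Avg_out_of_range

(* Split "30|82|71" on '|' and convert each field; None on a syntax error. *)
let parse_scores (s : string) : scores option =
  match List.map int_of_string_opt (String.split_on_char '|' s) with
  | [ Some min; Some max; Some avg ] -> Some { min; max; avg }
  | _ -> None

(* Check the semantic constraints without discarding the data:
   min <= max, and min <= avg <= max. *)
let check (r : scores) : violation list =
  (if r.min <= r.max then [] else [ Max_below_min ])
  @ (if r.min <= r.avg && r.avg <= r.max then [] else [ Avg_out_of_range ])
```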
Datatypes
• Describe alternatives in data source with datatypes.
  – Parser tries each alternative in order.
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
pdatatype Id = None of "n/a" | Some of Pint
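"Tries each alternative in order" is ordered choice. A minimal OCaml sketch follows, using a small hand-rolled combinator rather than anything from PADS/ML. The type 'a parse, the operator <|>, and the constructors Absent and Present (renamed here so they do not shadow OCaml's option constructors) are all assumptions made for the illustration.

```ocaml
(* A parser reads from position [pos] and returns a value plus the new position. *)
type 'a parse = string -> int -> ('a * int) option

(* Match a fixed string. *)
let literal (lit : string) : unit parse = fun s pos ->
  let n = String.length lit in
  if pos + n <= String.length s && String.sub s pos n = lit
  then Some ((), pos + n)
  else None

(* Match one or more decimal digits. *)
let digits : int parse = fun s pos ->
  let len = String.length s in
  let rec scan i =
    if i < len && s.[i] >= '0' && s.[i] <= '9' then scan (i + 1) else i
  in
  let stop = scan pos in
  if stop = pos then None
  else Some (int_of_string (String.sub s pos (stop - pos)), stop)

(* Ordered choice: try [p]; only if it fails, try [q] at the same position. *)
let ( <|> ) (p : 'a parse) (q : 'a parse) : 'a parse = fun s pos ->
  match p s pos with
  | Some _ as r -> r
  | None -> q s pos

let map (f : 'a -> 'b) (p : 'a parse) : 'b parse = fun s pos ->
  match p s pos with
  | Some (v, next) -> Some (f v, next)
  | None -> None

(* Counterpart of: pdatatype Id = None of "n/a" | Some of Pint *)
type id = Absent | Present of int

let parse_id : id parse =
  map (fun () -> Absent) (literal "n/a") <|> map (fun n -> Present n) digits
```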
Recursive Datatypes
• Describe inductively-defined formats.
79|31|71|40
pdatatype IntList =
    Cons of Pint * '|' * IntList
  | Last of Pint
Polymorphic Types
• Parameterize types by other types.
pdatatype IntList =
    Cons of Pint * '|' * IntList
  | Last of Pint
pdatatype (Elt) List =
    Cons of Elt * '|' * (Elt) List
  | Last of Elt
ptype IntList = Pint List
ptype CharList = Pchar List
Dependent Types
• Parameterize types by values.
pdatatype IntList =
    Cons of Pint * '|' * IntList
  | Nil of Pint
pdatatype (Elt) List (x: char) =
    Cons of Elt * x * (Elt) List(x)
  | Nil of Elt
ptype IntListBar = Pint List('|')
ptype CharListComma = Pchar List(',')
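The step from IntList to (Elt) List(x) corresponds to abstracting a parser over both an element parser and a separator value. A hedged OCaml sketch, in which the element parser stands in for the type parameter Elt and the character sep for the value parameter x; the names sep_list, digits and any_char are illustrative only.

```ocaml
type 'a parse = string -> int -> ('a * int) option

(* Non-empty list of elements separated by [sep].
   Tries the "element, separator, rest" case first and falls back to a
   single final element, mirroring the order of the two constructors. *)
let rec sep_list (elt : 'a parse) (sep : char) : 'a list parse = fun s pos ->
  match elt s pos with
  | None -> None
  | Some (v, next) ->
    if next < String.length s && s.[next] = sep then
      (match sep_list elt sep s (next + 1) with
       | Some (rest, stop) -> Some (v :: rest, stop)
       | None -> Some ([ v ], next))
    else Some ([ v ], next)

(* One or more decimal digits. *)
let digits : int parse = fun s pos ->
  let len = String.length s in
  let rec scan i =
    if i < len && s.[i] >= '0' && s.[i] <= '9' then scan (i + 1) else i
  in
  let stop = scan pos in
  if stop = pos then None
  else Some (int_of_string (String.sub s pos (stop - pos)), stop)

(* Any single character. *)
let any_char : char parse = fun s pos ->
  if pos < String.length s then Some (s.[pos], pos + 1) else None

(* Counterparts of Pint List('|') and Pchar List(','). *)
let int_list_bar = sep_list digits '|'
let char_list_comma = sep_list any_char ','
```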
More Dependent Types
• “Switched” datatypes:
pdatatype GuidedOption (tag: int) =
  pmatch tag of
    0 => Zero of Pstring
  | 1 => One of Pint
  | 2 => Two of Pint * Pint
  | _ => None
ptype source = { tag: Pint; payload: GuidedOption(tag) }
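A switched datatype picks the payload parser from a value read earlier. The slide does not say how tag and payload are delimited, so the OCaml sketch below assumes, purely for illustration, that fields are separated by single spaces; None_payload stands in for the slide's None branch, and the function names are hypothetical.

```ocaml
type guided_option =
  | Zero of string
  | One of int
  | Two of int * int
  | None_payload

type source = { tag : int; payload : guided_option }

(* Choose how to read the payload from the tag that was already parsed. *)
let parse_payload (tag : int) (fields : string list) : guided_option option =
  match tag, fields with
  | 0, [ s ] -> Some (Zero s)
  | 1, [ n ] -> Option.map (fun n -> One n) (int_of_string_opt n)
  | 2, [ a; b ] ->
    (match int_of_string_opt a, int_of_string_opt b with
     | Some a, Some b -> Some (Two (a, b))
     | _ -> None)
  | (0 | 1 | 2), _ -> None          (* known tag, malformed payload *)
  | _, _ -> Some None_payload       (* any other tag: the catch-all branch *)

(* Assumed layout: "<tag> <payload fields...>", separated by single spaces. *)
let parse_source (line : string) : source option =
  match String.split_on_char ' ' line with
  | [] -> None
  | tag :: rest ->
    (match int_of_string_opt tag with
     | None -> None
     | Some tag ->
       Option.map (fun payload -> { tag; payload }) (parse_payload tag rest))
```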
bool) = { name : [name : Pstring('=') | p name]; '='; value : Alpha }
ptype (Alpha) Nvp(name: string) = Alpha Pnvp(fun s -> s = name)
ptype SVString = Pstring_SE("/;|\|/")
ptype Nvp_a = SVString Pnvp(fun _ -> true)
ptype Details = {
  source : Pip Nvp("src_addr"); ';';
  dest : Pip Nvp("dest_addr"); ';';
  start_time : Timestamp Nvp("start_time"); ';';
  end_time : Timestamp Nvp("end_time"); ';';
  cycle_time : Puint32 Nvp("cycle_time")
}
ptype Semicolon = Pcharlit(';')
ptype Vbar = Pcharlit('|')
pdatatype Info(alarm_code : int) =
  Pmatch alarm_code with
    5074 -> Details of Details
  | _ -> Generic of (Nvp_a, Semicolon, Vbar) Plist
pdatatype Service =
    Dom of "DOMESTIC"
  | Int of "INTERNATIONAL"
  | Spec of "SPECIAL"
ptype Raw_alarm = {
  alarm : [ i : Puint32 | i = 2 or i = 3]; ':';
  start : Timestamp Popt; '|';
  clear : Timestamp Popt; '|';
  code : Puint32; '|';
  src_dns : SVString Nvp("dns1"); ';';
  dest_dns : SVString Nvp("dns2"); '|';
  info : Info(code); '|';
  service : Service
}
let checkCorr ra = ...
ptype Alarm = [x: Raw_alarm | checkCorr x]
ptype Source = (Alarm, Peor, Peof) Plist

Sample Regulus Data:
2:3004092508||5001|dns1=abc.com;dns2=xyz.com|c=slow link;w=lost packets|INTERNATIONAL
3:|3004097201|5074|dns1=bob.com;dns2=alice.com|src_addr=192.168.0.10;dst_addr=192.168.23.10;start_time=1234567890;end_time=1234568000;cycle_time=17412|SPECIAL
Outline
• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions
Parsing With PADS
[Figure: the compiler turns a data description (type T) into a parser; the parser reads raw data (01001001 00111...) and hands user code a data rep (type ~T) and a parse descriptor (type ~T).]
Example: MBS Representation
n/aEd|Wood|10|47|31
Description:
pdatatype Id = None of "n/a" | Some of Pint
ptype MBS-Entry = { id: Id; first: Pstring('|'); '|'; last: Pstring('|'); '|'; scores: Scores }
Representation:
datatype Id = None | Some of int
type MBS-Entry = { id: Id; first: string; last: string; scores: Scores }
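To make the split between representation and parse descriptor concrete, here is a hand-written OCaml sketch of a parser for one MBS entry that never fails outright: it always returns a representation together with a descriptor recording what went wrong. The descriptor layout (nerr, id_pd, scores_pd) and all function names are assumptions for illustration, not the types PADS/ML generates; the constructors are named No_id and Id to avoid shadowing OCaml's None and Some, and the entry is assumed to have the id immediately followed by the first name.

```ocaml
type id = No_id | Id of int
type scores = { min : int; max : int; avg : int }
type mbs_entry = { id : id; first : string; last : string; scores : scores }

(* Hypothetical parse descriptor: an error count plus per-field status. *)
type field_status = Good | Bad of string
type mbs_entry_pd = { nerr : int; id_pd : field_status; scores_pd : field_status }

(* Split "122Joe" or "n/aEd" into the id and the first name. *)
let split_id (s : string) : id * string * field_status =
  let n = String.length s in
  if n >= 3 && String.sub s 0 3 = "n/a" then (No_id, String.sub s 3 (n - 3), Good)
  else begin
    let rec scan i =
      if i < n && s.[i] >= '0' && s.[i] <= '9' then scan (i + 1) else i
    in
    let stop = scan 0 in
    if stop = 0 then (No_id, s, Bad "missing id")
    else (Id (int_of_string (String.sub s 0 stop)), String.sub s stop (n - stop), Good)
  end

let zero_scores = { min = 0; max = 0; avg = 0 }

(* Always returns a representation; problems are reported in the descriptor. *)
let parse_entry (line : string) : mbs_entry * mbs_entry_pd =
  match String.split_on_char '|' line with
  | [ head; last; a; b; c ] ->
    let id, first, id_pd = split_id head in
    let scores, scores_pd =
      match int_of_string_opt a, int_of_string_opt b, int_of_string_opt c with
      | Some min, Some max, Some avg -> ({ min; max; avg }, Good)
      | _ -> (zero_scores, Bad "non-numeric score")
    in
    let nerr =
      List.length (List.filter (fun st -> st <> Good) [ id_pd; scores_pd ])
    in
    ({ id; first; last; scores }, { nerr; id_pd; scores_pd })
  | _ ->
    let bad = Bad "wrong field count" in
    ( { id = No_id; first = ""; last = ""; scores = zero_scores },
      { nerr = 2; id_pd = bad; scores_pd = bad } )
```

User code can then decide per record whether a nonzero nerr means "skip", "repair", or "inspect", which is the point of keeping errors out of the representation itself.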
Tool Generation With PADS/ML
[Figure: the compiler turns a data description (type T) into a parser and a format-specific traversal functor; the parser reads raw data (01001001 00111...) and produces a data rep (type ~T) and a parse descriptor (type ~T), which the traversal functor combines with a format-independent tool module.]
Tools in this pattern: accumulator, debugger, histograms, clusters, format converters
Types as Modules
• PADS/ML generates a module for each type/description:
sig
  type rep
  type pd
  val parser : Pads.handle -> rep * pd
  module Traverse (tool : TOOL) : sig ... end
end
• Parameterized types ==> Functors
• Recursive types ==> Recursive modules
  – sigh: combination of recursive modules & functors not supported in O’Caml, so we’re reduced to a bit of a hack for recursion
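The "parameterized types ==> functors" idea can be sketched in plain OCaml. Everything below is an assumption made for illustration: the Pads module is a tiny stand-in for the real runtime (which the slides do not show), the signature is named DESCRIPTION here, the parsing function is called parse, and the Traverse functor is left out.

```ocaml
(* Hypothetical stand-in for the PADS runtime handle. *)
module Pads = struct
  type handle = { input : string; mutable pos : int }
  let open_string (s : string) : handle = { input = s; pos = 0 }
end

(* What each description module provides in this sketch. *)
module type DESCRIPTION = sig
  type rep
  type pd
  val parse : Pads.handle -> rep * pd
end

(* A base type as a module; pd is just "parsed without error?". *)
module Pint : DESCRIPTION with type rep = int and type pd = bool = struct
  type rep = int
  type pd = bool
  let parse (h : Pads.handle) =
    let open Pads in
    let n = String.length h.input in
    let start = h.pos in
    while h.pos < n && h.input.[h.pos] >= '0' && h.input.[h.pos] <= '9' do
      h.pos <- h.pos + 1
    done;
    if h.pos = start then (0, false)
    else (int_of_string (String.sub h.input start (h.pos - start)), true)
end

module Pchar : DESCRIPTION with type rep = char and type pd = bool = struct
  type rep = char
  type pd = bool
  let parse (h : Pads.handle) =
    let open Pads in
    if h.pos < String.length h.input then begin
      let c = h.input.[h.pos] in
      h.pos <- h.pos + 1;
      (c, true)
    end else ('\000', false)
end

(* A type parameterized by two other types becomes a functor. *)
module Pair (A : DESCRIPTION) (B : DESCRIPTION)
  : DESCRIPTION with type rep = A.rep * B.rep and type pd = A.pd * B.pd = struct
  type rep = A.rep * B.rep
  type pd = A.pd * B.pd
  let parse h =
    let a, a_pd = A.parse h in
    let b, b_pd = B.parse h in
    ((a, b), (a_pd, b_pd))
end

module Int_then_char = Pair (Pint) (Pchar)
```

Applying the functor, as in Int_then_char, is the module-level analogue of instantiating a polymorphic description such as Pint List.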
Outline
• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions
Motivation
• To crystallize design principles.
  – Example: error counting methodology in PADS/C.
• To ensure system correctness.
  – Example: parsers return data of expected type.
• As basis for evolution and experimentation.
  – Critical to design of PADS/ML.
• To communicate core ideas.
  – Designing the next 700 data description languages.
PADS and DDC
• Developed a semantic framework based on the Data Description Calculus (DDC).
• Explains PADS/ML and other languages with DDC.
• Gives a denotational semantics to DDC.
[Figure: PADS/C, PADS/ML, and "The Next 700" shown alongside DDC.]
Data Description Calculus
• DDC: calculus of dependent types for describing data.
t ::= unit | bottom | C(e) | Σx:t.t | t + t | t & t | {x:t | e} | t seq(t, e, t)
    | λx.t | t e | λα.t | t t | α | μα.t
    | compute(e:σ) | absorb(t) | scan(t)
• Expressions e, with type σ, drawn from F-omega.
• A kinding judgment specifies well-formed descriptions.
Choosing a Semantics
• Semantics of REs, CFGs given as sets of strings, but this fails to account for:
  – Relationship between internal and external data.
  – Error handling.
  – Types of representation and parse descriptor.
• DDC
  – Denotational semantics of types as parsers in F-omega
A 3-Fold Semantics
Interpretations of a description t:
• [t] : parser (reads 0100100100...)
• [t]rep : representation
• [t]pd : parse descriptor
[{x:t | e}]rep = [t]rep + [t]rep
[Σx:t.t']rep = [t]rep * [t']rep
[{x:t | e}]pd = hdr * [t]pd
[Σx:t.t']pd = hdr * [t]pd * [t']pd
Type Correctness Theorem
[t] : bits → [t]rep * [t]pd
[Figure: a description t and its three interpretations: [t] (parser, reading 0100100100...), [t]rep (representation), [t]pd (parse descriptor).]
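The shape of the three interpretations can be mimicked in OCaml for the two constructs with equations on the slides, constrained types and dependent pairs. This is a loose sketch and not the DDC definition: strings with an offset stand in for bits, the header hdr only counts errors detected at its own node, and which summand of the sum denotes "constraint holds" is a choice made here. All names are hypothetical.

```ocaml
type hdr = { nerr : int }

(* [t] : input -> offset -> (new offset, [t]rep, [t]pd); never fails. *)
type ('rep, 'pd) parser_of = string -> int -> int * 'rep * 'pd

(* [{x:t | e}]rep = [t]rep + [t]rep: one summand per outcome of e. *)
type 'rep constrained = Holds of 'rep | Fails of 'rep

(* [{x:t | e}]pd = hdr * [t]pd *)
let constrain (p : ('rep, 'pd) parser_of) (e : 'rep -> bool)
  : ('rep constrained, hdr * 'pd) parser_of =
  fun s pos ->
    let next, r, pd = p s pos in
    if e r then (next, Holds r, ({ nerr = 0 }, pd))
    else (next, Fails r, ({ nerr = 1 }, pd))

(* [Σx:t.t']rep = [t]rep * [t']rep and [Σx:t.t']pd = hdr * [t]pd * [t']pd.
   The second parser may depend on the value produced by the first. *)
let sigma (p : ('a, 'apd) parser_of) (q : 'a -> ('b, 'bpd) parser_of)
  : ('a * 'b, hdr * 'apd * 'bpd) parser_of =
  fun s pos ->
    let mid, a, apd = p s pos in
    let next, b, bpd = q a s mid in
    (next, (a, b), ({ nerr = 0 }, apd, bpd))

(* Base type: decimal integer; on failure returns 0 and flags an error. *)
let pint : (int, hdr) parser_of = fun s pos ->
  let n = String.length s in
  let rec scan i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then scan (i + 1) else i
  in
  let stop = scan pos in
  if stop = pos then (pos, 0, { nerr = 1 })
  else (stop, int_of_string (String.sub s pos (stop - pos)), { nerr = 0 })

(* Skip a leading '|' before running [p]; a missing bar counts as an error. *)
let after_bar (p : ('r, hdr) parser_of) : ('r, hdr) parser_of = fun s pos ->
  if pos < String.length s && s.[pos] = '|' then p s (pos + 1)
  else
    let next, r, pd = p s pos in
    (next, r, { nerr = pd.nerr + 1 })

(* A minimum followed by a maximum constrained to be at least the minimum. *)
let min_max = sigma pint (fun lo -> constrain (after_bar pint) (fun hi -> lo <= hi))
```

Every function here returns a triple no matter what the input looks like, which is the OCaml shadow of the theorem's statement that a parser always yields a representation and a parse descriptor of the expected types.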
Outline
• Motivation and PADS Overview
• Data Description in PADS/ML
• Implementation architecture
• The Semantics of PADS
• Conclusions
Related Work
• Parser generator technology:
  – Lex & Yacc
    • no dependency
    • semantic actions entwined with data description
    • no higher-level tools
  – Parser combinators
    • semantic actions entwined with data description
    • no higher-level tools
Reminder: One Description, Many Tools
[Figure: a data description (type T) is fed to the compiler, which generates an XML translator, query engine, parser, printer, visual data browser, ...; the outputs are labeled "programming library" and "complete application".]
Parser combinators: One algorithm, One Tool
[Figure: a single tool, the parser.]
Related Work
• Other “data description” languages
  – Data Format Description Language (DFDL)
  – Binary Format Description Language (BFD)
  – PacketTypes [SIGCOMM ’00]
  – DataScript [GPCE ’02]
• None have a well-defined semantics or Pads tool support
Current & Future Work
• Tools and Applications
  – Description inference.
  – Support for specific domains (microbiology)
• Language Design
  – Transformation language for ad hoc data.
  – Description language for distributed
    • Describe locations, versions, timing, relationships, etc.
• Theory
  – Analyze data descriptions for interesting properties, e.g. equivalence, data size, termination, emptiness (always fails).
  – Coherence of parsing & printing
Summary
• The PADS vision: reliable, efficient and effortless ad hoc data processing
• PADS/ML:
  – Data description based on polymorphic, dependent datatypes
  – “Types as modules” implementation
  – Solid theoretical basis.
• Visit www.padsproj.org
The End
Questions?
Existing Approaches
• C, Perl, or shell scripts: most popular.
  – Time consuming & error prone to hand code parsers.
  – Difficult to maintain (worse than the ad hoc data itself in some cases!).
  – Often incomplete, particularly with respect to errors.
    • Error code, if written, swamps main-line computation.
    • If not written, errors can corrupt “good” data.
• Lex & Yacc
  – Good match for programming languages.
  – Bad match for ad hoc data.
• Compiler converts descriptions into robust, format-specific tools.
Parsing With PADS
• Robust parser at the core of generated tools.
Using Ad hoc Data
• Can Ed Wood bowl?
122Joe|Wright|45|95|79
n/aEd|Wood|10|47|31
124Chris|Nolan|80|93|85
125Tim|Burton|30|82|71
126George|Lucas|32|62|40
• Parsing only brings you part way.
  – Queries must be written in ML.
  – A lot of work.
• What about a declarative query?
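For comparison with a declarative query, this is roughly what answering "Can Ed Wood bowl?" looks like when written by hand in OCaml over already-parsed records. The entry type and the avg_of function are hypothetical, and the records are typed in directly from the sample data (scores read as min, max, avg) rather than produced by a PADS/ML parser.

```ocaml
type entry = {
  id : int option;
  first : string;
  last : string;
  min : int;
  max : int;
  avg : int;
}

(* The five sample records, transcribed by hand. *)
let entries = [
  { id = Some 122; first = "Joe"; last = "Wright"; min = 45; max = 95; avg = 79 };
  { id = None; first = "Ed"; last = "Wood"; min = 10; max = 47; avg = 31 };
  { id = Some 124; first = "Chris"; last = "Nolan"; min = 80; max = 93; avg = 85 };
  { id = Some 125; first = "Tim"; last = "Burton"; min = 30; max = 82; avg = 71 };
  { id = Some 126; first = "George"; last = "Lucas"; min = 32; max = 62; avg = 40 };
]

(* Look up a director by name and return the average score, if present. *)
let avg_of ~first ~last (es : entry list) : int option =
  List.find_opt (fun e -> e.first = first && e.last = last) es
  |> Option.map (fun e -> e.avg)
```

Even this one lookup needs a record type, a traversal and an option-handling step, which is the overhead a declarative query language is meant to remove.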
From Ad hoc Data To XML
• XML
  – Encoding for semi-structured data.
  – Good match!
• XQuery
  – Declarative XML query language for semi-structured sources.
  – Standardized by W3C, many implementations.
PADX = PADS + XQuery
• Galax [Fernandez, et al.]
  – Complete, open-source XQuery implementation.
• PADX
  – Integrates PADS and Galax.
  – Supports declarative queries over ad hoc data sources.
Using PADX
• User describes format in PADS.
• PADX provides
  – XML “view” of data in XML Schema.
  – Customized XQuery engine.
    • Query PADS-specific and other XML sources.
• User provides
  – Ad hoc data
  – Queries expressed in XQuery.
Describing MBS Format
• Example Movie-director Bowling Score data
n/aEd|Wood|10|47|31
• PADS/ML Description
ptype MBS-Entry = { id: Id; first: Pstring('|'); '|'; last: Pstring('|'); '|'; scores: Scores }
Viewing and Querying MBS • Virtual XML view
Challenges & Solutions
• Semantics
  – Map PADS language to XML Schema.
• Re-engineer Galax Data Model
  – Create abstract data model.
  – Generate description-specific concrete data models.
• Efficiently query large-scale data sources.
  – Provide lazy access to data.
  – Implement custom memory-management.