878bc352800cf600beee4a222c9fffe0.ppt
- Number of slides: 26
The Next 700 Data Description Languages Yitzhak Mandelbaum Princeton University Computer Science Collaborators: Kathleen Fisher and David Walker
“The Next 700 …” [Diagram: Program(s) : Programming Language(s) : PL Semantics, in parallel with Data Format(s) : Data Description Language(s) : DDL Semantics]
What Data Needs Describing? • Much data lives in databases and common formats like XML, but much data is ad hoc. • Ad hoc data lacks readily available tools for parsing, querying, analysis or transformation. • It’s all over the place: finance, telecom, chemistry, physics, biology, etc.
Ad Hoc Data in Biology
!autogenerated-by: DAG-Edit version 1.419 rev 3
!saved-by: gocvs
!date: Fri Mar 18 21:00:28 PST 2005
!version: $Revision: 3.223 $
!type: % is_a is a
!type: < part_of part of
!type: ^ inverse_of inverse of
!type: | disjoint_from disjoint from
$Gene_Ontology ; GO:0003673
<biological_process ; GO:0008150
%behavior ; GO:0007610 ; synonym: behaviour
%adult behavior ; GO:0030534 ; synonym: adult behaviour
%adult feeding behavior ; GO:0008343 ; synonym: adult feeding behaviour
% feeding behavior ; GO:0007631
%adult locomotory behavior ; GO:0008344 ; . . .
from www.geneontology.org
Ad Hoc Data in Chemistry O=C([C@@H]2 OC(C)=O)[C@@]3(C)[C@](CO 4) (OC(C)=O)[C@H]4 C[C@@H]3 O)([H])[C@H] (OC(C 7=CC=CC=C 7)=O)[C@@]1(O)[C@@](C)(C)C 2=C(C) [C@@H](OC([C@H](O)[C@@H](NC(C 6=CC=CC=C 6)=O) C 5=CC=CC=C 5)=O)C 1
Ad Hoc Data from Web Server Logs (CLF)
207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30
tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941
Ad Hoc Data: DNS packets 0000: 9192 d 8 fb 8480 0001 05 d 8 0000 0872 00000010: 6573 6561 7263 6803 6174 7403 636 f 6 d 00 00000020: 00 fc 0001 c 0006 0001 0000 0 e 10 0027 00000030: 036 e 7331 c 00 c 0 a 68 6 f 73 746 d 6173 7465 00000040: 72 c 0 0 c 77 64 e 5 4900 000 e 1000 0003 8400 00000050: 36 ee 8000 000 e 10 c 00 0 f 00 0100 000 e 00000060: 1000 0 a 05 6 c 69 6 e 75 78 c 0 0 c 00 00000070: 0 f 00 0100 000 e 1000 0 c 00 0 a 07 6 d 61 696 c 00000080: 6 d 61 6 ec 0 0 c 00 0100 000 e 1000 00000090: 0487 cf 1 a 16 c 0 0 c 00 0200 0100 000 e 1000 000000 a 0: 0603 6 e 73 30 c 00 0200 0100 000 e 000000 b 0: 1000 02 c 0 2 e 03 5 f 67 63 c 0 0 c 00 2100 000000 c 0: 0002 5800 1 d 00 0000 640 c c 404 7068 7973 000000 d 0: 0872 6573 6561 7263 6803 6174 7403 636 f . . . . r esearch. att. com. . . . '. ns 1. . . hostmaste r. . wd. I. . 6. . . . . linux. . . . mail man. . . . ns 0. . . . _gc. . . !. . . X. . . d. . . phys. research. att. co
Data Description Languages • Data description languages describe many ad hoc formats and provide the following features: – Descriptions serve as documentation, including the semantics of the data. – A compiler generates tools from a description: parser, printer, query engine, converter to XML, statistical profiler, etc. – Generated parsers include robust error detection and recovery. – Parsers can handle high data volumes. • > 1 GB/second of Netflow traffic from Cisco routers.
Many Data Description Languages • Logical descriptions – ASN.1 – ASDL • Physical descriptions – PacketTypes (SIGCOMM ‘00) – DataScript (GPCE ‘02) – PADS (PLDI ‘05) • Basis for current work [Figure: physical bit-level data (00101 101001001 111001010100) contrasted with its logical view]
Contributions • A core data description calculus (DDC) – Based on dependent type theory – Simple, orthogonal, composable types – Types are transducers from an external data source to an internal data representation. • Encodings of high-level DDLs in the low-level DDC – Explain the semantics of the PADS language in particular. [Figure: PacketTypes, PADS and DataScript each encoded in the DDC]
Base Types and Sequences
• C(e): base type, which can be parameterized by an expression e.
• Σx:T.T’: dependent product, which describes a sequence of values.
– Variable x gives a name to the first value in the sequence.
• Examples (input; type; parsed value):
– “123hello|”; int * string(‘|’) * char; (123, “hello”, ‘|’)
– “3513”; width: int_fw(1). int_fw(width); (3, 513)
– “:hello:”; term: char. string(term) * char; (‘:’, “hello”, ‘:’)
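The dependency in the “3513” example can be sketched in Python as a parser combinator. This is only an illustration under stated assumptions: the names `int_fw` and `dep_pair`, the `(data, offset) -> (new_offset, value)` parser shape, and the use of exceptions for failure are choices made here, not part of the DDC or PADS.

```python
from typing import Any, Callable, Tuple

# Assumed parser shape: (data, offset) -> (new_offset, value).
Parser = Callable[[str, int], Tuple[int, Any]]

def int_fw(width: int) -> Parser:
    """Fixed-width integer: read exactly `width` ASCII digits."""
    def parse(data: str, off: int) -> Tuple[int, int]:
        text = data[off:off + width]
        if len(text) != width or not (text.isascii() and text.isdigit()):
            raise ValueError(f"expected {width} digits at offset {off}")
        return off + width, int(text)
    return parse

def dep_pair(first: Parser, rest: Callable[[Any], Parser]) -> Parser:
    """Dependent sequence: the second parser is built from the first value."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        off1, x = first(data, off)
        off2, y = rest(x)(data, off1)
        return off2, (x, y)
    return parse

# width: int_fw(1). int_fw(width)
sized_int = dep_pair(int_fw(1), lambda width: int_fw(width))
print(sized_int("3513", 0))  # (4, (3, 513))
```

The first digit is parsed as `width`, and that value decides how many digits the second parser reads.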
Constraints
• {x: T | e}: set types constrain the type T and express relationships between elements of the data.
• Examples (input; type; parsed value):
– ‘a’; {c: char | c = ‘a’} (abbreviated Sc(‘a’)); inl ‘a’
– “101”, “82”; {x: int | x > 100}; inl 101, inr error(82)
– “43|105|67”; min: int. Sc(‘|’) * max: {m: int | min ≤ m}. Sc(‘|’) * {avg: int | min ≤ avg & avg ≤ max}; (43, inl ‘|’, inl 105, inl ‘|’, inl 67)
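A set type can be sketched in Python as a wrapper that parses with the underlying type and then tags the value according to the constraint. The helper names `parse_int` and `set_type` and the tuple encoding of `inl` / `inr error(...)` are assumptions made for this sketch, not the calculus's own definitions.

```python
from typing import Any, Callable, Tuple

# Assumed parser shape: (data, offset) -> (new_offset, value).
Parser = Callable[[str, int], Tuple[int, Any]]

def parse_int(data: str, off: int) -> Tuple[int, int]:
    """Base type int: read one or more ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])

def set_type(base: Parser, pred: Callable[[Any], bool]) -> Parser:
    """{x: T | e}: parse with T, then tag the value by whether e holds."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        new_off, value = base(data, off)
        if pred(value):
            return new_off, ("inl", value)
        # The value was read, but the semantic constraint failed.
        return new_off, ("inr", ("error", value))
    return parse

# {x: int | x > 100}
over_100 = set_type(parse_int, lambda x: x > 100)
print(over_100("101", 0))  # (3, ('inl', 101))
print(over_100("82", 0))   # (2, ('inr', ('error', 82)))
```

Note that a constraint violation still consumes the input and keeps the offending value, which matches the `inr error(82)` result on the slide.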
Unions and the Empty String
• true: matches the empty string.
• T + T’: deterministic, exclusive or: try T; on failure, try T’.
• Examples (input; type; parsed value):
– “54”, “n/a”; int + Ss(“n/a”); inl 54, inr “n/a”
– “2341”, “”; int + true; inl 2341, inr ()
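The “try T; on failure, try T’” behaviour can be sketched as an ordered-choice combinator. The names `union`, `str_lit` and `true_type` are hypothetical, and failure is modelled with Python exceptions purely for brevity; the calculus itself records errors rather than raising them.

```python
from typing import Any, Callable, Tuple

# Assumed parser shape: (data, offset) -> (new_offset, value).
Parser = Callable[[str, int], Tuple[int, Any]]

def parse_int(data: str, off: int) -> Tuple[int, int]:
    """Base type int: read one or more ASCII digits."""
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])

def str_lit(text: str) -> Parser:
    """Ss(text): match exactly the given string."""
    def parse(data: str, off: int) -> Tuple[int, str]:
        if not data.startswith(text, off):
            raise ValueError(f"expected {text!r} at offset {off}")
        return off + len(text), text
    return parse

def true_type(data: str, off: int) -> Tuple[int, tuple]:
    """true: match the empty string, consuming nothing."""
    return off, ()

def union(left: Parser, right: Parser) -> Parser:
    """T + T': try the left branch; only if it fails, try the right one."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        try:
            new_off, value = left(data, off)
            return new_off, ("inl", value)
        except ValueError:
            new_off, value = right(data, off)
            return new_off, ("inr", value)
    return parse

int_or_na = union(parse_int, str_lit("n/a"))
int_or_nothing = union(parse_int, true_type)
print(int_or_na("54", 0), int_or_na("n/a", 0))
print(int_or_nothing("2341", 0), int_or_nothing("", 0))
```

Because the choice is ordered, the right branch is never consulted once the left branch succeeds, which is what makes the union deterministic.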
Array Features • What features do we need to handle data sequences? – Elements – Separator between elements – Termination condition (“are we done yet?”) – Terminator after sequence • Examples: “192.168.1.1”, “Bill|Cathy|Jane|Bob;”
False and Arrays
• T seq(Ts; e, Tt) specifies:
– Element type T
– Separator type Ts
– Termination condition e
– Terminator type Tt
• false: reads nothing, flagging an error.
• Example: IP address (input; type; parsed value):
– “192.168.1.1”; int seq(Sc(‘.’); len 4, false); [192, 168, 1, 1]
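One plausible reading of the four array ingredients is sketched below. The loop order (check the termination condition, then look for the terminator, then read separator and element) is an assumption of this sketch, as are the helper names `seq`, `char_lit`, `false_type` and `word`; the slides do not spell out the exact operational details.

```python
from typing import Any, Callable, List, Tuple

# Assumed parser shape: (data, offset) -> (new_offset, value).
Parser = Callable[[str, int], Tuple[int, Any]]

def parse_int(data: str, off: int) -> Tuple[int, int]:
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])

def word(data: str, off: int) -> Tuple[int, str]:
    """Hypothetical element type: one or more alphabetic characters."""
    end = off
    while end < len(data) and data[end].isalpha():
        end += 1
    if end == off:
        raise ValueError(f"expected a letter at offset {off}")
    return end, data[off:end]

def char_lit(c: str) -> Parser:
    def parse(data: str, off: int) -> Tuple[int, str]:
        if not data.startswith(c, off):
            raise ValueError(f"expected {c!r} at offset {off}")
        return off + 1, c
    return parse

def false_type(data: str, off: int) -> Tuple[int, Any]:
    """false: reads nothing and always signals an error."""
    raise ValueError("false never matches")

def seq(elem: Parser, sep: Parser,
        done: Callable[[List[Any]], bool], term: Parser) -> Parser:
    def parse(data: str, off: int) -> Tuple[int, List[Any]]:
        items: List[Any] = []
        while not done(items):          # termination condition e
            try:
                off, _ = term(data, off)  # terminator Tt ends the array
                break
            except ValueError:
                pass
            if items:
                off, _ = sep(data, off)   # separator Ts between elements
            off, value = elem(data, off)  # element T
            items.append(value)
        return off, items
    return parse

ip_addr = seq(parse_int, char_lit("."), lambda xs: len(xs) == 4, false_type)
names = seq(word, char_lit("|"), lambda xs: False, char_lit(";"))
print(ip_addr("192.168.1.1", 0))        # (11, [192, 168, 1, 1])
print(names("Bill|Cathy|Jane|Bob;", 0))
```

With `false_type` as the terminator, only the length condition can end the IP address array; with a condition that is never true, only the `;` terminator can end the list of names.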
Abstraction and Application
• Types can be parameterized over values: λx.T
• Correspondingly, types can be applied to values: T e
• Example: IP address with terminator (input; type; parsed value):
– IP_addr = λterm. int seq(Sc(‘.’); len 4, Sc(term))
– “1.2.3.4|”; IP_addr ‘|’ * Sc(‘|’); ([1, 2, 3, 4], inl ‘|’)
Absorb, Compute and Scan
• Absorb, compute and scan are active types.
– absorb(T): consume data from the source; produce nothing.
– compute(e: s): consume nothing; output the result of computation e, of type s.
– scan(T): scan the data source for type T.
• Examples (input; type; parsed value):
– “|”; absorb(Sc(‘|’)); ()
– “10|12”; width: int. Sc(‘|’) * length: int. area: compute(width × length: int); (10, 120)
– “^%$!&_|”; scan(Sc(‘|’)); (6, inl ‘|’)
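The three active types can be sketched as small Python combinators. All names here (`absorb`, `compute`, `scan`, `char_lit`, `rectangle`) are illustrative assumptions. For simplicity the character literal returns the bare character rather than an `inl`-tagged value, and `rectangle` returns all three named values rather than the abbreviated pair shown on the slide.

```python
from typing import Any, Callable, Tuple

# Assumed parser shape: (data, offset) -> (new_offset, value).
Parser = Callable[[str, int], Tuple[int, Any]]

def parse_int(data: str, off: int) -> Tuple[int, int]:
    end = off
    while end < len(data) and data[end] in "0123456789":
        end += 1
    if end == off:
        raise ValueError(f"expected a digit at offset {off}")
    return end, int(data[off:end])

def char_lit(c: str) -> Parser:
    def parse(data: str, off: int) -> Tuple[int, str]:
        if not data.startswith(c, off):
            raise ValueError(f"expected {c!r} at offset {off}")
        return off + 1, c
    return parse

def absorb(inner: Parser) -> Parser:
    """Consume what `inner` matches, but produce only the unit value."""
    def parse(data: str, off: int) -> Tuple[int, tuple]:
        new_off, _ = inner(data, off)
        return new_off, ()
    return parse

def compute(fn: Callable[[], Any]) -> Parser:
    """Consume nothing; the value is the result of a computation."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        return off, fn()
    return parse

def scan(inner: Parser) -> Parser:
    """Skip forward until `inner` matches; report how much was skipped."""
    def parse(data: str, off: int) -> Tuple[int, Any]:
        for start in range(off, len(data) + 1):
            try:
                new_off, value = inner(data, start)
            except ValueError:
                continue
            return new_off, (start - off, value)
        raise ValueError("scan: no match before end of input")
    return parse

def rectangle(data: str, off: int) -> Tuple[int, Any]:
    off, width = parse_int(data, off)
    off, _ = absorb(char_lit("|"))(data, off)
    off, length = parse_int(data, off)
    off, area = compute(lambda: width * length)(data, off)
    return off, (width, length, area)

print(absorb(char_lit("|"))("|", 0))      # (1, ())
print(rectangle("10|12", 0))              # (5, (10, 12, 120))
print(scan(char_lit("|"))("^%$!&_|", 0))  # (7, (6, '|'))
```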
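Type Kinding
• Kinding ensures types are well formed. Sample rules:
– Application: if Γ |- T : s → k and Γ |- e : s, then Γ |- T e : k
– Union: if Γ |- T : type and Γ |- T’ : type, then Γ |- T + T’ : type
– Set type: if Γ |- T : type and Γ, x: s |- e : bool, then Γ |- {x: T | e} : type (s = …)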
Parsing Semantics of Types • Semantics are expressed as parsing functions written in the polymorphic λ-calculus. – Sem(T): DDC Type → Function – Input: data and offset; output: new offset, value and parse descriptor. – For specifics, see the upcoming technical report.
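The shape of such a parsing function (data and offset in; new offset, value and parse descriptor out) can be sketched in Python. The `ParseDescriptor` fields and the use of `None` for a missing value are assumptions for illustration only; the actual parse descriptor structure is defined in the authors' technical report, not here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ParseDescriptor:
    """Hypothetical parse descriptor: records errors found while parsing."""
    error_count: int = 0
    message: str = ""

def parse_int(data: bytes, off: int) -> Tuple[int, Optional[int], ParseDescriptor]:
    """Return (new_offset, value, parse_descriptor) instead of raising."""
    end = off
    while end < len(data) and data[end:end + 1].isdigit():
        end += 1
    if end == off:
        # No digits: consume nothing, produce no value, and report the error.
        return off, None, ParseDescriptor(1, f"expected a digit at offset {off}")
    return end, int(data[off:end]), ParseDescriptor()

print(parse_int(b"12ab", 0))
print(parse_int(b"ab", 0))
```

Returning the descriptor alongside the value lets a caller continue past bad data and inspect the errors afterwards, rather than aborting at the first failure.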
Types of Parser Output
• Parsers produce values with the following types in the host language (DDC type = host language type):
– Base types: [C(e)]rep = I(C) + noval (noval: unrecoverable error); [true]rep = unit
– Products: [Σx:T.T’]rep = [T]rep * [T’]rep (dependency erased)
– Abs. and App.: [λx.T]rep = [T e]rep = [T]rep
– Union: [T + T’]rep = [T]rep + [T’]rep + noval
– Set types: [{x: T | e}]rep = [T]rep + ([T]rep error) (right summand: semantic error)
Properties of the Calculus
• Theorem: If |- T : k then
– [T] = F (well-formed types yield parsers)
– |- F : bits * offset → offset * [T]rep * [T]pd (a T-parser returns values with types that correspond to T)
• Theorem: Parsers report errors accurately.
– Errors in the parse descriptor correspond to actual errors in the data.
– Parsers check all semantic constraints.
– More …
Making Use of the Calculus: IPADS
t ::= C(e) | Pfun(x: s) = t | t e | Pstruct{fields} | Punion{fields} | Pswitch e of {alts tdef;} | Popt t | t Pwhere x.e | Palt{fields} | t [t; e, t] | Pcompute e | Plit c
fields ::= (empty) | fields x : t;
alts ::= (empty) | alts e => t;
Translation from IPADS to DDC: |- t ⇓ T
Example: Popt and Plit
• DDC types available: true, T1 + T2, C(e), {x: T | e}, absorb(T), scan(T)
• Popt: if |- t ⇓ T, then |- Popt t ⇓ T + true
• Plit: if |- c : char, then |- Plit c ⇓ scan(absorb({x: char | x = c}))
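The two encodings can be sketched as a syntax-to-syntax translation over a toy abstract syntax. The tuple-based representation and the tag names (`"base"`, `"sum"`, `"set"`, and so on) are invented for this sketch and are not the authors' implementation.

```python
from typing import Any, Tuple

Ast = Tuple[Any, ...]

def translate(t: Ast) -> Ast:
    """Translate a tiny IPADS fragment into DDC type syntax."""
    tag = t[0]
    if tag == "base":
        # Base types C(e) carry over unchanged.
        return t
    if tag == "Popt":
        # Popt t  becomes  T + true
        return ("sum", translate(t[1]), ("true",))
    if tag == "Plit":
        # Plit c  becomes  scan(absorb({x: char | x = c}))
        constraint = ("set", "x", ("base", "char"), ("eq", "x", t[1]))
        return ("scan", ("absorb", constraint))
    raise ValueError(f"unsupported IPADS form: {tag}")

print(translate(("Popt", ("base", "int"))))
print(translate(("Plit", "|")))
```

Each high-level form expands into a fixed combination of core types, so the meaning of `Popt` and `Plit` follows from the semantics of sums, set types, `absorb` and `scan`.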
Example: Pswitch
• DDC types used: T + T’, λx.T, {x: T | e}
• If |- ti ⇓ Ti (i = 1…n) and |- tdef ⇓ Tdef, then |- Pswitch e of {e1 => t1; e2 => t2; … tdef} ⇓ (λc. {x: T1 | c = e1} + {x: T2 | c = e2} + … + Tdef) e
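The Pswitch encoding can be sketched as a function that builds the DDC term from already-translated branches. The tuple encoding, the tag names and the right-nested grouping of the sum are assumptions of this sketch; the slide does not say how the sum associates.

```python
from typing import Any, List, Tuple

Ast = Tuple[Any, ...]

def translate_pswitch(selector: Any,
                      alts: List[Tuple[Any, Ast]],
                      default: Ast) -> Ast:
    """Build (λc. {x:T1 | c = e1} + ... + Tdef) e from translated branches.

    `alts` pairs each guard expression e_i with its DDC type T_i.
    """
    body = default
    # Fold from the right so the first alternative is tried first.
    for guard, ddc_type in reversed(alts):
        guarded = ("set", "x", ddc_type, ("eq", "c", guard))
        body = ("sum", guarded, body)
    return ("app", ("lam", "c", body), selector)

term = translate_pswitch(
    "tag",
    [(1, ("base", "int")), (2, ("base", "string"))],
    ("base", "char"),
)
print(term)
```

Each branch becomes a set type whose constraint compares the selector with that branch's guard, so the ordered union falls through to the first branch whose guard matches, and to the default otherwise.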
Future Work • What is the set of languages recognized by the DDC? • How does the expressive power of the DDC relate to CFGs and regular expressions? • Implement recursive types in the PADS system, based on the recursive types of the DDC. • Add polymorphism to the DDC and PADS.
Summary • Data description languages are well suited to describing ad hoc data. • No one DDL will ever be right: different domains and applications will demand different languages with differing levels of expressiveness and abstraction. • Our work defines the first semantics for data description languages. • For more information, visit www.padsproj.org.