- Number of slides: 89
1 Information Management in P2P
Serge Abiteboul, INRIA-Futurs and Univ. Paris 11
P2P Data Management, 2006, S. Abiteboul
2 Introduction
3 Success stories at the time of the Internet bubble
• Google: management of Web pages
• Mapquest: management of maps
• Amazon: book catalogue
• eBay: product catalogue
• Napster (eMule, BearShare, etc.): music database
• Flickr: picture database
• Wikipedia: dictionary
• del.icio.us: annotations
• In France: Meetic (dating database), Kelkoo (comparative shopping)
They are all about publishing some database
4 The trend is towards peer-to-peer and interactivity
• P2P: a large and varying number of computers cooperate to solve some particular task without any centralized authority
• Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network
  – seti@home; kazaa; cabal
• Switch from centralized servers to communities and syndication
• Interaction and Web 2.0
• Motivations: social, organizational
5 Information management in a P2P network
• Private terminology: data ring
• Information is heterogeneous, distributed, replicated, dynamic
• Which info: data + meta-data + knowledge + services
• Peers are heterogeneous, autonomous and possibly mobile
  – From sensors to PDAs to mainframes
• Typically a very large number of peers
• Variety of requirements: QoS, performance, security, etc.
6 Acknowledgement
• Xyleme: scalable XML warehousing
  – Sophie Cluet, Guy Ferran (Xyleme) & many others
• ActiveXML: language for P2P data management
  – Omar Benjelloun (Google), Ioana Manolescu, Tova Milo (Tel Aviv) & many others
• KadoP: P2P scalable XML indexing
  – Ioana Manolescu, Nicoleta Preda & others
• Data Ring: infrastructure for P2P data management
  – Alkis Polyzotis (UC Santa Cruz)
7 Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
4. Indexing in P2P (KadoP)
5. Conclusion
Goal of the tutorial: present issues and technology on P2P information management
Warning: it is very biased – it is not a survey
9 1. Introduction – the data ring
10 The information in a data ring
• Data: tuples, collections, documents, relations…
• Services: data sources, possibly some processing
• Meta-data about resources: attribute/value pairs, annotations
• Ontologies to explain data and metadata
• View definitions
• Data integration information, e.g., mappings between ontologies
• Physical data: indices and materialized views
11 Functionalities of the data ring
• Storage, persistence, replication
• Indexing, caching, querying, updating, optimization
• Schema management, access control
• Fault tolerance, self tuning, monitoring
• Resource discovery, history, provenance, annotations, multilingualism
• Semantic enrichment, uncertain data
Each functionality may be achieved by a peer or by the network
12 And now, what is a peer?
• Examples: a mainframe database, a manufacturing tool, a file system, a telecom equipment, a Web server, a toy, a PC, another data ring, a PDA, a telephone, a sensor, a home appliance, a car
• Any connected device or software with some information to share
• A net address and some names of resources (e.g. document or service)
13 Advantages and disadvantages of P2P
Advantages
• Scaling
• Performance: optimization of parallelism; avoid bottleneck; replication
• Availability: replication
• Cost: avoid the cost of server; share operational cost
• Dynamicity: add/remove new data sources
Disadvantages
• Complexity
• Cost for complex queries; communication cost
• Availability: peers can leave
• Consistency maintenance: difficult to support transactions
• Quality: difficult to guarantee quality
14 Crash course on Web standards
• Data exchange format = XML
  – Labeled ordered trees
  – Its main asset = XML schema
  – There is much more
• Distributed computing protocol = Web services
  – SOAP: Simple Object Access Protocol
  – WSDL: Web Services Description Language
  – UDDI: Universal Description, Discovery and Integration
  – BPEL: Business Process Execution Language
• Query languages = XPath and XQuery
  – Declarative query language for XML + full-text + update language
• Knowledge representation = OWL or RDF/S
[Figure: the standards XML, OWL, RDFS, SOAP, WSDL, XQuery, XPath]
15 Information used to live in islands, but with the Web this is changing: uniform access to information… the dream for distributed data management
16 Do you like the standards?
• It is the wrong question!
• Correct questions: What can you do with it? What is missing?
• Is XQuery the ultimate query language for the Web? No
  – It is a language for querying centralized XML
  – We will see what it is missing
• We will not talk much about semantics
17 Automatic and distributed management of the data ring
• No centralized server
• No information administrator (no info manager)
• Most users are non-experts
  – E.g., scientists
• Requirements
  – Ease of deployment (zero-effort)
  – Ease of administration (zero-effort)
  – Ease of publication (epsilon-effort)
  – Ease of exploitation (epsilon-effort)
  – Participation in community building, notably via annotations
[Figure: happy database admin]
18 What should be made automatic
• Self-statistics from the monitoring of the data ring
  – In particular, define the statistics that are needed*
• Self-tuning based on the self-statistics
  – Choose the most appropriate organization
  – Decide to install access structures: indexes, views, etc.
  – Control replication of data and services
• Self-healing
  – Recovery from errors
  – E.g., replacement of a failing Web service
• And automatic file management
19 Any hope?
• Technology exists (database self-tuning, machine learning, etc.)
• But self-tuning for databases has advanced very slowly
• Why can this work?
1. There is no alternative (for db, this was just a cool gadget)
2. KISS (keep it simple stupid!)
3. The power of parallelism
4. This is assuming lots of machines have free cycles (true) and bandwidth is generous (not always true)
20 Distributed access control
• Goal: control access to ring resources
• Access to resources is based on access rights (ACL)
• Who is controlling ACLs?
• A node manages ACLs for a collection of distributed resources
  – Easy but against the spirit and possible bottleneck
• The network manages access control
  – Anybody can get the data
  – The data is published with encryption and signatures; only nodes with proper access rights can perform reads/writes
  – Some techniques exist
21 Monitoring
• What is monitored?
  – Web service calls and database updates
  – The Web: Web pages, RSS feeds
• What is produced? A stream of events
  – As a continuous service
  – As an RSS feed
  – As a Web site/page
• Info-surveillance
• Self-statistics and tracing
• Basis for error diagnosis
22 Streams are everywhere
• In query processing
• In indexing (KadoP)
• In recursive queries (AXML-QSQ)
• In messaging, monitoring and pub/sub
That is why we will use an algebra over streams of trees and not simply an algebra over trees
23 Example: Edos distribution system
• A system for the management of Linux distributions
• Joint work with Mandriva Software and U. Tel Aviv
• Community of open-source developers: thousands
• System releases: about 10 000 software packages + metadata
• Functionalities
  – Query the metadata
  – Query subscription
  – Retrieve packages
  – Publish a new release or update an existing one
24 Example: WebContent, an ANR platform for the management of web content
• Web surveillance
  – Business, technical, web watching…
• Participation of Gemo
  – WP3: knowledge
  – WP5: P2P content management
• Partners: CEA, EADS, Thales, Bongrain, Xyleme, Exalead, many research groups (UVSQ, Grenoble, Paris 6, etc.)
25 Taxonomy of such applications
• Parameters
  – Number of peers and quantity of data
  – How volatile the peers are
  – The query/update workload
  – The functionalities that are desired
• Edos: peers and documents in thousands, mostly append for updates, peers not too volatile
• An extreme: Google search engine in P2P for billions of documents using millions of hyper-volatile peers
• Mostly interested in the first case
26 Thesis
• XQuery is fine for local XML processing and publishing
• Not sufficient for distributed data management
• The success of the relational model, i.e., of tables on a server:
1. A logic for defining tables
2. An algebra for describing query plans over tables
• By analogy, we need for trees in a P2P system:
1. A logic for defining distributed tree data and data services
2. An algebra for describing query plans over these
• Proposal: ActiveXML logic and algebra
28 2. Active XML: a logic for distributed data management
29 The basis
• AXML is a declarative language for distributed information management and an infrastructure to support the language in a P2P framework
• Simple idea: XML documents with embedded service calls
• Intensional data
  – Some of the data is given explicitly, whereas for some, its definition (i.e. the means to acquire it when needed) is given
• Dynamic data
  – If the data sources change, the same document will provide different information
30 Example (omitting syntactic details)
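The idea of an XML document with embedded service calls can be illustrated with a minimal Python sketch. It is not the AXML syntax or the ActiveXML API: the classes `Element` and `ServiceCall`, the function `materialize`, and the `getTemp` service are all hypothetical names chosen for illustration, and remote Web services are replaced by local Python functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Union


@dataclass
class Element:
    """Extensional data: a labeled node with ordered children."""
    label: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""


@dataclass
class ServiceCall:
    """Intensional data: a call that yields nodes when it is activated."""
    service: str
    args: Dict[str, str] = field(default_factory=dict)


Node = Union[Element, ServiceCall]
# Hypothetical registry standing in for remote Web services.
Registry = Dict[str, Callable[..., List[Node]]]


def materialize(node: Node, services: Registry) -> List[Element]:
    """Return a copy of the tree in which every call is replaced by its result."""
    if isinstance(node, ServiceCall):
        out: List[Element] = []
        for result in services[node.service](**node.args):
            # A result may itself contain calls, so recurse.
            out.extend(materialize(result, services))
        return out
    new_children: List[Node] = []
    for child in node.children:
        new_children.extend(materialize(child, services))
    return [Element(node.label, new_children, node.text)]


# Assumed data source behind the hypothetical getTemp service.
temps = {"Paris": "16"}
services: Registry = {
    "getTemp": lambda city: [Element("temp", text=temps[city])],
}

doc = Element("city", [
    Element("name", text="Paris"),
    ServiceCall("getTemp", {"city": "Paris"}),
])

first = materialize(doc, services)[0]
temps["Paris"] = "21"          # the data source changes
second = materialize(doc, services)[0]
```

The document `doc` is never modified, yet the two materializations differ: this is the "dynamic data" aspect, while the unevaluated `ServiceCall` node is the "intensional data" aspect.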
31 Marketing Philosophy
• Active answer = intensional and dynamic and flexible
• Embedding calls in data is an old idea in databases
• Manon: What's the capital of Brazil? Dad: Let's ask Wikipedia.com!
• Manon: How do I get a cheap ticket to Galapagos? Dad: Let's place a subscription on LastMinute.com!
• Manon: What are the countries in the EC? Dad: France, Germany, Holland, Belgium, and hum… Let's ask YouLists.com for more!
32 Active XML peer
• Peer-to-peer architecture
• Each Active XML peer
  – Repository: manages Active XML data
  – Web client: calls the services inside a document
  – Web server: provides (parameterized) queries/updates over the repository as web services
• Exchange of AXML instead of XML
[Figure: AXML peers communicating via SOAP]
33 What is an AXML peer?
• PC
  – Now open source (ObjectWeb); queries in OQL
• Peer on a mass storage system
  – eXist (open source XML database): queries in XQuery
  – Xyleme: queries in XyQL
• PDA or cell phone
  – Persistence in file system and XPath
• Ongoing: the entire network
  – Data is stored in a P2P network: KadoP
• More: Java card, a relational database…
34 A key issue: call activation
• When to activate the call?
  – Explicit pull mode: active databases
  – Implicit pull mode: deductive databases
  – Push mode: query subscription
• What to do with its result?
• How long is the returned data valid?
  – Mediation and caching
• Where to find the arguments?
  – Under the service call: XML, XPath or a service call
35 Another key issue: what to send?
• Send some AXML tree t
  – As result of a query or as parameter of a call
• The tree t contains calls: do we have to evaluate them?
  – If we do, we may introduce new service calls: do we have to evaluate all these calls before transmitting the data?
• Hi John, what is the phone number of the Prime Minister of France?
  – Find his name at whoswho.com, then look in the phone directory
  – Look in the yellow pages for de Villepin's number, in the phone directory of www.gov.fr
  – (33) 01 56 00 01
36 Active XML: cool idea – complex problems
• Blasphemous claim: Active XML is the proper paradigm for data exchange!
  – Not XML + not XQuery
• Brings to a unique setting distributed db, deductive db, active db, stream data warehousing, mediation
• This is unreasonable? Yes!
• Plenty of work ahead… to make it work
• But first, the algebra
37 Outline
1. Introduction – the data ring
2. Calculus for P2P data management (ActiveXML)
3. Algebra for P2P data management (ActiveXML algebra)
  – Query processing
  – Query optimization
4. Indexing in P2P (KadoP)
5. Conclusion
38 3. Active XML algebra
39 Motivation
• Relational model: centralized tables; optimization: algebraic expression and rewriting
• Active XML model: distributed trees; optimization: algebraic expression and rewriting
• Distributed query optimization based on algebraic rewriting of Active XML trees
• Based on experiences with AXML optimization
40 Active XML peers
• We focus on positive AXML
  – Set-oriented data
  – Positive/monotone services
• Services = tree-pattern-query-with-join queries
• Services produce streams
• Local query processing
  – Optimized by a local query optimizer
  – Evaluated by a local query processor
  – Out of our scope
[Figure: local query plan with π and join operators, reading two input streams and producing one output stream]
41 The problem
• An AXML system
  – A set of peers
  – For each peer, a set of documents and services
• Extensional data is distributed
• Intensional data (knowledge) is distributed
  – Defined using query services (TPQJ queries)
  – These services are generic: any peer can evaluate a query
• A query q to some peer
• Evaluate the answer to q with optimal response time
42 AXML algebra
• (AXML) algebraic expressions
[Figure: expression trees built from labels l, service calls s@p over E1, E2, …, En, documents d@p, node references #n@p', and the operators eval@p, send@p and receive@p over an expression E1]
• Each such expression lives at some peer
• Includes the AXML trees (AXML logic)
43 Algebraic expressions: annotations
• Executing service call and terminated service call are marked by annotations on the call
• Subtlety
  – q@p(5): definition of intensional data
  – eval(q@p(5)): request to evaluate it; during query optimization
  – ●q@p(5): query is being evaluated; during query processing
  – q@p(5) with the "terminated" annotation: query evaluation is complete
44 Evaluation rules: local rules
• eval@p( l(t1, …, tn) ) → l( eval@p(t1), …, eval@p(tn) ), for l ≠ sc
• eval@p( s@p(t1, …, tn) ) → ●s@p( eval@p(t1), …, eval@p(tn) ), for s ≠ send, receive
[Figure: the rules drawn as tree rewritings, including a rule on eval@p over an expression E1]
45 Evaluation rules: transfer rules
[Figure: tree rewriting. eval@p over a call s@p'(t1, t2, …) is rewritten into a node #x@p with receive@p at p, together with (&) a new root newRoot()@p' at p' holding send@p' towards #x@p of the evaluation of s@p'(t1, t2, …) at p']
• Site p asks p' to do the work and send the result to p
46 Evaluation rules: more transfer rules
[Figure: tree rewriting. Under eval@p', send@p' towards #x@p over an executing call ●s@p' with a result Z; the result Z appears under #x@p at p while ●s@p' keeps running at p']
• When a query is evaluated, results appear
• They are sent to the place that requested them
• Also some rules for eof
47 Evaluation
• Reminder: setting
  – An AXML system
  – A request to evaluate query q at peer p: eval@p(q)
• Rewrite the trees in peer workspaces until termination of the process
• Results
  – For positive AXML, this process converges… to a possibly infinite state
  – This process computes the answer to q
• May be fairly inefficient: need for optimization!
48 Optimization
More rewrite rules to evaluate a query more efficiently
49 Query optimization
• Well-known optimization techniques for distributed data management
  – Pushing selections
  – Semijoin reducers
  – Horizontal, vertical, hybrid decomposition
  – Recursive query processing and query-subquery
• Some "specific" AXML optimizations
  – Pushing queries over service calls
  – Lazy service call evaluation
  – …
• Optimizing subscription management
• All are captured by the algebraic framework
50 Example: pushing selections
Suppose q ≡ q1(σ(q2)):
eval@p( q(d@p2) )
→ eval@p( q1(σ(q2(d@p2))) )
→ q1@p( eval@p2( σ(q2(d)) ) )
→ q1@p( σ@p2(q2@p2(d@p2)) )
Same rule applies if d@p2 is replaced by a continuous query
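The benefit of pushing a selection to the peer that holds the data can be illustrated with a small Python sketch. It assumes two peers simulated in one process, and it counts shipped rows as a stand-in for communication cost; the function names `naive_plan` and `pushed_plan` and the sample relation are hypothetical, not part of the AXML algebra.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


def naive_plan(d_at_p2: List[Row], pred: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Ship all of d from p2 to p, then apply the selection at p."""
    shipped = list(d_at_p2)                      # transfer p2 -> p
    answer = [row for row in shipped if pred(row)]
    return answer, len(shipped)


def pushed_plan(d_at_p2: List[Row], pred: Callable[[Row], bool]) -> Tuple[List[Row], int]:
    """Apply the selection at p2 and ship only the qualifying rows to p."""
    shipped = [row for row in d_at_p2 if pred(row)]   # evaluated at p2
    return list(shipped), len(shipped)


# Assumed sample data held by p2: 100 rows, B cycles through 0..9.
d = [{"A": i, "B": i % 10} for i in range(100)]


def is_zero(row: Row) -> bool:
    return row["B"] == 0


naive_answer, naive_cost = naive_plan(d, is_zero)
pushed_answer, pushed_cost = pushed_plan(d, is_zero)
```

Both plans return the same answer, but under these assumptions the pushed plan ships 10 rows instead of 100.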
51 Example: interleaving of processing and optimization
At peer i: di = ri ∪ di+1
Query at p1: σ(d1)
1. σ(d1) → σ(r1) ∪ σ(d2)
   eval@p1( σ(r1) ∪ σ(d2) ) → eval@p1( σ(r1) ) ∪ eval@p1( σ(d2) )
   eval@p1( σ(r1) ) → ●σ(r1) (starts streaming data)
2. σ(d2) → σ(r2) ∪ σ(d3); σ(r2) starts streaming data
3. σ(d3) → σ(r3) ∪ σ(d4)
…
[Figure: the streams σ(r1), σ(r2), σ(r3), σ(r4), …]
52–57 Transfer and load balancing rules
[Figure, animated over six slides: the expression eval@p1(E) is rewritten (→) into a node #x@p1 at p1, together with (&) a new root newRoot@p2() at p2 holding send@p2 towards #x@p1 of eval@p2(E); the last slide shows the rule applied in both directions (→, ←)]
• Peer p1 delegates the evaluation of E to p2
58 Back to interleaved execution and optimization
• Repeated transfers: σ(r1), σ(r2), σ(r3), σ(r4), …
• Data transfers reduced
• More work for p1: merging all the streams
• Hierarchical stream merging
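Hierarchical stream merging can be sketched in Python as a tree of two-input merges, so that no single node has to merge all the streams at once. This is only an illustration under an added assumption, namely that each stream is sorted; in a real deployment each intermediate merge could run at a different peer, which this single-process sketch does not model. The function name `merge_hierarchically` is hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_hierarchically(streams: List[Iterable[int]]) -> Iterator[int]:
    """Merge sorted streams pairwise, level by level, into one sorted stream."""
    if not streams:
        return iter(())
    level: List[Iterator[int]] = [iter(s) for s in streams]
    while len(level) > 1:
        next_level: List[Iterator[int]] = []
        # Pair up neighbours; heapq.merge is lazy, so nothing is buffered.
        for i in range(0, len(level) - 1, 2):
            next_level.append(heapq.merge(level[i], level[i + 1]))
        if len(level) % 2:
            next_level.append(level[-1])     # odd stream moves up unchanged
        level = next_level
    return level[0]


merged = list(merge_hierarchically([[1, 4], [2, 5], [3, 6], [0, 7]]))
```

Because `heapq.merge` consumes its inputs lazily, the result is itself a stream: items flow through the merge tree as they are requested.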
59 Example: horizontal and vertical decomposition
A relation d over ABC that is split both horizontally and vertically:
d = (d1 ∪ d2) ⋈ d3, with d1 = σ_{B<5}(d') and d2 = σ_{B>=5}(d')
d', d1, d2 over AB and d3 over BC; each di is at a peer pi
Consider the query:
σ_{B=0}@p(d)
→ σ_{B=0}@p( (σ_{B<5}(d') ∪ σ_{B>=5}(d')) ⋈ d3@p3 )
→ σ_{B=0}@p( d1@p1 ⋈ d3@p3 )
→ ⋈@p( #x@p receive(d1@p1), #y@p receive(d3@p3) ) & send@p1(#x@p; σ_{B=0}@p1(d1@p1)) & send@p3(#y@p; d3@p3)
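The rewriting above can be checked on a small instance. The Python sketch below uses assumed sample relations (sets of tuples) and a hypothetical helper `join`; it compares the selection applied to the full relation d with the rewritten plan, in which the fragment d2 is pruned (B >= 5 contradicts B = 0) and the selection is applied to d1 before the join.

```python
from typing import Set, Tuple

AB = Tuple[int, int]        # (A, B)
BC = Tuple[int, int]        # (B, C)
ABC = Tuple[int, int, int]  # (A, B, C)


def join(ab: Set[AB], bc: Set[BC]) -> Set[ABC]:
    """Natural join on the shared attribute B."""
    return {(a, b, c) for (a, b) in ab for (b2, c) in bc if b == b2}


# Assumed sample data.
d_prime: Set[AB] = {(a, a % 10) for a in range(20)}
d1 = {t for t in d_prime if t[1] < 5}     # horizontal fragment at p1
d2 = {t for t in d_prime if t[1] >= 5}    # horizontal fragment at p2
d3: Set[BC] = {(b, 100 + b) for b in range(10)}   # vertical fragment at p3

# Reference: build d, then select B = 0.
d = join(d1 | d2, d3)
reference = {t for t in d if t[1] == 0}

# Rewritten plan: select on d1 at p1, ignore d2, join with d3 at p.
rewritten = join({t for t in d1 if t[1] == 0}, d3)
```

On this instance both plans yield the same two tuples, while the rewritten plan never touches d2 and ships only the selected part of d1.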
60 Common sub-expression elimination
eval@p(E), #x@p receive@p(E) → eval@p(#x@p), #x@p receive@p(E)
[Figure: the same rule drawn as a tree rewriting]
61 Common sub-expression elimination
[Figure: tree rewriting over a query q@p' with two calls to s@p. Both calls are replaced by references #x@p' and #y@p' with receive@p' at p', fed by send@p from new roots newRoot()@p at p, so that the result of s@p is shared]
• In spite of the two calls to s@p, the function is evaluated only once
62 Example: recursive query processing
Using a pseudo-Datalog syntax:
s1@p($x, $y) ← d2@p'($x, $z), s2@p'($z, $y)
s2@p'($x, $y) ← d1@p($x, $z), s1@p($y, $z)
After rewriting:
on p: #x@p receive@p( q1@p'(d2@p', s2@p') ); root@p send@p( #y@p', q2@p(d1@p, #x@p) )
on p': root@p' send@p'( #x@p, q1@p'(d2@p', #y@p' receive@p'(s2@p')) )
63 Generic and global services
• q@any: where q is a query
  – Any peer that has some query processor for q can do it
• f@any: where f is a processing service call
  – Example: decryption or gene comparison
• q over a P2P collection
[Figure: eval@p of q over a collection coll is rewritten into q over the index, and then into the evaluation of q at the relevant peers, q@p1 and q@p2]
64 The AXML algebra – conclusion
• Captures distributed XML query processing/optimization
• Based on a communication model à la CCS
• Algebraic – stream-oriented
• Orthogonal to the local XML query optimizer
• Orthogonal to the network support (DHT, small world, etc.)
• What is not yet available? A cost model and heuristics
66 4. P2P XML indexing and query processing
67 Efficient evaluation of tree-pattern-queries
• Many optimization techniques
• We are interested here in distributed query evaluation/optimization
1) We consider XML indexing
2) Holistic twig join that is based on indexing
3) P2P indexing
4) P2P query processing
5) Optimizing P2P indexing
68 XML indexing: structural identifiers
• Structural IDs = Prefix-Postfix-Level
• X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y)
• X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
[Figure: example tree with nodes A, B, C, D, E, F, G and the text "John", each annotated with a (pre, post, level) identifier, e.g. (1, 8, 0) for the root A]
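The two tests above can be tried out with a short Python sketch. It assigns (pre, post, level) identifiers using plain preorder and postorder ranks, which is one common numbering and an assumption here, and applies the ancestor and parent predicates exactly as stated on the slide. The tree is a made-up example, not the one from the figure, and node labels are assumed unique so that they can serve as dictionary keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

StructId = Tuple[int, int, int]   # (pre, post, level)


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def assign_ids(root: Node) -> Dict[str, StructId]:
    """Number nodes by preorder rank, postorder rank and depth."""
    ids: Dict[str, StructId] = {}
    counters = {"pre": 0, "post": 0}

    def visit(node: Node, level: int) -> None:
        counters["pre"] += 1
        pre = counters["pre"]
        for child in node.children:
            visit(child, level + 1)
        counters["post"] += 1
        ids[node.label] = (pre, counters["post"], level)

    visit(root, 0)
    return ids


def is_ancestor(x: StructId, y: StructId) -> bool:
    """pre(X) < pre(Y) and post(X) >= post(Y)."""
    return x[0] < y[0] and x[1] >= y[1]


def is_parent(x: StructId, y: StructId) -> bool:
    """X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x[2] == y[2] - 1


# Hypothetical document: A(B(D("John"), E), C(F))
tree = Node("A", [
    Node("B", [Node("D", [Node("John")]), Node("E")]),
    Node("C", [Node("F")]),
])
ids = assign_ids(tree)
```

With these identifiers, structural relationships are decided by comparing integers, without navigating the document, which is what makes them usable as index postings.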
69 Holistic Twig Join
• Input: a document and a tree pattern query
• Find the bindings of the query in the document
• Holistic = "holistique" in French (the whole and not just the parts)
• Twig = "brindille" in French
• Join = you know…
• Sounds like Harry Potter?
70 Query evaluation over a document
• Query pattern over A, C, D and "John"; one stream of Ids per query node: Ids for A, (1, 8, 0)…; Ids for C; Ids for D; Ids for "John"
• Ids are sorted in lexicographical order
• Goal is to find "matching Ids"
71 The Holistic Twig Join Algorithm
[Figure: query pattern over a, b, c and example document tree, with a (start, end) identifier per node]
• Level 0: r (1, 25)
• Level 1: a (2, 8), c (9, 14), a (15, 17), a (18, 25)
• Level 2: a (3, 5), c (6, 8), b (10, 11), b (12, 14), a (16, 17), b (19, 22), c (23, 25)
• Level 3: c (4, 5), c (7, 8), c (11, 11), b (13, 14), c (17, 17), b (20, 21), c (22, 22), b (24, 25)
• Level 4: a (5, 5), a (8, 8), c (14, 14), c (21, 21), c (25, 25)
72 The Holistic Twig Join Algorithm
[Figure: input streams a1…a7, b1…b6, c1…c11 and stacks Sa, Sb, Sc for the query nodes a, b, c]
• Output shown: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)
• Legend: "This is the end"; "Head of the stream"; "Find the match for the query sub-tree determined by this node"; "The ID is present also in the stack"
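The stack-based idea can be illustrated on the simplest case, a single ancestor-descendant edge. The Python sketch below is a binary structural join over two streams of (start, end) identifiers sorted by start; it is a building block only, not the full holistic twig join, which handles all query nodes at once with one stack per node. The sample identifiers are assumed values in the style of the example tree, and the function names are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[int, int]   # (start, end): end is the last start in the subtree


def structural_join(ancestors: List[Region],
                    descendants: List[Region]) -> List[Tuple[Region, Region]]:
    """Return all (a, d) pairs with a an ancestor of d.

    Both inputs must be sorted by start and come from the same tree,
    so that regions are either nested or disjoint.
    """
    out: List[Tuple[Region, Region]] = []
    stack: List[Region] = []   # chain of nested candidate ancestors
    i = 0
    for d in descendants:
        # Push every candidate ancestor that starts before d.
        while i < len(ancestors) and ancestors[i][0] < d[0]:
            a = ancestors[i]
            while stack and stack[-1][1] < a[0]:
                stack.pop()            # ended before a starts
            stack.append(a)
            i += 1
        # Drop candidates that ended before d starts.
        while stack and stack[-1][1] < d[0]:
            stack.pop()
        # Everything left on the stack contains d.
        out.extend((a, d) for a in stack)
    return out


def brute_force(ancestors: List[Region],
                descendants: List[Region]) -> List[Tuple[Region, Region]]:
    """Quadratic reference implementation used only for checking."""
    return [(a, d) for d in descendants for a in ancestors
            if a[0] < d[0] and d[1] <= a[1]]


a_stream = [(2, 8), (3, 5), (15, 17), (16, 17)]
c_stream = [(4, 5), (7, 8), (17, 17)]
pairs = structural_join(a_stream, c_stream)
```

Each stream is read once, and the stack never holds more entries than the depth of the document, which is the property the holistic algorithm generalizes to whole tree patterns.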
73 P2P XML processing
74 XML indexing in Xyleme
• History
  – 1999: INRIA research project
  – 2000: Creation of a spin-off
  – 2006: About 25 people
• Technology
  – A scalable XML repository
  – A content warehouse
  – On a cluster of Linux PCs
• XML query processing
  – Twig join
  – Index is distributed
  – Keyword-based vs. document-based
[Figure: postings distributed over a LAN by hashing the key, hash(C) and hash("John"): Put(C; [d, p, 6, 6, 1]), Put("John"; [d, p, 3, 1, 2])]
75 Query processing over a distributed collection
• Query pattern over A, C, D and "John"; one stream of Ids per query node: Ids for A, (p12, d456, 1, 7, 0)…; Ids for C; Ids for D; Ids for "John"
• Ids include peerId and docId
• Ids are sorted in lexicographical order
• Goal is to find "matching Ids" in the collection
76 XML indexing in KadoP
• Use structural Ids
• Publish them via a DHT
  – Distributed Hash Table
  – Peers come and go
  – Locate(k): log(n) messages to find the peer in charge of key k
  – Put(k, v)
  – Get(k): retrieves all the values for k
• We use Pastry
  – We also tried P2PSim and JXTA
[Figure: postings for C and "John" placed in the DHT at hash(C) and hash("John"): put(C; [d, p, 6, 6, 1]), put("John"; [d, p, 3, 1, 2])]
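The Locate/Put/Get interface can be mimicked with a toy, single-process hash ring in Python. This is only a sketch of the interface: it is not Pastry, it does no routing (a real DHT reaches the responsible peer in about log(n) messages), and it ignores peers joining or leaving. The class name `ToyDHT`, the peer names and the hashing choice (SHA-1) are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, List, Tuple

Posting = Tuple[str, str, int, int, int]   # (doc, peer, pre, post, level)


def h(key: str) -> int:
    """Hash a key or a peer name onto the ring."""
    return int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16)


class ToyDHT:
    def __init__(self, peer_names: List[str]) -> None:
        self.ring = sorted((h(name), name) for name in peer_names)
        self.store: Dict[str, Dict[str, List[Posting]]] = {
            name: defaultdict(list) for name in peer_names
        }

    def locate(self, key: str) -> str:
        """Peer in charge of key: first peer at or after hash(key) on the ring."""
        points = [point for point, _ in self.ring]
        i = bisect_left(points, h(key)) % len(self.ring)
        return self.ring[i][1]

    def put(self, key: str, value: Posting) -> None:
        # Values accumulate under a key: put appends, it does not overwrite.
        self.store[self.locate(key)][key].append(value)

    def get(self, key: str) -> List[Posting]:
        """Retrieve all the values published for key."""
        return list(self.store[self.locate(key)].get(key, []))


dht = ToyDHT(["p1", "p2", "p3", "p4"])
dht.put("C", ("d", "p", 6, 6, 1))
dht.put("John", ("d", "p", 3, 1, 2))
dht.put("C", ("d2", "p", 4, 2, 1))   # a second, made-up posting for C
c_postings = dht.get("C")
```

All postings for a given label or word end up at one peer, which is convenient for lookups but is also the origin of the long-posting problem for frequent keys.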
77 XML query processing in KadoP
• Given a tree pattern query Q
• Evaluate an index query indexQ to locate the peers that can provide some answers
  – indexQ is a twig join
• Ship Q to these peers and evaluate it there
• If indexQ is imprecise, many false positives
  – Example: ship Q to all peers (maximal parallelism)
  – Example: instead of structural Ids, just use (label/word, peerId, docId)
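The coarse variant mentioned last, postings of the form (label/word, peerId, docId), can be sketched in Python as a set intersection. The index content, the function names `candidate_docs` and `peers_to_contact`, and the peer and document names are all hypothetical. The result is only a set of candidates: since structure is not checked at this stage, some candidate documents may turn out not to match Q once it is evaluated at the peer.

```python
from typing import Dict, List, Set, Tuple

DocRef = Tuple[str, str]   # (peerId, docId)


def candidate_docs(index: Dict[str, Set[DocRef]], terms: List[str]) -> Set[DocRef]:
    """Documents that contain every label/word of the query (structure ignored)."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    postings.sort(key=len)            # start from the smallest posting
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


def peers_to_contact(candidates: Set[DocRef]) -> List[str]:
    """Peers to which the full query Q should be shipped."""
    return sorted({peer for peer, _ in candidates})


# Assumed coarse index.
index: Dict[str, Set[DocRef]] = {
    "A": {("p1", "d1"), ("p2", "d7")},
    "C": {("p1", "d1"), ("p2", "d7"), ("p3", "d9")},
    "John": {("p1", "d1"), ("p3", "d9")},
}

candidates = candidate_docs(index, ["A", "C", "John"])
targets = peers_to_contact(candidates)
```

Here only p1 needs to receive Q; with structural Ids in the postings, the index query could additionally rule out documents where the labels occur but not in the required shape.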
78 KadoP architecture
• External layer: Web interface, ActiveXML engine, semantic layer
• Logical layer: KadoP peer (publish & query); KadoP engine (indexing, query processing)
• Physical layer: DHT (locate, put, get & delete); index
79 Some technical issues
• Our goal: manage millions of documents with a large number of peers
• First experiments were a disaster
• Replace the index storage of the DHT in a FS by storage in a database (Berkeley DB)
• Extend the API of the DHT to have Append and not only Read/Write
• Extend the API of the DHT to have a streaming exchange of postings
  – Useful because the XML algebra works better with streams
• Now it scales, but there is the issue of long postings
80 The issue of long postings: Google in P2P
• Using keyword distribution
• Query: Ullman & xml?
• Suppose
  – Peer for Ullman is in Europe
  – Peer for XML is in the US
• We have to ship one long posting between the US and Europe
• For a large number of users, we absorb all the bandwidth of the Internet backbone
• Need for replication
• Even for thousands of peers, the exchange of long postings is an issue
[Figure: the query "Ullman & xml?" sent to the DHT peers in charge of Ullman and xml]
81 Intensional indexing in KadoP
• Long posting = bad response time
• Distributed B-tree: the long posting at h(Name) is replaced by calls f, g, h, i, each covering a range of docIds, e.g. f(docId55..docId75)
1. No long posting
2. Get h(Name), then parallel fetch
3. Possibility to optimize further: if the range of f cannot match, there is no need to call f
82 More optimization
• Standard for P2P keyword search
  – Gap compression and adaptive set intersection
• Standard distributed query optimization techniques
  – Ship smallest list
  – Load balancing
  – Caching
  – Replication
• Semi-join techniques, notably Bloom semi-join
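A Bloom semi-join can be sketched in Python as follows. The peer holding the short posting builds a Bloom filter and ships it instead of the posting; the peer holding the long posting returns only the entries that pass the filter; a final exact intersection removes false positives. The filter parameters (1024 bits, 3 hash functions), the SHA-256 based hashing and all names are assumptions for illustration, not KadoP's implementation.

```python
import hashlib
from typing import Iterator, List, Tuple


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                     # a Python int used as a bit array

    def _positions(self, item: str) -> Iterator[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # No false negatives; false positives are possible.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def bloom_semijoin(small: List[str], large: List[str]) -> Tuple[List[str], int]:
    """Intersect two postings while shipping only a filter and the survivors."""
    bloom = BloomFilter()
    for doc_id in small:                  # built at the peer with the short posting
        bloom.add(doc_id)
    # Evaluated at the peer with the long posting; only survivors are shipped back.
    survivors = [doc_id for doc_id in large if bloom.might_contain(doc_id)]
    exact = sorted(set(small) & set(survivors))   # removes false positives
    return exact, len(survivors)


short_posting = ["d3", "d7", "d42"]
long_posting = [f"d{i}" for i in range(1000)]
result, shipped = bloom_semijoin(short_posting, long_posting)
```

The answer is always exact because a Bloom filter never drops a true match; the gain is that the number of shipped entries is usually far smaller than the long posting, at the price of a small filter sent in the other direction.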
84 5. Conclusion
85 Conclusion
• Logic for distributed data management
  – Opinion: XQuery is a language for local XML management
  – Proposal: ActiveXML
• Algebraic foundation of distributed query optimization
  – Proposal: ActiveXML algebra
• P2P (Active) XML indexing
  – KadoP is now being tested and we are working on optimization
• Software
  – ActiveXML is open-source: see activexml.net
  – KadoP soon will be; already available upon request
  – EDOS distribution system as well
86 Lots of related work and related systems
• This is going very fast in system developments
  – Structured P2P nets: Pastry, Chord
  – Content delivery nets: Coral, Akamai
  – XML repositories: Xyleme, DBMonet
  – Multicast systems: Avalanche, Bullet
  – File sharing systems: BitTorrent, Kazaa
  – Pub/Sub systems: Scribe, Hyper
  – Distributed storage systems: OceanStore, GoogleFS
  – Etc.
• Fundamental research is somewhat left behind
87 Issues
• P2P query optimization
• P2P access control
• P2P archiving
• P2P self tuning
• P2P monitoring
• P2P knowledge management [SomeWhere]
• Also: analysis and verification of these systems
  – E.g., termination, error detection, diagnosis
88 Find your own topic
• Pick your favorite problem for data or knowledge management and study it in a P2P setting with gigabytes of data and thousands of peers
• If you find it boring, consider it with terabytes of data and millions of peers
89 Thank you


