
KadoP: a P2P content sharing system
Serge Abiteboul
INRIA-Futurs (Orsay) and University Paris Sud
MDP2P – S. Abiteboul - 2006

Context
MDP2P: project "Masse de Données en P2P" (massive data in P2P)
KadoP: joint work with Ioana Manolescu and Nicoleta Preda, INRIA-Futurs (Orsay) and University Paris Sud (thesis of Nicoleta Preda)
Article in EDBT and demo in Data Engineering

Organization
• Introduction
• The basis: XML, DHT, ActiveXML
• KadoP
• Query processing
• The implementation
• Conclusion

Introduction

Peer-to-peer
A large and varying number of computers cooperate to solve some particular task without any centralized authority.
Goal: build an efficient, robust, scalable system based (typically) on inexpensive, unreliable computers distributed in a wide area network.
Examples
• seti@home: search for extraterrestrial intelligence
• kazaa: obtain free music/video over the net
• cabal: decryption of 512-bit RSA code
• grub: P2P Web search

Data management in P2P
• Publication of resources (XML and knowledge)
• Storage of resources
• Access to resources
• Acquisition/Enrichment/Exploitation
Focus here on query processing: precise answers taking into account the text, the structure and the semantics of XML documents.
[Figure: peers connected through the Internet]

The basis: Standards + ActiveXML + DHT

Standards of distributed data management
Standard for data exchange: XML
• Extensible Markup Language
• Labeled ordered trees
Standard for query languages
• XPath, XQuery
Standards for distributed computing
• Web services: SOAP, WSDL
ActiveXML = XML documents with embedded Web service calls
• Intensional
• Dynamic

ActiveXML = XML + embedded service calls (omitting syntactic details)
Example: <resorts state='Colorado'> Aspen Unisys.com/snow("Aspen") 1 …. Yahoo.com/GetHotels() …
A document may contain calls
• to any SOAP web service
• to any ActiveXML web service
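To make the idea of embedded service calls concrete, here is a minimal Python sketch. It is only an illustration: the `sc` element, its `service` and `arg` attributes, the `result` element and the stubbed `snow` service are all hypothetical names chosen for this example, not the actual ActiveXML syntax, and local Python functions stand in for remote Web services.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <sc> element marks an embedded service call.
DOC = """<resorts state="Colorado">
  <resort><name>Aspen</name><sc service="snow" arg="Aspen"/></resort>
</resorts>"""

# Stub standing in for a remote Web service (assumption, no network access).
SERVICES = {"snow": lambda place: f"snow depth for {place}"}


def materialize(root, services):
    """Replace every <sc> call element by a <result> element holding its answer."""
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == "sc":
                result = ET.Element("result")
                result.text = services[child.get("service")](child.get("arg"))
                result.tail = child.tail
                parent.remove(child)
                parent.insert(index, result)
    return root


ROOT = materialize(ET.fromstring(DOC), SERVICES)
```

Before `materialize` runs, the document is intensional: it says how to obtain the snow information rather than containing it. Afterwards the same position holds extensional data.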

ActiveXML peer
Each ActiveXML peer
• Repository
• Web client
• Web server
[Figure: ActiveXML peers communicating over SOAP]
Open-source in ObjectWeb, see http://ActiveXML.net
Based on standard libraries
• SUN's Java SDK 1.4 (XML parser, XPath processor, XSLT engine)
• Apache Tomcat 4.0 servlet engine, Apache Axis SOAP toolkit 1.0
• X-OQL query processor (soon to be replaced by the eXist XML-db?)

Distributed hash tables
DHT operations: locate(k), put(k, v), get(k), delete(k, v)
• put(k, v1): hash(k) determines the peer Ph(k) where (k, v1) is kept
• get(k) retrieves v1, v2, … from Ph(k)
Management of the overlay network is complex because peers come and go.
We use Pastry. We have tried others: Chord, JXTA.
[Figure: put(k, v1) and put(k, v2) store the entry k: v1, v2 in the DHT; get(k) retrieves it]
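The four operations can be sketched in a few lines of Python. This is a toy, single-process stand-in under stated assumptions: the class name `ToyDHT` is hypothetical, each "peer" is a local dictionary, and the peer is chosen by a hash modulo the number of peers, whereas a real overlay such as Pastry routes messages between machines and copes with peers joining and leaving.

```python
import hashlib
from collections import defaultdict


class ToyDHT:
    """In-memory stand-in for a DHT; every peer is just a local dict."""

    def __init__(self, num_peers):
        self.peers = [defaultdict(list) for _ in range(num_peers)]

    def locate(self, key):
        """Return the index of the peer in charge of `key`."""
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        return int.from_bytes(digest, "big") % len(self.peers)

    def put(self, key, value):
        """Add `value` to the list of values kept for `key`."""
        self.peers[self.locate(key)][key].append(value)

    def get(self, key):
        """Return all values kept for `key` (empty list if none)."""
        return list(self.peers[self.locate(key)].get(key, []))

    def delete(self, key, value):
        """Remove one (key, value) pair if it is present."""
        store = self.peers[self.locate(key)]
        if key in store and value in store[key]:
            store[key].remove(value)
            if not store[key]:
                del store[key]


dht = ToyDHT(num_peers=4)
dht.put("k", "v1")
dht.put("k", "v2")
```

Both values land on the same peer because `locate("k")` is deterministic, which is what lets `get("k")` find them with a single lookup.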

KadoP: a P2P content sharing system

KadoP data items
XML documents and web services
• XML sub-trees, views and collections of documents
• Labels, words and stemming of these words
Types
• DTD and XSD for documents, WSDL for services
Also ActiveXML documents and ActiveXML services
Ontologies
• Concepts, isa, etc.

Goal
Find relevant information to answer a query
May require some extensional information
May require calling some Web services
May require some elaborate query plan including service composition
Simple examples
• Find me Emacs packages that were modified last week
• Find me the packages depending on Emacs in my Linux system

KadoP architecture
[Figure: layered architecture with an external layer, a logical layer and a physical layer. Components shown: Web interface, ActiveXML engine, semantic layer, KadoP peer (publish & query), KadoP engine, indexing, query processing, index, DHT (locate, put, get & delete)]

Architecture
Java/JSP application on each peer (EDOS distribution system)
• KadoP: distributed index
• ActiveXML: data/metadata storage
• IDiP: dissemination management
• BitTorrent: efficient download

P2P XML Query processing

Efficient evaluation of tree-pattern queries
Many optimization techniques exist. We are interested here in distributed query evaluation/optimization.
1) XML indexing
2) Holistic twig join, which is based on indexing
3) P2P indexing
4) P2P query processing
5) Optimizing P2P indexing

XML indexing: structural identifiers
Structural IDs = Prefix-Postfix-Level, written (pre, post, level)
[Figure: example tree with elements A to G and the text node "John", each node labeled with its structural ID, e.g. (1, 8, 0) for the root A]
X ancestor of Y <=> pre(X) < pre(Y) and post(X) ≥ post(Y)
X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1
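The two tests above translate directly into code. The Python sketch below rests on two assumptions: `post` is taken to be the largest `pre` number in the node's subtree (which is what makes the "≥" in the ancestor test meaningful), and the tree shape is made up for illustration, chosen so that the root gets (1, 8, 0). Labels are assumed unique so that they can serve as dictionary keys.

```python
def assign_ids(tree):
    """tree is (label, [children]); returns {label: (pre, post, level)}."""
    ids = {}
    counter = 0

    def visit(node, level):
        nonlocal counter
        label, children = node
        counter += 1
        pre = counter
        for child in children:
            visit(child, level + 1)
        # Assumption: post is the largest pre number inside the subtree.
        ids[label] = (pre, counter, level)

    visit(tree, 0)
    return ids


def is_ancestor(x, y):
    """X ancestor of Y <=> pre(X) < pre(Y) and post(X) >= post(Y)."""
    return x[0] < y[0] and x[1] >= y[1]


def is_parent(x, y):
    """X parent of Y <=> X ancestor of Y and level(X) = level(Y) - 1."""
    return is_ancestor(x, y) and x[2] == y[2] - 1


# Hypothetical tree shape, not necessarily the one drawn on the slide.
TREE = ("A", [("B", [("D", [("G", [])]), ("E", [("John", [])])]),
              ("C", [("F", [])])])
IDS = assign_ids(TREE)
```

The point of the encoding is that ancestor and parent relationships are decided by comparing a few integers, without touching the document itself, so the IDs can be stored in an index and joined later.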

Holistic Twig Join
Input: a document and a tree pattern query
Find the bindings of the query in the document
• Holistic = holistique (the whole and not just the parts)
• Twig = brindille (a small branch)
• Join = jointure (join)
Sounds like Harry Potter?

Query evaluation over a document
[Figure: tree pattern query with nodes A, C, D and "John"; each query node has its own list of IDs: IDs for A ((1, 8, 0)…), IDs for C, IDs for D, IDs for "John"]
IDs are sorted in lexicographical order.
The goal is to find "matching IDs".

The Holistic Twig Join Algorithm
[Figure: query pattern with nodes a, b, c, and the document tree below, one line per level]
level 0: r (1, 25)
level 1: a (2, 8), c (9, 14), a (15, 17), a (18, 25)
level 2: a (3, 5), c (6, 8), b (10, 11), b (12, 14), a (16, 17), b (19, 22), c (23, 25)
level 3: c (4, 5), c (7, 8), c (11, 11), b (13, 14), c (17, 17), b (20, 21), c (22, 22), b (24, 25)
level 4: a (5, 5), a (8, 8), c (14, 14), c (21, 21), c (25, 25)

The Holistic Twig Join Algorithm: stacks
[Figure: input streams a1…a7, b1…b6 and c1…c11, and the stacks Sa, Sb, Sc during evaluation. Legend: head of the stream; end of the stream; node for which the match of the query sub-tree is being sought; ID also present in the stack]
Matches found: (a7, b4, c8), (a7, b5, c8), (a7, b4, c9), (a7, b6, c11)
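A minimal Python sketch of the stack mechanism, restricted to a linear path query such as a//b//c. It is written in the spirit of stack-based holistic joins and is an assumption about how the figure works, not the KadoP implementation: full twig queries with branching need extra machinery that is left out. The function name `path_stack` is hypothetical; the input streams are the (start, end) intervals of the a, b and c nodes of the example document, each sorted by start.

```python
def path_stack(streams):
    """streams[i]: sorted (start, end) intervals for query node i, root first.

    Returns every tuple (n0, ..., nk) in which each node is an ancestor of the next.
    """
    n = len(streams)
    pos = [0] * n                    # read position in each stream
    stacks = [[] for _ in range(n)]  # entries: (node, top index of parent stack)
    results = []

    def expand(level, top, suffix):
        """Combine `suffix` with every ancestor chain reachable from `top`."""
        if level < 0:
            results.append(tuple(suffix))
            return
        for node, parent_top in stacks[level][: top + 1]:
            expand(level - 1, parent_top, [node] + suffix)

    while True:
        live = [i for i in range(n) if pos[i] < len(streams[i])]
        if not live:
            return results
        # Take the stream head with the smallest start (document order).
        q = min(live, key=lambda i: streams[i][pos[i]][0])
        node = streams[q][pos[q]]
        pos[q] += 1
        # Pop entries whose interval ends before this node starts.
        for stack in stacks:
            while stack and stack[-1][0][1] < node[0]:
                stack.pop()
        if q > 0 and not stacks[q - 1]:
            continue                 # no candidate ancestor: node is useless
        parent_top = len(stacks[q - 1]) - 1 if q > 0 else -1
        if q == n - 1:
            expand(n - 2, parent_top, [node])   # leaf: output matches
        else:
            stacks[q].append((node, parent_top))


A = [(2, 8), (3, 5), (5, 5), (8, 8), (15, 17), (16, 17), (18, 25)]
B = [(10, 11), (12, 14), (13, 14), (19, 22), (20, 21), (24, 25)]
C = [(4, 5), (6, 8), (7, 8), (9, 14), (11, 11), (14, 14), (17, 17),
     (21, 21), (22, 22), (23, 25), (25, 25)]
MATCHES = path_stack([A, B, C])
```

On this input the sketch finds four matches, all under a (18, 25), the seventh a node in document order. Numbering the a, b and c nodes by their position in each stream gives (a7, b4, c8), (a7, b5, c8), (a7, b4, c9) and (a7, b6, c11). Each stream is read once, and nodes that cannot take part in any match are never pushed.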

Also: intensional data
Example: include and references
Example: function calls in ActiveXML
Find me the packages depending on Emacs in my Linux system
package (name, author, size, signature, dependsOn(self)…)
The depending packages are intensional
• Naïve: return empty answer
• Brutal: return all documents with a function call
• What we do: use indexing (and typing)

The implementation

Some technical issues
Common belief: this cannot work because of transfer delays
• Indeed, first experiments were a disaster
• DHT did not scale: not designed for so many entries
• Transfers of long posting lists were killing the system
Our target: make it work in some modest setting
• with millions of documents
• with thousands of peers
• with not too volatile peers (not Kazaa or Google Search but industrial applications)

Let's make it work
Some of the early observations of MDP2P and solutions
• Replace the index storage of the DHT in a file system by storage in a database (Berkeley DB)
• Extend the API of the DHT to have Append and not only Read/Write
• Extend the API of the DHT to have a streaming exchange of postings (for long postings); useful because the XML algebra works better with streams
Now KadoP scales but can be optimized
We will see here one optimization technique: DPP
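The two API extensions can be illustrated with a small Python sketch. Everything here is hypothetical (the class `PostingStore`, its method names and the chunk size); it only shows why Append avoids rewriting a whole posting list and how a consumer can start working on a long list before all of it has arrived.

```python
from collections import defaultdict


class PostingStore:
    """Toy local store standing in for the index kept by one peer."""

    def __init__(self):
        self._lists = defaultdict(list)

    def write(self, key, postings):
        """Read/Write style: the whole list is replaced."""
        self._lists[key] = list(postings)

    def read(self, key):
        """Read/Write style: the whole list is returned at once."""
        return list(self._lists.get(key, []))

    def append(self, key, posting):
        """Append style: one posting is added, nothing else is transferred."""
        self._lists[key].append(posting)

    def stream(self, key, chunk_size=2):
        """Streaming style: yield the list chunk by chunk."""
        postings = self._lists.get(key, [])
        for start in range(0, len(postings), chunk_size):
            yield postings[start:start + chunk_size]


store = PostingStore()
for doc_id in (1, 2, 3):
    store.append("name", doc_id)
CHUNKS = list(store.stream("name", chunk_size=2))
```

With only Read/Write, adding one posting means reading the full list, modifying it and writing it back. With `stream`, a join operator can consume the first chunk while later chunks are still in transit.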

Distributed Posting Partitioning (DPP)
Long posting = bad response time
[Figure: distributed B-tree over a long posting list. The entry for h(Name) points to blocks f, g, h, i, each covering a range of document IDs, e.g. f(docId55..docId75)]
1. No long posting
2. get h(name), then fetch the blocks in parallel
3. Possibility to optimize further: if the range of a block, e.g. f(docId55..docId75), cannot match, there is no need to call f
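A minimal Python sketch of the partitioning idea, under stated assumptions: postings are plain document IDs, the block size is fixed, and the function names `partition` and `blocks_to_fetch` and the block keys are hypothetical. In a real deployment the blocks would live on different peers and be fetched in parallel; here they sit in a local dictionary.

```python
def partition(postings, block_size):
    """Split a sorted posting list into blocks.

    Returns (root, blocks): root lists (first_id, last_id, block_key) per block,
    blocks maps each block_key to its postings.
    """
    root = []
    blocks = {}
    for number, start in enumerate(range(0, len(postings), block_size)):
        chunk = postings[start:start + block_size]
        block_key = f"block-{number}"
        blocks[block_key] = chunk
        root.append((chunk[0], chunk[-1], block_key))
    return root, blocks


def blocks_to_fetch(root, low, high):
    """Keep only blocks whose ID range overlaps [low, high]."""
    return [key for first, last, key in root if first <= high and last >= low]


POSTINGS = list(range(50, 100, 5))           # 50, 55, ..., 95
ROOT, BLOCKS = partition(POSTINGS, block_size=4)
NEEDED = blocks_to_fetch(ROOT, low=80, high=100)
```

Here the root entry is small and is the only thing stored under the original key. A query interested in document IDs 80 to 100 skips the first block entirely, which corresponds to the "no need to call f" case.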

Performance

Main issues
Scaling: optimize query processing
• Adapting Bloom filters and other known techniques
• Ongoing in Gemo
Scaling: main tool is replication
• Issues are consistency and overhead
• Ongoing work in MDP2P/Atlas
Dynamicity: better manage peers entering/leaving the system

A KadoP application: data management in Edos
The distribution of a large software system to the peers developing it
Mandriva Linux distribution: 10 000 packages + metadata, shared between up to 1 000 peers
Thousands of packages (about 9000 in Mandriva)
• Package metadata in XML
• And why not: bug reports, annotations, emails, etc.
Goal: distribute & query & monitor & getPackage
Techno: ActiveXML + KadoP + IDiP + BitTorrent

Conclusion
V1 of KadoP and EdosDistribution are running
• Open-source
Management of XML resources in P2P
• Management of semantics and web services as well
• Based on active data (ActiveXML) and DHT (FreePastry)
Novelties in KadoP
• Management of data and knowledge in P2P
• Use of intensional information
• Original optimization techniques
Future work
• ANR platform for content management: WebContent

Thank you