7dd608439baa3da28b3abb0eb78f8195.ppt
- Number of slides: 18
Dynamic XML documents with Distribution and Replication
Authors: Serge Abiteboul, Angela Bonifati, Grégory Cobéna, Ioana Manolescu, Tova Milo
As summarized by: Preethi Vishwanath, San Jose State University, Computer Science
Dynamic XML documents
– Documents in which some data is given explicitly, while other parts are given only intensionally, by means of embedded calls to Web services that can be invoked to generate the required information.
– SOAP and WSDL normalize the way programs can be invoked over the Web, and are becoming the standard means of publishing and accessing dynamic, up-to-date sources of information.
– May be distributed and/or replicated.
Whether dynamic or static, an XML document may be
– Distributed in several parts located at different peers, while maintaining the general unity of the separated pieces.
– Partially or entirely replicated on different peers.
Aspects of distribution due to embedding calls to a Web Service
(1) Accessing remote services: Such a document provides the means to access remote services. This feature is already provided by platforms supporting embedded scripts in HTML/XML documents, e.g., JSP, ASP.NET.
(2) Replicating data fragments with embedded service calls: A call included in a replicated fragment may be activated from the replica's site, following a rather different communication path.
(3) Replicating service definitions: A special form of replication may be achieved by replicating not only data, but also service definitions. This is in the spirit of "code-shipping".
Context of paper and Contributions
Context: dynamic XML documents (XML documents including calls to Web services) that are possibly distributed over several sites, with portions of them possibly replicated.
Contributions
(1) Model. Introduce a simple model for replicating and distributing XML documents over several sites. The model may be used for standard or dynamic documents. In general, users querying distributed/replicated data prefer to ignore data location and expect the system to locate data for them. But it is sometimes desirable to specify which replicas of a given fragment to use (e.g., the one in the local cache, or the most recent one).
(2) Query evaluation and optimization. In the presence of replicas and distribution, many evaluation strategies are possible for a given query, depending on the choice of the replica to use and of the sites performing each elementary computation. Typically, several peers collaborate to evaluate a query; each involved peer has to make choices in order to improve its observable performance, based on a cost metric specific to this peer.
(3) Tailored replication. To improve its observable performance, a peer may be willing to replicate some data, possibly including service calls, and even service definitions, as explained above. Such replication is subject to natural constraints (e.g., storage space).
Data Model & Query Language
Dynamic XML Documents
– May be viewed as labeled trees.
– Tree nodes represent the XML elements/attributes; edges represent the relationships between them.
– Function elements represent calls to Web services. Two kinds of services are distinguished:
   – Opaque services: SOAP-based Web services, treated as black boxes.
   – Declarative Web services: the implementation is known and described in terms of XQuery.
Peers
– Each peer offers some Web services and contains some dynamic XML documents, which may include calls to services provided by the same or other peers.
Distribution
– Documents may include calls to services provided by the same or other peers.
– A higher level of data distribution can be achieved by allowing a document to be distributed over several peers.
– In the tree data model, this means that document nodes may now have external children edges pointing to children nodes on other peers and, analogously, an external parent edge if the parent of the node is on another peer.
Replication of data and services
– The same document fragment exists on several peers.
– All children of the same node with the same ID are considered replicas of a single node.
A dynamic XML Document of the Ski Portal
<document name="SkiPortal">
  <state>
    <state_name> Colorado </state_name>
    <resorts>
      <resort ID="AspResort">
        <name> Aspen </name>
        <snow_cond ID="AspSc"> good
          <fun peer="UnisysWeather" fname="SnowConditions" frequency="every round hour" validity="last">
            <params> <resort> Aspen </resort> </params>
          </fun>
        </snow_cond>
        <hotels ID="AspHotels"> <hotel> … </hotel> </hotels>
      </resort>
      <resort> … </resort>
      …
    </resorts>
  </state>
  <state> … </state>
  …
</document>
Web Services of the Ski Portal
function OperativeSkiResorts($state)
implementation: XQuery
  for $x in document("SkiPortal")/state[state_name=$state]/resorts/resort[snow_cond/value()="good"]
  return $x
function HotelsInfo($state, $resort)
implementation: XQuery
  for $x in document("SkiPortal")/state[state_name=$state]/resorts/resort[name=$resort]/hotels/hotel
  return $x
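
To make the two declarative services concrete, here is a minimal Python sketch of what OperativeSkiResorts and HotelsInfo compute. It is an illustration only, not the paper's implementation: the sample data is simplified (the embedded <fun> call is left out), and the resort "Vail", the hotel names, and the helper `_text` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified SkiPortal data. The <fun> service call is omitted,
# and "Vail", "Hotel A" and "Hotel B" are invented for illustration.
SKI_PORTAL = """
<document name="SkiPortal">
  <state>
    <state_name>Colorado</state_name>
    <resorts>
      <resort ID="AspResort">
        <name>Aspen</name>
        <snow_cond>good</snow_cond>
        <hotels ID="AspHotels"><hotel>Hotel A</hotel></hotels>
      </resort>
      <resort ID="VailResort">
        <name>Vail</name>
        <snow_cond>poor</snow_cond>
        <hotels><hotel>Hotel B</hotel></hotels>
      </resort>
    </resorts>
  </state>
</document>
"""


def _text(elem, tag):
    """Return the stripped text of the first child named tag, or ''."""
    child = elem.find(tag)
    return (child.text or "").strip() if child is not None else ""


def operative_ski_resorts(root, state):
    """Names of resorts in the given state whose snow condition is 'good'."""
    return [
        _text(resort, "name")
        for st in root.findall("state")
        if _text(st, "state_name") == state
        for resort in st.findall("resorts/resort")
        if _text(resort, "snow_cond") == "good"
    ]


def hotels_info(root, state, resort_name):
    """Hotel entries of one resort in the given state."""
    return [
        (hotel.text or "").strip()
        for st in root.findall("state")
        if _text(st, "state_name") == state
        for resort in st.findall("resorts/resort")
        if _text(resort, "name") == resort_name
        for hotel in resort.findall("hotels/hotel")
    ]


root = ET.fromstring(SKI_PORTAL.strip())
print(operative_ski_resorts(root, "Colorado"))
print(hotels_info(root, "Colorado", "Vail"))
```

The sketch returns resort names rather than whole resort elements, which keeps the output easy to read; the XQuery versions return the matching elements themselves.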
Opaque services
– If the two functions were opaque and the resort knew nothing about their internal implementation, there would be essentially two possibilities:
   (1) Call the ski portal each time a service is needed, and have the portal compute the answer and return it.
   (2) Cache the returned result and use it for some time, trading data accuracy for lower communication cost.
Query Frequency
– By analyzing the OperativeSkiResorts query, we can see that its answer may change only every hour, when the SnowConditions function is invoked.
– Hence, to give fully accurate answers to its visitors, the ski center needs to invoke the function only once every hour, and can serve cached data in between.
Replicating relevant data and services
– Assume that the Colorado ski center computer is capable of (1) storing dynamic XML documents, (2) invoking the Web service calls embedded in them, and (3) processing XQuery queries.
– Rather than just caching the current query result, one could then decide to replicate (and maintain) in the ski center computer all the relevant data, and provide a local version of the service queries.
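
The "invoke every hour, cache in between" policy can be sketched as follows. This is a minimal illustration under stated assumptions: the class name `HourlyCache` and the `fetch` callable (standing in for a remote call such as OperativeSkiResorts) are hypothetical, and the current time is passed in explicitly so the behaviour is easy to follow.

```python
from datetime import datetime


class HourlyCache:
    """Cache a service result and refresh it at most once per round hour.

    `fetch` is a hypothetical zero-argument callable standing in for the
    remote service call.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._hour = None    # round hour for which the cached value is valid
        self._value = None

    def get(self, now):
        # Truncate the timestamp to the round hour it belongs to.
        hour = now.replace(minute=0, second=0, microsecond=0)
        if hour != self._hour:
            # A new round hour has started: the answer may have changed.
            self._value = self._fetch()
            self._hour = hour
        return self._value


remote_calls = []


def fake_service():
    # Stand-in for the remote call; records each invocation.
    remote_calls.append(1)
    return ["Aspen"]


cache = HourlyCache(fake_service)
cache.get(datetime(2024, 1, 1, 10, 5))    # first request: remote call
cache.get(datetime(2024, 1, 1, 10, 55))   # same hour: served from cache
cache.get(datetime(2024, 1, 1, 11, 0))    # new round hour: remote call
print(len(remote_calls))
```

Within one round hour the cached answer is exactly as accurate as a fresh call, so in this setting nothing is lost by caching.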
The Colorado dynamic document and services
<document name="ColoradoSkiCenter">
  <resort ID="AspResort">
    <res_name> Aspen </res_name>
    <snow_cond> good
      <fun peer="UnisysWeather" fname="SnowConditions" frequency="every round hour" validity="last">
        <params> <resort> Aspen </resort> </params>
      </fun>
    </snow_cond>
    <hotels ID="AspHotels"> <hotel> … </hotel> </hotels>
  </resort>
  <resort> … </resort>
  …
</document>
function OperativeSkiResorts("Colorado")
implementation: XQuery
  for $x in document("ColoradoSkiCenter")/resort[snow_cond/value()="good"]
  return $x
function HotelsInfo("Colorado", $resort)
implementation: XQuery
  for $x in document("ColoradoSkiCenter")/resort[name=$resort]/hotels/hotel
  return $x
Partial Replication
– Replicate just the resort names and their ski conditions, without the hotels data, and provide access to the hotels data through the ski portal when needed.
– The externalURL sub-element of the hotels element, together with the ID, indicates where the data of this element may be found.
– The external edge is simply viewed as an intensional description of this missing data, and gives the means to obtain it if needed.
The Colorado document with external edges
<document name="ColoradoSkiCenter">
  <resort ID="AspResort">
    <res_name> Aspen </res_name>
    <snow_cond> good
      <fun peer="UnisysWeather" fname="SnowConditions" frequency="every round hour" validity="last">
        <params> <resort> Aspen </resort> </params>
      </fun>
    </snow_cond>
    <hotels ID="AspHotels">
      <externalURL> http://www.ski.com/SkiPortal </externalURL>
    </hotels>
  </resort>
  <resort> … </resort>
</document>
Inverse External Edges
<document name="SkiPortal">
  ...
  <resort ID="AspResort">
    <snow_cond ID="AspSC">
      <LRUlanretxe> http://www.HS.com/ColoradoSkiCenter </LRUlanretxe>
      ...
    </snow_cond>
    <hotels ID="AspHotels">
      <LRUlanretxe> http://www.HS.com/ColoradoSkiCenter </LRUlanretxe>
      ...
    </hotels>
  </resort>
  ...
</document>
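
As a rough illustration of how a peer could discover which of its elements point to remote data, the following Python sketch collects the external edges of a document. The function name `external_edges` is hypothetical and the sample document is a simplified version of the one above, without the <fun> call.

```python
import xml.etree.ElementTree as ET

# Simplified ColoradoSkiCenter document (the <fun> call is omitted).
COLORADO = """
<document name="ColoradoSkiCenter">
  <resort ID="AspResort">
    <res_name>Aspen</res_name>
    <snow_cond>good</snow_cond>
    <hotels ID="AspHotels">
      <externalURL>http://www.ski.com/SkiPortal</externalURL>
    </hotels>
  </resort>
</document>
"""


def external_edges(root):
    """Return (element ID, URL) pairs for every externalURL child found.

    The ID together with the URL tells where the missing data of the
    element can be obtained.
    """
    edges = []
    for elem in root.iter():
        for link in elem.findall("externalURL"):
            edges.append((elem.get("ID"), (link.text or "").strip()))
    return edges


edges = external_edges(ET.fromstring(COLORADO.strip()))
print(edges)
```

The same traversal with the tag LRUlanretxe would list the inverse external edges kept on the SkiPortal side.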
Master-Slave Policy
– Maintaining consistency over replicated objects is difficult.
– Typical solution: have each object owned by a single master, who is in charge of keeping the various copies in sync.
– If the various copies are the children of a single element, then this element is the candidate for being in charge of synchronization.
Example
<document name="SkiPortal">
  <state>
    <state_name> Colorado </state_name>
    <hotels ID="AspHotels" status="stale">
      <externalURL status="master"> http://www.HS.com/ColoradoSkiCenter </externalURL>
      <hotel> ... </hotel>
    </hotels>
  </state>
  ...
</document>
Queries
Each element encountered in the evaluation of a path expression, on a given peer p, may contain some data (residing on that peer), and may also point (via external edges) to some replicas (on different peers).
Which of the element versions should be used?
– Ignore all the external edges and consider only the data residing within the given peer p.
– Use the element's local data and also follow all the given external edges to its replicas, in order to get the maximal available information.
– Intermediate choice: choose some arbitrary copy, i.e., consider the element's local data when available, and follow an external edge otherwise.
– Follow a particular edge.
– Give a preference list.
Example: A Replicated query
  for $x in document("SkiPortal")/state[state_name="Colorado"]/resorts/resort
  replicate $x with
    resort_name//*
    snow_cond//*
    hotels as external link
  at peer "http://www.HS.com/ColoradoSkiCenter"
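
The replica-selection options listed above can be sketched as a small policy function. This is an assumption-laden illustration, not the paper's query syntax: the policy names ("local_only", "all", "any", "preference") and the function `choose_sources` are hypothetical labels for the listed choices.

```python
def choose_sources(local, replicas, policy="local_only", preference=None):
    """Decide which copies of an element to consult.

    local      -- the element's local data, or None if there is none
    replicas   -- list of peer URLs reachable through external edges
    policy     -- hypothetical name of one of the listed choices
    preference -- ordered list of acceptable sources ("local" or a URL)
    """
    local_part = ["local"] if local is not None else []
    if policy == "local_only":
        # Ignore all external edges.
        return local_part
    if policy == "all":
        # Local data plus every replica, for maximal information.
        return local_part + list(replicas)
    if policy == "any":
        # Arbitrary copy: local data when available, else one external edge.
        return local_part or list(replicas[:1])
    if policy == "preference":
        # First available source from the preference list. A single-entry
        # list expresses "follow a particular edge".
        available = local_part + list(replicas)
        for candidate in preference or []:
            if candidate in available:
                return [candidate]
        return []
    raise ValueError(f"unknown policy: {policy}")


peers = ["http://www.HS.com/ColoradoSkiCenter", "http://www.ski.com/SkiPortal"]
print(choose_sources(None, peers, "any"))
print(choose_sources("<hotels/>", peers, "preference",
                     ["http://www.ski.com/SkiPortal", "local"]))
```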
COST MODEL
Configuration
– A set of peers, each containing some data and providing some Web services (opaque or XQuery-based ones).
Workload (for a configuration)
– The system workload consists of the service calls invoked by the dynamic documents in the configuration, as well as of queries/Web service requests posed by users at the various peers.
Unifying user queries and services
– The workload thus consists of the invocations of Web services entailed by the dynamic documents, and of the queries and Web services requested by the user.
Decomposing Queries on Peers
The processing of Q can thus be viewed as decomposed into several "intra-peer" subqueries: each subquery is evaluated on a particular peer, consulting only that peer's local data and communicating with other peers in order to forward some finer subqueries or to send/receive data or computation results.
Cost
Formulas for calculating the data used by a given workload on a set of peers
  $M_{i,j} = \delta_{i,j} \cdot O_j \cdot \min(F_i, F_j)$
  $D = L^{\top} \cdot M \cdot L$
Computation, communication and storage costs incurred by the workload
  $C_j^{GlobComp} = \{Comp \cdot L\}_j \cdot cp_j$
  $C^{GlobReceiv}[s] = D \cdot BW_{IN}$
  $C^{GlobSend}[s] = BW_{OUT}^{\top} \cdot D$
  $C_j^{GlobSpace} = \{Space \cdot L\}_j \cdot sp_j$
where
– $M_{i,j}$ is the volume of data transferred from one query $W_i$ to another query $W_j$;
– $D_{i,j}$ represents the volume of data transferred from peer $P_i$ to peer $P_j$ due to all queries in $W$;
– $C_j^{GlobComp}$, $C^{GlobReceiv}[s]$, $C^{GlobSend}[s]$ and $C_j^{GlobSpace}$ are, respectively, the observable costs of computation, of received data, of sent data, and of space of peer $j$.
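
A small numeric sketch may help read the first two formulas. The slide does not define every symbol, so the following interpretations are assumptions made for this example only: delta[i][j] is 1 when data flows between queries Wi and Wj, O[j] is an output size, F[i] is a frequency, and L[q][p] is 1 when query q runs on peer p. All numbers are invented.

```python
def transfer_matrix(delta, O, F):
    """M[i][j] = delta[i][j] * O[j] * min(F[i], F[j])."""
    n = len(O)
    return [[delta[i][j] * O[j] * min(F[i], F[j]) for j in range(n)]
            for i in range(n)]


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def peer_traffic(M, L):
    """D = transpose(L) * M * L: query-level volumes summed per pair of peers."""
    return matmul(matmul(transpose(L), M), L)


# Three queries W0..W2 on two peers: W0 on P0, W1 and W2 on P1 (assumed).
delta = [[0, 1, 1],
         [0, 0, 0],
         [0, 0, 0]]
O = [10, 4, 2]          # assumed output sizes
F = [3, 5, 1]           # assumed frequencies
L = [[1, 0],
     [0, 1],
     [0, 1]]

M = transfer_matrix(delta, O, F)
D = peer_traffic(M, L)
print(M)   # [[0, 12, 2], [0, 0, 0], [0, 0, 0]]
print(D)   # [[0, 14], [0, 0]]
```

Here M[0][1] = 4 * min(3, 5) = 12 and M[0][2] = 2 * min(3, 1) = 2; since W1 and W2 are both placed on P1, the two volumes add up to D[0][1] = 14.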
Outline of Query Evaluation
– Peer P has to execute a simple path expression Q, with some of the data Q needs in P1 and some in P2.
– P adopts the heuristic of executing as much of Q as possible locally, say Qlocal, obtaining an intermediate result, and delegates one or several further subqueries Qnext to one or several other peers Pnext.
– Each Pnext receives the intermediate results and continues processing by applying the same method: it attempts to evaluate all of Qnext and, if not all data is available, delegates further.
Data Shipping vs Query Shipping
Communication Pattern
– At each step the subquery Qnext includes the address of the peer P on which Q was originally asked, so that the result is returned directly to P, since this requires less communication.
Drawback
– Wrappers decide how much of the query sent by the mediator they solve.
– The mediator has global information about data location, and all wrappers report directly to it.
– Control over execution is distributed. All peers get to know who initiated the query.
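
The "evaluate locally, then delegate Qnext" heuristic can be sketched with a toy in-memory model. Everything here is an assumption for illustration: peers are plain dictionaries, an external edge is encoded as the tuple ("external", peer name, fragment ID), and the remote peer stores the fragment under that ID. The address of the originating peer travels with every delegated subquery so that the result can be returned to it directly.

```python
def evaluate(peers, peer, node, steps, origin, trace):
    """Evaluate path `steps` on `peer`, delegating when an external edge is hit.

    peers  -- dict: peer name -> that peer's local data (nested dicts)
    node   -- the local node from which evaluation starts
    origin -- name of the peer on which the query was originally asked
    trace  -- list collecting the peers that took part, in order
    """
    trace.append(peer)
    for i, label in enumerate(steps):
        node = node[label]
        if isinstance(node, tuple) and node[0] == "external":
            # Local evaluation (Qlocal) stops here; ship the rest (Qnext).
            _, next_peer, fragment_id = node
            q_next = steps[i + 1:]
            return evaluate(peers, next_peer, peers[next_peer][fragment_id],
                            q_next, origin, trace)
    # All steps were evaluated: the result goes straight to the origin.
    return {"deliver_to": origin, "answered_by": peer, "result": node}


# Toy data: P1 holds the resort, the hotels fragment lives on P2.
peers = {
    "P1": {"resort": {"name": "Aspen",
                      "hotels": ("external", "P2", "AspHotels")}},
    "P2": {"AspHotels": {"hotel": "Hotel A"}},
}

trace = []
answer = evaluate(peers, "P1", peers["P1"], ["resort", "hotels", "hotel"],
                  "P1", trace)
print(trace)
print(answer)
```

In this sketch only the remaining path steps are shipped; a fuller version would also ship the intermediate results that the outline mentions.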
Replicating data and services
– For a given configuration and workload, every peer measures its observable performance.
– In order to improve its observable performance, a peer may want to change the configuration; due to peer autonomy, the peer can only modify its own set of data and services.
Possible replication scenarios that a peer P may consider:
– Accessing remote information (do not replicate)
   – When not all the data needed for the query evaluation resides on P, it may need to consult remote data, for instance via external links.
   – If the query frequency is high and the storage cost at the given peer is low, P may prefer to replicate the relevant data and use a local version rather than the remote one.
– Replicating data fragments with or without service calls
   – Scenario 1: P may take the replicated fragment including the service calls embedded in it; thus P will call the service itself.
   – Scenario 2: Alternatively, P may leave (some of) the calls to be executed at the remote peer, and just refer to the data they return via external links.
   – Scenario 2 can be cost effective. For example, if the service provider charges the caller a fee, leaving the call on the remote peer spares P this fee; or, if the call is invoked more frequently than the query that uses its data, its output is transmitted to P at the frequency of the query rather than that of the call invocation, thus entailing less communication.
– Replicating service definitions
   – When the data is replicated together with its embedded calls, we may also want to replicate, for declarative services, the code of the called services as well as the data that they use.
   – Things become more complex when service definitions are replicated. One has to decide:
      – if and how to modify the service code to best fit the needs of P;
      – which data the code uses, and how much of it to replicate;
      – recursively, for which service calls appearing in this replicated data the code (and the data that it uses) should also be replicated.
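
The trade-off between the two scenarios can be illustrated with a back-of-the-envelope comparison. The cost expressions below are a simplification made up for this sketch, not the paper's cost model: it assumes a per-call fee, a uniform communication cost per unit of data, and that in Scenario 2 the remote peer pays for its own calls.

```python
def call_placement_cost(call_freq, query_freq, output_size,
                        fee_per_call, cost_per_unit):
    """Compare the cost seen by peer P for the two scenarios (assumed model).

    Scenario 1 (call at P): P invokes the service itself at the call
    frequency, paying the fee and receiving the output each time.
    Scenario 2 (call at the remote peer): P only fetches the output when
    its own query needs it, i.e. at the query frequency, and pays no fee.
    """
    call_at_p = call_freq * (fee_per_call + output_size * cost_per_unit)
    call_at_remote = query_freq * output_size * cost_per_unit
    return {"call_at_P": call_at_p, "call_at_remote": call_at_remote}


# Invented numbers: the service is called 24 times a day, queried 4 times.
costs = call_placement_cost(call_freq=24, query_freq=4, output_size=2,
                            fee_per_call=0.5, cost_per_unit=1)
print(costs)   # {'call_at_P': 60.0, 'call_at_remote': 8}
```

With these numbers, leaving the call on the remote peer is cheaper for P. The comparison flips when the query is asked much more often than the call is invoked and the fee is low.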
Replication
Algorithm repDecision
Input: configuration conf, service implementation Q
Output: configuration conf1
conf1 ← conf, repData ← ∅
foreach path expression pe over doc in Q
    // pe is of the form l1[c1]/l2[c2]/…/lk
    // evaluate pe by top-down navigation in doc
    foreach step j in the evaluation of pe, j = 1, 2, …, k
        pe, Q1 ← ../lj+1/lj+2/../lk
        if exists {sc | sc child of a node in the current node list,
                        sc is a call to a service sv, whose output type
                        may contain a path lj+1/…/lk}
        then
            repData ← the set of subtrees rooted at the current node list
            conf1 ← conf ∪ repData ∪ Q1
            if cost(conf1) < cost(conf) then
                foreach sv1 call of service in repData
                    conf1 ← repDecision(conf1, def(sv1))
                endfor
                break  // stop here for evaluation of pe
            else nop;
    endfor  // the evaluation of pe is over
    if (empty(repData))  // repData has not yet been assigned
        repData ← the result of pe on doc
        conf1 ← conf ∪ repData
        foreach sv1 call of service in repData
            conf1 ← repDecision(conf1, def(sv1))
        endfor
return conf1