A Hybrid Multicast-Unicast Infrastructure for Efficient Publish-Subscribe in Enterprise Networks
Danny Bickson, Ezra N. Hoch, Nir Naaman and Yoav Tock
IBM Haifa Research Lab, Israel

Outline
• Motivation
• The channelization problem
• Our hybrid approach
• Experimental results
• Conclusions

Motivation: large-scale publish-subscribe applications
• Large number of information flows (topics) and subscribers
• Each flow must be delivered to the subset of subscribers interested in it
• Example: financial market data dissemination
  - The publisher divides the data feed into a large number of information flows (~100K), e.g. stock symbols, futures, commodities
  - Many stand-alone subscribers (~1K)
  - Subscribers display interest heterogeneity: they are interested in different yet overlapping subsets of the topics
  - Any single topic may be delivered to a large number of subscribers (hot/cold topics)
[Figure labels: WAN, Data Vendor, Publisher, Enterprise LAN, Subscribers, Multiple information flows (Topics)]

Common approaches
• Use unicast (point-to-point) connections
  - Limitation: poor utilization of network resources (duplicate transmissions)
• Use broadcast (a single multicast channel)
  - Limitation: receivers must filter out unwanted content
• Use multicast to transmit data
  - Topics are mapped onto multicast groups; each user joins the groups that cover their topic interest
  - Reduces receiver filtering
  - Limitation: a limited number of multicast addresses
    · Network element state problem
    · Receiver resources (NICs)

Our novel contribution
• A hybrid approach that combines multicast and unicast
  - Flexible allocation of transmissions
  - Topics with high interest enjoy the efficiency of multicast
  - Topics with low interest are transmitted over unicast
• Formalized as an optimization problem
  - We propose a two-step alternating method for computing the resource allocation

The Channelization Problem
• n flows
• Flow rates λ
• k multicast groups
• m users
• Interest matrix W
The task: find mapping matrices X and Y that minimize the communication cost
• The cost of transmission: accounts for transmitting a flow to multiple groups
• The cost of reception: minimize excess filtering
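The two cost terms can be made concrete with a small sketch. The slides do not give the exact cost formula, so the definitions below are assumptions: `W[u][i]` is 1 when user `u` wants flow `i`, `X[i][g]` is 1 when flow `i` is sent on group `g`, `Y[u][g]` is 1 when user `u` joins group `g`, transmission cost counts each flow once per group it is mapped to, and reception cost counts only the traffic a user receives but did not ask for.

```python
def channelization_cost(rates, W, X, Y):
    """Return (transmission_cost, filtering_cost) for a multicast-only mapping.

    Assumed conventions (not taken from the slides):
      rates[i]  - rate of flow i
      W[u][i]   - 1 if user u is interested in flow i
      X[i][g]   - 1 if flow i is transmitted on group g
      Y[u][g]   - 1 if user u has joined group g
    """
    n, m, k = len(rates), len(W), len(X[0])
    # A flow mapped to several groups is transmitted once per group.
    tx = sum(rates[i] * sum(X[i]) for i in range(n))
    # A user receives everything on a joined group and discards unwanted flows.
    filt = 0
    for u in range(m):
        for g in range(k):
            if Y[u][g]:
                filt += sum(rates[i] for i in range(n)
                            if X[i][g] and not W[u][i])
    return tx, filt


def is_feasible(W, X, Y):
    """True if every user receives every flow of interest through some joined group."""
    return all(
        any(X[i][g] and Y[u][g] for g in range(len(X[0])))
        for u in range(len(W)) for i in range(len(X)) if W[u][i]
    )


rates = [10, 5, 1]
W = [[1, 1, 0], [0, 1, 1]]          # 2 users, 3 flows
X = [[1, 0], [1, 0], [0, 1]]        # flows 0 and 1 on group 0, flow 2 on group 1
Y = [[1, 0], [1, 1]]                # user 1 must join both groups
print(channelization_cost(rates, W, X, Y), is_feasible(W, X, Y))
```

In this toy instance user 1 joins group 0 only for flow 1 and therefore has to filter out flow 0, which is the kind of excess reception the mapping tries to minimize.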

The Hybrid Channelization Problem
[Figure: flows F1 … Fn, multicast groups G1 … Gk and users U1 … Um, connected by three maps: X (flow-to-group map), Y (user subscription map) and T (unicast transmission map); user interests are obtained by interest extraction (W)]

The Hybrid Channelization Problem
• Modified cost function, with terms for:
  - Cost of multicast reception
  - Cost of multicast transmission
  - Cost of unicast reception and transmission
• Problem objective: [formula not preserved in the extracted text]
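Since the cost formula itself did not survive extraction, the following is only a sketch of how the three named terms might be computed, not the authors' function. It assumes the same matrix conventions as the multicast-only problem (`W`, `X`, `Y`), adds a hypothetical unicast map `T[u][i]` (1 when flow `i` is sent point-to-point to user `u`), counts multicast reception as unwanted traffic only, and charges each unicast pair its flow rate scaled by an assumed `unicast_weight`.

```python
def hybrid_cost(rates, W, X, Y, T, unicast_weight=1.0):
    """Return (multicast_rx, multicast_tx, unicast) under assumed definitions.

    T[u][i] = 1 means flow i is delivered to user u over unicast.
    unicast_weight is a hypothetical knob for pricing unicast traffic.
    """
    n, m, k = len(rates), len(W), len(X[0])
    # Multicast transmission: once per group a flow is mapped to.
    mc_tx = sum(rates[i] * sum(X[i]) for i in range(n))
    # Multicast reception: traffic received on joined groups but not wanted.
    mc_rx = sum(
        rates[i]
        for u in range(m) for g in range(k) if Y[u][g]
        for i in range(n) if X[i][g] and not W[u][i]
    )
    # Unicast: one copy per (user, flow) pair.
    uc = unicast_weight * sum(
        rates[i] for u in range(m) for i in range(n) if T[u][i]
    )
    return mc_rx, mc_tx, uc


rates = [10, 5, 1]
W = [[1, 1, 0], [0, 1, 1]]
X = [[1], [1], [0]]                 # one group carrying flows 0 and 1
Y = [[1], [0]]                      # only user 0 joins it
T = [[0, 0, 0], [0, 1, 1]]          # user 1 is served entirely by unicast
print(hybrid_cost(rates, W, X, Y, T))
```

Here user 1, whose interest overlaps only partly with user 0, is served by unicast, so nobody filters anything and the group carries only traffic its single member wants.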

Proposed Solution
• Unfortunately, the hybrid problem is NP-hard
• We propose a two-step heuristic solution
  - First step: solve the channelization problem (multicast mapping)
  - Second step:
    · Choose flow-user pairs for unicast
    · Remove redundant assignments from the multicast mapping
    · Recalculate the cost
  - Iterate until convergence, or until the unicast BW limit is exceeded
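The "remove redundant assignments" part of the second step can be sketched as follows. This is one plausible reading, not the authors' procedure: once a unicast map `T` (hypothetical name, `T[u][i] = 1` when flow `i` goes to user `u` over unicast) has been chosen, a flow is dropped from a group if no member still needs it over multicast, and a user leaves a group that no longer carries anything they need. Redundancy between two groups carrying the same flow is not handled here.

```python
def prune_multicast(W, X, Y, T):
    """Return pruned copies of X (flow-to-group) and Y (user-to-group).

    Assumed conventions: W[u][i] interest, X[i][g] flow i on group g,
    Y[u][g] user u joined group g, T[u][i] flow i unicast to user u.
    """
    n, k, m = len(X), len(X[0]), len(W)
    X = [row[:] for row in X]
    Y = [row[:] for row in Y]

    def needs(u, i):
        # User u still depends on multicast for flow i.
        return bool(W[u][i]) and not T[u][i]

    # Drop a flow from a group when no member of the group still needs it.
    for i in range(n):
        for g in range(k):
            if X[i][g] and not any(Y[u][g] and needs(u, i) for u in range(m)):
                X[i][g] = 0
    # Drop a user from a group that no longer carries anything they need.
    for u in range(m):
        for g in range(k):
            if Y[u][g] and not any(X[i][g] and needs(u, i) for i in range(n)):
                Y[u][g] = 0
    return X, Y


W = [[1, 1, 0], [0, 1, 1]]
X = [[1, 0], [1, 0], [0, 1]]
Y = [[1, 0], [1, 1]]
T = [[0, 0, 0], [0, 1, 1]]          # user 1 now gets flows 1 and 2 by unicast
print(prune_multicast(W, X, Y, T))
```

After pruning, the cost is recalculated on the smaller mapping, which is what lets the alternating procedure detect whether the unicast choice actually helped.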

First step: channelization problem solution
• We experimented with several algorithms [list not preserved in the extracted text]
• K-Means (2005) performs best

K-Means Mapping Algorithm
• Input
  - Interest matrix, topic rate vector
• Basic insight
  - Put “similar” topics in the same group
  - “Similar” topics have a similar audience
  - A group with a homogeneous audience causes less filtering
• Take the rate into account
• Iterative clustering algorithm (K-means)
  - Init: topics are assigned to a fixed number of groups
  - Move: in each step, remove a single topic and move it to the best group, the one producing the lowest cost
  - Cost: after each epoch, compute the total filtering cost
  - Stop: the cost does not improve, the time limit has elapsed, or the maximum number of iterations is reached
[Figure: interest matrix (topics × users; v = interested, x = not interested) with a user's interest vector and a topic's audience vector marked; rate vector R1, R2, …, RK; topics T1 … T9 being assigned to groups]
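The Init / Move / Cost / Stop loop above can be sketched in a few lines. This is a minimal illustration rather than the K-Means (2005) algorithm itself: the round-robin initialization, the `audience` representation (one set of user ids per topic) and the filtering-cost model (every user who wants any topic in a group receives all of the group's topics) are assumptions, and the time-limit stop condition is omitted.

```python
def group_cost(topics, rates, audience):
    """Filtering cost of one group: members discard the topics they did not ask for."""
    members = set().union(*(audience[t] for t in topics))
    return sum(rates[t] for u in members for t in topics if u not in audience[t])


def total_cost(groups, rates, audience):
    return sum(group_cost(g, rates, audience) for g in groups)


def cluster_topics(rates, audience, k, max_epochs=50):
    n = len(rates)
    # Init: assign topics to a fixed number of groups (round-robin, an assumption).
    groups = [set(range(g, n, k)) for g in range(k)]
    best = total_cost(groups, rates, audience)
    for _ in range(max_epochs):
        for t in range(n):
            # Move: remove the topic, then put it where it adds the least cost.
            for g in groups:
                g.discard(t)
            target = min(
                groups,
                key=lambda g: group_cost(g | {t}, rates, audience)
                - group_cost(g, rates, audience),
            )
            target.add(t)
        # Cost: evaluate the total filtering cost after each epoch.
        cost = total_cost(groups, rates, audience)
        # Stop: no improvement (max_epochs bounds the iteration count).
        if cost >= best:
            break
        best = cost
    return groups, best


# Topics 0 and 1 share one audience, topics 2 and 3 share another.
audience = [{0, 1}, {0, 1}, {2, 3}, {2, 3}]
groups, cost = cluster_topics([1, 1, 1, 1], audience, k=2)
print(groups, cost)
```

Because a topic's previous group is always among the candidates, a move never increases the total cost, so the per-epoch cost is non-increasing and the loop terminates.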

Second step: choosing user-flow pairs for unicast
• We experimented with several heuristics
  - Heavy users: all transmissions to a specific heavy user are sent using unicast
  - Lightweight flows: flows with low bandwidth are sent using unicast
  - Greedy flows: move to unicast the flow that most reduces the total cost
  - Greedy users: move to unicast the user that most reduces the total cost
  - An additional heuristic, greedy user-flow pairs: move to unicast the user-flow pair that most reduces the total cost (very slow, impractical run time)
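As an illustration of the greedy-flows heuristic, the sketch below repeatedly moves to unicast the single flow whose move lowers the total cost the most, subject to a unicast bandwidth limit. The cost model is a simplification assumed for this example (multicast transmission plus filtering plus one unicast copy per interested user), and `groups`, `audience` and `unicast_limit` are illustrative names rather than the authors' notation.

```python
def hybrid_total(groups, unicast, rates, audience):
    """Simplified total cost (an assumption, not the exact function from the talk)."""
    cost = 0
    for topics in groups:
        live = [t for t in topics if t not in unicast]
        members = set().union(*(audience[t] for t in live))
        cost += sum(rates[t] for t in live)                  # multicast transmission
        cost += sum(rates[t] for u in members for t in live  # multicast filtering
                    if u not in audience[t])
    cost += sum(rates[t] * len(audience[t]) for t in unicast)  # one copy per user
    return cost


def greedy_flows(groups, rates, audience, unicast_limit):
    """Return (set of flows moved to unicast, resulting cost)."""
    unicast, used = set(), 0
    current = hybrid_total(groups, unicast, rates, audience)
    while True:
        best_t, best_cost = None, current
        for t in range(len(rates)):
            bw = rates[t] * len(audience[t])
            if t in unicast or used + bw > unicast_limit:
                continue  # already moved, or would exceed the unicast BW limit
            c = hybrid_total(groups, unicast | {t}, rates, audience)
            if c < best_cost:
                best_t, best_cost = t, c
        if best_t is None:          # no move improves the cost: converged
            return unicast, current
        unicast.add(best_t)
        used += rates[best_t] * len(audience[best_t])
        current = best_cost


# Two hot flows share an audience; one cold flow has a single, different subscriber.
audience = [{0, 1, 2, 3}, {0, 1, 2, 3}, {4}]
print(greedy_flows([{0, 1, 2}], [10, 10, 1], audience, unicast_limit=5))
```

In the example the cold flow is the one moved to unicast, which matches the intuition from the contribution slide: low-interest topics go to unicast, high-interest topics stay on multicast. A greedy-users variant would iterate over users instead of flows with the same structure.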

Experimental results
• Construction of the user-interest matrix W
  - Random, uniform
  - Market distribution: based on a model of NYSE stock volume
  - IBM WebSphere cell: a real system

Channelization algorithms
• K-Means (2005) performs best
• Takes rate into account
• Gradient descent on the true cost function

Effect of the interest matrix on channelization performance
• The interest and rate have a significant effect on channelization performance
• Some interests have patterns that are easy to “channelize”
• Interests with less entropy (more order) are easier

Hybrid Algorithm Heuristics
• Unicast BW limit: the algorithm uses the optimal amount, up to the limit
• Market distribution, greedy users: can use more unicast BW
• WebSphere distribution, greedy flows: does not need more than 20% unicast BW

Hybrid using greedy flow: unicast/multicast tradeoff
• Every interest and rate distribution has an optimal amount of unicast BW it can use
• The hybrid approach improves upon both unicast-only and multicast-only
• Unicast BW allocation: the exact amount of unicast BW used

Conclusions
• We have presented a novel hybrid approach for publish-subscribe
• We have shown, using extensive and realistic simulations, that our approach reduces consumed network and host resources
• K-Means (2005) performs best for channelization among the algorithms we tested
• Greedy hybrid heuristics performed best in our tests
• The relative competitiveness of the greedy-flows and greedy-users heuristics depends on the structure of the interest matrix and the rates
~ The End ~

Real-Life Messaging Load Model
(Backup: Model)
• Model based on statistical analysis of NYSE daily trade data
• 20K topics
• 500 subscribers
• Avg. ~70 flows/user
• Min 15 flows/user
• Max 115 flows/user
• Avg. message fan-out ~10.1 clients
• Multicast: each message is transmitted once
• Unicast: the transmitter data rate is 10× that of multicast!
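The link between the fan-out figure and the 10× unicast rate can be checked with a short calculation. The sketch assumes that "average message fan-out" means the rate-weighted average audience size, so that frequently updated topics count more than quiet ones; the slide does not define the term, and the numbers below are made up for illustration.

```python
def message_fan_out(rates, audience_sizes):
    """Rate-weighted average number of receivers per message (assumed definition)."""
    return sum(r * a for r, a in zip(rates, audience_sizes)) / sum(rates)


def unicast_to_multicast_ratio(rates, audience_sizes):
    """Transmitter load with one copy per receiver, relative to one copy per message."""
    unicast_rate = sum(r * a for r, a in zip(rates, audience_sizes))
    multicast_rate = sum(rates)
    return unicast_rate / multicast_rate


# Toy numbers: a hot topic (rate 9, 11 subscribers) and a cold one (rate 1, 2 subscribers).
rates, sizes = [9, 1], [11, 2]
print(message_fan_out(rates, sizes))                # rate-weighted fan-out
print(sum(sizes) / len(sizes))                      # plain per-topic average
print(unicast_to_multicast_ratio(rates, sizes))
```

Under this definition the two functions compute the same quantity, which is why a fan-out of about 10 translates directly into a unicast transmitter rate about 10 times the multicast rate. The plain per-topic average is lower in the toy case because most messages belong to the hot topic.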

Messaging Load Model: Based on Market Research
(Backup: Model)
• Financial front office
  - Hundreds of users requiring stock quotes and financial information from several markets
• Topic space structure
  - Within each market, symbol popularity and rate are exponentially distributed (NYSE market research)
  - Several different markets, with average popularity and size proportional to ~1/m (assumption)
  - 20K flows, 10 markets, 500 users
• User interest
  - Each user selects some markets, then selects a percentage of the symbols from each chosen market, according to the distributions above
[Figure annotation: ~10% of symbols, ~55% of trade]
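A scaled-down generator in the spirit of this model might look like the sketch below. Only the overall shape comes from the slide (markets whose size and popularity fall off as 1/m, exponentially decaying symbol popularity and rate within a market, users who pick markets and then symbols). The decay constant, the use of popularity as the rate, the market-selection probability and the function name `build_load_model` are all assumptions made for illustration.

```python
import math
import random


def build_load_model(num_flows=200, num_markets=10, num_users=50, seed=0):
    """Return (rates, interest): a rate per flow and a set of wanted flows per user."""
    rng = random.Random(seed)
    # Market m (1-based) has weight 1/m, used for both its size and its popularity.
    weights = [1 / m for m in range(1, num_markets + 1)]
    total = sum(weights)
    rates, popularity, markets = [], [], []
    for w in weights:
        size = max(1, round(num_flows * w / total))
        first = len(rates)
        markets.append(range(first, first + size))
        for rank in range(size):
            # Exponential decay with rank inside the market (constant 5 is assumed).
            p = math.exp(-5.0 * rank / size)
            popularity.append(p)
            rates.append(p)  # assumption: rate follows the same curve as popularity
    interest = []
    for _ in range(num_users):
        wanted = set()
        for w, flows in zip(weights, markets):
            if rng.random() < w:  # pick the market with probability 1/m
                wanted.update(t for t in flows if rng.random() < popularity[t])
        interest.append(wanted)
    return rates, interest


rates, interest = build_load_model(seed=1)
print(len(rates), sum(len(s) for s in interest) / len(interest))
```

The output of such a generator can be fed to a channelization routine as the interest matrix and rate vector; matching the slide's statistics (for example ~70 flows per user) would require tuning the assumed constants.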

Interest Matrix Mapping Algorithm
(Backup: Algorithm)
• Input
  - Interest matrix, topic rate vector
• Basic insight
  - Put “similar” topics in the same group
  - “Similar” topics have a similar audience
  - A group with a homogeneous audience causes less filtering
• Take the rate into account
  - The cost of putting two topics in the same group
  - The cost of adding a new topic to a group of topics
[Figure: interest matrix (topics × users; v = interested, x = not interested); filtering-cost examples for topics with identical audiences and for topics with similar audiences, with recovered entries R2, 0 and R1+R2, where Rk is the rate of topic k]

Iterative Clustering Algorithm (K-means)
(Backup: Algorithm)
• Init: topics are assigned to a fixed number of groups
• Move: in each step, remove a single topic and move it to the best group, the one producing the lowest cost
• Cost: after each epoch, compute the total filtering cost
• Stop: the time limit has elapsed, the cost does not improve, or the maximum number of iterations is exceeded
• The best group for topic K is the group with the lowest cost
[Figure: the cost of adding candidate topic 5 to topic group {1, 2, 3}, computed from the group audience vector and the candidate's audience; recovered cost entries are 0, 0, 0, R5, R1+R2+R3, 0, followed by R1+R2+R3+R5, which equals their sum; topics T1 … T9 being assigned to groups]
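The per-user cost of adding a candidate topic to a group can be sketched with audience vectors. The rule used here is an assumption consistent with the recovered entries: a user in both audiences pays nothing extra, a user only in the group's audience must filter the new topic (its rate), and a user only in the candidate's audience must join the group and filter its existing traffic (the group's total rate). The six-user audiences below are invented so that the result matches the listed pattern 0, 0, 0, R5, R1+R2+R3, 0.

```python
def cost_of_adding(group_rate, group_audience, topic_rate, topic_audience, users):
    """Return (per-user extra filtering cost, total) under the assumed rule."""
    per_user = []
    for u in users:
        in_group = u in group_audience
        in_topic = u in topic_audience
        if in_group and not in_topic:
            per_user.append(topic_rate)   # receives the new, unwanted topic
        elif in_topic and not in_group:
            per_user.append(group_rate)   # joins the group only for the new topic
        else:
            per_user.append(0)            # wants both, or is unaffected
    return per_user, sum(per_user)


# Illustrative rates: R1=3, R2=2, R3=1 for the group {1, 2, 3}, and R5=4.
R1, R2, R3, R5 = 3, 2, 1, 4
per_user, total = cost_of_adding(
    group_rate=R1 + R2 + R3,
    group_audience={0, 1, 2, 3},
    topic_rate=R5,
    topic_audience={0, 1, 2, 4},
    users=range(6),
)
print(per_user, total)
```

Evaluating this quantity for every group and picking the smallest total is the "Move" step: the candidate topic goes to the group where it causes the least additional filtering.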