- Number of slides: 16
YZStack: Provisioning Customizable Solution for Big Data
Sai Wu, Chun Chen, Gang Chen, Lidan Shou, Ke Chen
Zhejiang University
Hui Cao, He Bai
yzBigData Co., Ltd., City Cloud Technology
3H Problems in Deploying a Big Data System
• How can I build and deploy a big data system without background knowledge?
• How can I migrate existing applications to the big data system?
• How can I use my big data system to do analysis jobs?
Too Many Choices
• Virtualization:
  – OpenStack
  – CloudStack
  – VMware
• Cloud storage:
  – key-value store (HBase, Cassandra, Redis, …)
  – relational service (AWS, Spanner, …)
• Processing engine:
  – MapReduce/Hadoop
  – Dryad
  – Pregel, GraphLab
  – Spark
  – epiC
• Application service:
  – Mahout
  – Hive
  – SpatialHadoop
Can I Deploy a Big Data System Like Installing Windows Software?
• Treat the installation as a customization process
• The installer copies the binaries to all servers and configures them automatically
• A browser-based management system starts and stops the services and monitors their status
YZStack: the Architecture
• Layers are loosely connected
• Each layer includes many selectable modules
• Modules of different layers are linked via common interfaces
• Optimizations are implemented as special plugins
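The layered design above can be pictured as picking one module per layer and attaching optional plugins. The following is only a minimal sketch: the layer names, the catalog (borrowed from the "Too Many Choices" slide), and the `Stack`/`Module` classes are hypothetical illustrations, not YZStack's actual interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Module:
    """A selectable module that belongs to exactly one layer."""
    name: str
    layer: str

    def describe(self) -> str:
        return f"{self.layer}:{self.name}"


@dataclass
class Stack:
    """Hypothetical stack: one selected module per layer plus optional plugins."""
    layers: List[str]
    catalog: Dict[str, List[str]]
    selected: Dict[str, Module] = field(default_factory=dict)
    plugins: List[str] = field(default_factory=list)

    def select(self, layer: str, name: str) -> None:
        # Only modules listed in the catalog for this layer can be chosen.
        if name not in self.catalog.get(layer, []):
            raise ValueError(f"{name!r} is not available in layer {layer!r}")
        self.selected[layer] = Module(name, layer)

    def add_plugin(self, name: str) -> None:
        self.plugins.append(name)

    def plan(self) -> List[str]:
        # Every layer needs a choice before a deployment plan can be produced.
        missing = [l for l in self.layers if l not in self.selected]
        if missing:
            raise ValueError(f"no module selected for: {missing}")
        chosen = [self.selected[l].describe() for l in self.layers]
        return chosen + [f"plugin:{p}" for p in self.plugins]


# Assumed layers and catalog, for illustration only.
LAYERS = ["virtualization", "storage", "processing", "application"]
CATALOG = {
    "virtualization": ["OpenStack", "CloudStack", "VMware"],
    "storage": ["HBase", "Cassandra", "Redis"],
    "processing": ["Hadoop", "Spark", "epiC"],
    "application": ["Mahout", "Hive", "SpatialHadoop"],
}

stack = Stack(LAYERS, CATALOG)
stack.select("virtualization", "OpenStack")
stack.select("storage", "HBase")
stack.select("processing", "Hadoop")
stack.select("application", "Hive")
stack.add_plugin("index")
print(stack.plan())
```

Keeping the layers independent of one another means a module in one layer can be swapped without touching the selections in the others, which is the point of the loose coupling on the slide.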
Features of YZStack
• Adaptive Image
  – Based on OpenStack, partition the big image into small chunks
  – Different images share the same chunk
• Optimization Plugins
  – Column-oriented plugin
  – Index plugin
  – Query optimization plugin
  – Iterative job plugin
• Visualization Tool
  – Zoom in/out for different dimensions
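The chunk sharing behind the Adaptive Image feature can be illustrated with a small content-addressed store. This is a sketch under assumptions the slides do not state: fixed-size chunks, SHA-256 chunk identifiers, and the hypothetical `ChunkStore` class are all illustrative choices, not the mechanism YZStack or OpenStack actually uses.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo is easy to follow


class ChunkStore:
    """Hypothetical store in which images are lists of shared chunk ids."""

    def __init__(self, chunk_size: int = CHUNK_SIZE) -> None:
        self.chunk_size = chunk_size
        self.chunks: Dict[str, bytes] = {}      # chunk id -> chunk bytes
        self.images: Dict[str, List[str]] = {}  # image name -> ordered chunk ids

    def add_image(self, name: str, data: bytes) -> None:
        ids = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            cid = hashlib.sha256(chunk).hexdigest()
            # Identical chunks hash to the same id and are stored only once.
            self.chunks.setdefault(cid, chunk)
            ids.append(cid)
        self.images[name] = ids

    def read_image(self, name: str) -> bytes:
        return b"".join(self.chunks[cid] for cid in self.images[name])


store = ChunkStore()
hadoop_image = b"BASEOS__HADOOP__"  # made-up image contents
spark_image = b"BASEOS__SPARK___"   # shares the first two chunks with the above
store.add_image("hadoop", hadoop_image)
store.add_image("spark", spark_image)

referenced = sum(len(ids) for ids in store.images.values())
print(f"{referenced} chunk references, {len(store.chunks)} chunks stored")
```

In this toy run the two images reference eight chunks in total but only six are stored, because the common base prefix is kept once.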
Use Case: the Smart Financial System • Built for the Zhejiang Provincial Department of Finance (ZPDF)
Economic Prediction
• In collaboration with researchers from the College of Economics, Zhejiang University
• Step 1:
  – Use the OLAP module to provide a basic view of each registered company
Economic Prediction (cont.)
• Step 2:
  – Healthy Model: Based on the historical data, the healthy model discovers risks and predicts the prospects of an industry
  – Energy Consumption Model: We link the financial data with the electricity, water, and environmental data to rank each industry by its energy consumption per unit of output value
  – Economic Impact Model: By connecting the financial data to the human resource data, we study how many workers an industry employs and their average salary
  – Combine all three models to rank all industries accordingly
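As a concrete reading of the Energy Consumption Model, the ranking amounts to dividing each industry's energy use by its output value and sorting. The sketch below is only illustrative: the industry names and figures are made up, the function name is hypothetical, and ranking the lowest intensity first is an assumption the slides do not confirm.

```python
from typing import Dict, List, Tuple


def rank_by_energy_intensity(industries: Dict[str, Tuple[float, float]]) -> List[str]:
    """Sort industries by energy consumed per unit of output value, lowest first.

    `industries` maps a name to (energy_consumed, output_value); industries
    with a non-positive output value are skipped because the ratio is undefined.
    """
    intensity = {
        name: energy / output
        for name, (energy, output) in industries.items()
        if output > 0
    }
    return sorted(intensity, key=intensity.get)


# Made-up figures for illustration only (no real data from ZPDF).
sample = {
    "textiles": (500.0, 100.0),   # intensity 5.0
    "software": (50.0, 200.0),    # intensity 0.25
    "chemicals": (900.0, 150.0),  # intensity 6.0
}
ranking = rank_by_energy_intensity(sample)
print(ranking)  # ['software', 'textiles', 'chemicals']
```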
Economic Prediction (cont.)
• Step 3: Economic Index (ongoing work)
  – Predict the status of the whole Zhejiang Province using the statistics generated by the previous two steps
  – Involves multiple complex economic models
  – Our economic researchers are using the visualization tools to build and study their models
Detection of Improper Payment
• What is an improper payment?
  – A person is classified as low-income and buys a house reserved for low-and-medium wage earners, but is actually employed by an IT company
  – A company may submit different registration files to different government departments (e.g., it registers as a high-tech company with the Department of Science, but as a labor-intensive one with the Department of Labor) to collect multiple government allowances
Why ZPDF?
• A hub of financial data in Zhejiang Province
  – Electricity department
  – Traffic department
  – Tax department
  – …
• It is well motivated
  – Expected to save more than 1 billion CNY
Improper Payment
• Step 1 (Consistency Problem):
  – To detect improper payments across two databases, D0 and D1,
  – we first generate two star-join queries, Q0 and Q1, which selectively merge the fact tables with the dimension tables.
  – The trick is that the entities returned by Q0 should not exist in the results of Q1, so any entity that appears in both results is a candidate improper payment.
  – E.g., Q0 returns the high-income persons, while Q1 returns the users who own a house reserved for low-and-medium wage earners.
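The Q0/Q1 idea can be sketched with two small star joins whose result sets are then intersected. Everything concrete here is an assumption for illustration: the table and column names, the income cutoff, the sample rows, and above all the shared identifier `nid`, which lets the two results be compared directly (when no such key exists, the tuples must first be matched, as on the next slide).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- D0 (hypothetical tax database): one fact table, two dimension tables
CREATE TABLE d0_person   (pid INTEGER PRIMARY KEY, nid TEXT, name TEXT);
CREATE TABLE d0_employer (eid INTEGER PRIMARY KEY, sector TEXT);
CREATE TABLE d0_income   (pid INTEGER, eid INTEGER, annual_income REAL);
-- D1 (hypothetical housing database): one fact table, two dimension tables
CREATE TABLE d1_owner    (oid INTEGER PRIMARY KEY, nid TEXT);
CREATE TABLE d1_program  (gid INTEGER PRIMARY KEY, kind TEXT);
CREATE TABLE d1_house    (oid INTEGER, gid INTEGER);

INSERT INTO d0_person   VALUES (1, 'A01', 'Li'), (2, 'A02', 'Wang'), (3, 'A03', 'Zhao');
INSERT INTO d0_employer VALUES (10, 'IT'), (11, 'Retail');
INSERT INTO d0_income   VALUES (1, 10, 300000), (2, 11, 40000), (3, 10, 250000);
INSERT INTO d1_owner    VALUES (1, 'A01'), (2, 'A02');
INSERT INTO d1_program  VALUES (100, 'low_medium_income'), (101, 'commercial');
INSERT INTO d1_house    VALUES (1, 100), (2, 100);
""")

# Q0: star join over D0 returning high-income persons (cutoff is made up).
Q0 = """
SELECT p.nid
FROM d0_income f
JOIN d0_person p   ON f.pid = p.pid
JOIN d0_employer e ON f.eid = e.eid
WHERE f.annual_income > 200000
"""

# Q1: star join over D1 returning owners of houses in the subsidized program.
Q1 = """
SELECT o.nid
FROM d1_house f
JOIN d1_owner o   ON f.oid = o.oid
JOIN d1_program g ON f.gid = g.gid
WHERE g.kind = 'low_medium_income'
"""

high_income = {row[0] for row in conn.execute(Q0)}
subsidized = {row[0] for row in conn.execute(Q1)}

# The two result sets should be disjoint; any overlap is a candidate.
suspects = sorted(high_income & subsidized)
print(suspects)  # ['A01']
```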
Consistency Problem
• We apply LSH (Locality Sensitive Hashing) to generate k hash values for each tuple from T0 and T1.
• Tuples sharing the same hash value are treated as a candidate group.
• We define a similarity function sim(ti, tj) to estimate the probability that two tuples represent the same entity. If sim(ti, tj) is greater than a predefined threshold, the pair is forwarded to the verification module, where a human-aided algorithm filters out the false positives.
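The candidate-generation and filtering steps above might look like the following sketch. The slides do not name the LSH family or the similarity function, so MinHash over character 3-grams, Jaccard similarity as sim, the values k = 8 and threshold = 0.5, and all function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the lower-cased, whitespace-normalized text."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def minhash(tokens: Set[str], k: int) -> List[int]:
    """k MinHash values; each seed acts as a different hash function."""
    return [
        min(int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for seed in range(k)
    ]


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def candidate_pairs(t0: Dict[str, str], t1: Dict[str, str], k: int = 8) -> Set[Tuple[str, str]]:
    """Pairs (key in T0, key in T1) that share at least one of the k hash values."""
    buckets = defaultdict(lambda: (set(), set()))
    for side, table in enumerate((t0, t1)):
        for key, text in table.items():
            for i, h in enumerate(minhash(shingles(text), k)):
                buckets[(i, h)][side].add(key)
    pairs = set()
    for left, right in buckets.values():
        pairs.update((a, b) for a in left for b in right)
    return pairs


def matches(t0, t1, k: int = 8, threshold: float = 0.5):
    """Candidate pairs whose similarity exceeds the threshold.

    In the pipeline on the slide these pairs would go on to human-aided
    verification; here they are simply returned.
    """
    out = []
    for a, b in sorted(candidate_pairs(t0, t1, k)):
        s = jaccard(shingles(t0[a]), shingles(t1[b]))
        if s > threshold:
            out.append((a, b, round(s, 2)))
    return out


# Made-up tuples: "x" differs from "a" only in case and spacing.
T0 = {"a": "Hangzhou Tech Co"}
T1 = {"x": "hangzhou  tech co", "y": "Qwerty Zxcvb"}
print(matches(T0, T1))  # [('a', 'x', 1.0)]
```

Hashing first keeps the expensive similarity check away from most of the |T0| × |T1| pairs: only tuples that collide on at least one hash value are compared.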
Conclusion
• YZStack is tailored for users who have little or no experience in deploying and maintaining a cloud system.
• It reduces the development of a new big data application to a process of module selection and customization.
• To show the flexibility and usability of YZStack, we demonstrate how we built a smart financial system for the Zhejiang Provincial Department of Finance using YZStack.