Big Data Processing in Social Networks
Ming-Syan Chen (陳銘憲)
Research Center for Information Technology Innovation, Academia Sinica
September 2, 2014

A Few Words before the Talk
• Big Data is one of the most popular topics worldwide these days
  ◦ The number of attendees at KDD doubled this year
• Talk materials are from (1) my prior talks (keynotes/invited talks at PAKDD 14, WAIM 13, KDD 12), and (2) my recent research work, so they are probably subjective
M.-S. Chen

Outline
• Walkthrough on Big Data
• Information Extraction from a Social Network Graph
• Issues to Address

The Era of Big Data is Coming
• From "the whole world crazy about the cloud" to "the era of big data"!
• Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization (Gartner)
• Rapidly accumulating, large volumes of heterogeneous data
  ◦ With unclear veracity
  ◦ Source of intelligence (value)

Source from Intel: What Happens In An Internet Minute
http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html

Big data happens in every minute
• 639,800 GB of global IP data transferred
• 204 million emails sent
• Flickr: 3,000 photos uploaded; 20 million photo views
• YouTube: 30 hours of video uploaded; 1.3 million video views
• LinkedIn: 100+ new accounts
• Twitter: 320+ new Twitter accounts; 100,000 new tweets
• Facebook: 6 million views; 277,700 logins
• Google: 2+ million search queries

Data Amount Fueled by SN Activities
• Twitter (from twitter.com)
  ◦ 150+ million members
  ◦ 50 million tweets per day
• Facebook
  ◦ One billion users
• Amazon co-purchasing network (from SNSP)
  ◦ Half a million product nodes
  ◦ Several million recommendation links
• Web pages (Yahoo!)
  ◦ Over one billion web pages

Example of Big Data and Social Network
• Volume: thousands of people!
• Velocity: accumulating fast!!
• Variety: eating different food!!!

Example of Big Data and Social Network
• For some gossip on this occasion, Veracity is an issue and the information Value could be low.
  ◦ "Mr. Lin won the lottery!"
  ◦ "Mrs. Chang just had a facelift!"

Some Views on Big Data
• Big data white paper: "Challenges and Opportunities with Big Data"
  ◦ By researchers in major universities and IT companies in the US
  ◦ http://www.cra.org/ccc/files/docs/init/bigdatawhitepaper.pdf
• McKinsey: "Big data: The next frontier for innovation, competition, and productivity"
  ◦ http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
• NYTimes: "The age of Big Data" (potential use and cost)
  ◦ http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all&_r=0

Views on Big Data (cont'd)
• IBM (platform, technology and applications)
  ◦ http://www-01.ibm.com/software/data/bigdata/
• Microsoft: "Perspective from the fourth paradigm for scientific discovery"
  ◦ http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_complete_lr.pdf
• VMware (platform and system architecture)
  ◦ http://blogs.vmware.com/vfabric/2012/08/4-key-architecture-considerations-for-big-data-analytics.html
• More (from SAS, Intel, Oracle, etc., online)

So, is the Notion of Big Data New?
• Depends on whom you ask
  ◦ In fact, when more funds become available for big data issues, people jump out to claim to be big data people
  ◦ One term, with everyone giving their own interpretation
• If we read the Big Data white paper from the US, its scope is quite close to that of data mining
  ◦ Of course, this is not considered a consensus here

Similar Rationale behind Data Mining and Big Data
• Knowledge discovery from a huge amount of data
  ◦ Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases
• In line with the technology trend!
  ◦ HW, storage, CPU MIPS, network BW, cloud, etc.
• Intelligence and personalization will be key for differentiation

Characteristics of Big Data
• Knowledge discovered from big data
  ◦ Improving decision quality, optimizing processes, and gaining insights in general (tied to domain)
  ◦ Usually not considered an isolated business sector; not analogous to oil
• Slightly different from traditional business intelligence
  ◦ BI: more on data with high information density
  ◦ Big data: more on data with low information density; more application oriented

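Example Big Data Applications in Various Business Sectors
• Finance and insurance: credit rating, customized financial services, credit granting, customer asset management, bad-debt analysis, moral hazard analysis, adverse selection risk analysis, prospective customer list analysis (credit analysis, insurance policy, etc.)
• Retail (including e-commerce): a basis for real-time purchase decision assistance (via proper recommendation), plus decision support systems for integrating and allocating merchandise, shelf space, and logistics (e.g., 7-11) (EC is an emerging area!)
• Manufacturing: expert decision support systems for choosing optimal production factors during the production process, plus optimized inventory control and supply chain and customer profitability analysis
• Chain stores: selecting sites for new stores and product assortments for branches, a decision aid for logistics warehouse locations, and a basis for allocating logistics capacity (e.g., McDonald's, etc.)
• Healthcare: driver analysis for medical operations cost management; a source for medical analysis or personalized patient services
• Telecom: optimized network traffic allocation and customized services, plus real-time online customized information assistance systems, customized portals, and promotion support; operation analysis (e.g., alarm system analysis) due to system scale
• Biotech: R&D platforms and the tools needed for analysis, accelerating the accumulation of R&D capacity (genome analysis)
• Education: analysis of prospective student source lists, data mining for rating admission and scholarship applications, and a basis for student course planning and career planning (e.g., MOOC)
• Advertising: analysis of ad click sources, response rate analysis, marketing strategy support (augmented with LBS in mobile devices)
• Non-profit organizations: contact lists for fundraising letters and correspondence (including SN analysis)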

Some More Words on Big Data
• Primary sources of big data:
  ◦ Social network activities
  ◦ Internet of things (i.e., from sensor networks)
  ◦ Multimedia (mainly video)
• New methods are required to overcome the new challenges imposed by big data
  ◦ Streaming data, unstructured data, data from various sources, etc.
  ◦ Traditional RDBs cannot handle these efficiently

Tool: source: http://www.bigdata-startups.com/open-source-tools/

Now, Big Data in a Social Network
• A social network is usually composed of millions of nodes and links (homogeneous or heterogeneous)
• The huge (volume), fast-changing (velocity), and diversified (variety) information in a social network poses very challenging management and analysis issues for researchers
From twitter.com

Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
• Issues to Address
(In this part, we shall use examples to illustrate the concepts. Those who are interested in technical details are referred to related publications.)

Graph Extraction
執簡御繁: to handle complicated things with simple skills.
• Application/goal-oriented data extraction
• Three levels of information extraction from SNs
  ◦ Parameter extraction (e.g., company stat.): fast calculation of closeness centrality (ICDM 13)
  ◦ Feature extraction (e.g., company biz.): activity willingness optimization (VLDB 14)
  ◦ Structure extraction (e.g., company org.): decomposing SN graphs (ASONAM 14)

[Figure; recoverable labels: Structure extraction, Feature extraction, weapon]

Outline
• Walkthrough on Big Data
• Information Extraction from an SN Graph
  ◦ Capturing key parameters (parameter extraction)
  ◦ Activity willingness optimization (feature extraction)
  ◦ Decomposing SN graphs (structure extraction)
• Issues to Address

Closeness Centrality
• There are several interesting quantities in SN graphs, including closeness centrality, network diameter, and degree distribution.
• Closeness centrality of node v, Cc(v): the inverse of the average shortest path distance from v to any other node in a network.
  ◦ If Cc(v) is large, v is around the center, as it requires only a few hops to reach others.
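The definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming an unweighted, undirected, connected graph stored as an adjacency dictionary; the function names and the five-node path graph are hypothetical and are not taken from the talk.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def closeness_centrality(adj, v):
    """Inverse of the average shortest-path distance from v to all other nodes.

    Assumption: the graph is connected, so every other node is reachable.
    """
    dist = bfs_distances(adj, v)
    total = sum(d for node, d in dist.items() if node != v)
    return (len(adj) - 1) / total


# Hypothetical toy graph: a path a - b - c - d - e
adj = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
cc_middle = closeness_centrality(adj, "c")  # distances 2, 1, 1, 2
cc_end = closeness_centrality(adj, "a")     # distances 1, 2, 3, 4
```

In this toy graph the middle node c gets a larger value than the end node a, which matches the reading that a node with large Cc(v) sits near the center.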

Response to Dynamic Changes
• Edge insertions and deletions are frequent in a social network
  ◦ It is desirable to quickly update the closeness centrality of every node in response to an edge insertion/deletion.
• Example use: pick a number of people (the nodes with high CCs) who can maximize advertisement effectiveness.

Example of Closeness Centrality
• Cc(v): the inverse of the average shortest path distance from v to other nodes
• Thus, node w is closer to all other nodes than node v is.
[Figure: an unweighted and undirected graph G with 14 nodes and 18 edges]

Calculating Closeness Centrality
• Note that only some pairs of shortest paths will be affected by a given edge change.
  ◦ Identify them (unstable node pairs) for fast calculation of CC

Example
For example, with the addition of (a, b):
• Unchanged shortest paths
  ◦ p(b, v), p(c, t) and p(r, h), etc.
• Changed shortest paths
  ◦ Before edge insertion: p(a, b)={a, d, w, b}, p(a, c)={a, d, w, r, c} and p(u, v)={u, l, o, d, w, r, s, v}, etc.
  ◦ After edge insertion (we then call these nodes unstable): p(a, b)={a, b}, p(a, c)={a, b, c} and p(u, v)={u, x, a, b, c, v}, etc.
[Figure: (a) the original unweighted and undirected graph G; (b) G'=G∪e(a, b)]

Illustration of Unstable Node Pairs
• To find V'u, the u-unstable node set: the nodes whose shortest paths to u changed after the edge addition (that is, the nodes whose shortest distance to u will change)
  ◦ Unstable node pairs: (u, b), (u, c), (u, h), (u, v) and (u, t)
  ◦ V'u={b, c, h, v, t}
[Figure panels: Gu, G'u]

(Main Theorem) After the addition of edge (a, b), every unstable node pair {v, u} (a pair whose shortest path changed) will have v ∈ V'a and u ∈ V'b.
• Only these shortest paths will change after the edge addition (and need to be re-calculated)
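The statement can be checked on a small example. The sketch below is only a brute-force illustration, not the algorithm from the cited work: it recomputes all distances before and after an insertion, builds the a-unstable and b-unstable sets, and confirms that every changed pair has one endpoint in each set. The helper names and the toy path graph are hypothetical, and the graph is assumed to be connected and unweighted.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def all_distances(adj):
    """All-pairs hop distances, one BFS per node (fine for a toy graph)."""
    return {u: bfs_distances(adj, u) for u in adj}


def with_edge(adj, a, b):
    """Copy of the graph with the undirected edge (a, b) added."""
    new_adj = {u: set(nbrs) for u, nbrs in adj.items()}
    new_adj[a].add(b)
    new_adj[b].add(a)
    return new_adj


def unstable_set(before, after, u):
    """Nodes whose shortest distance to u changed (the u-unstable set)."""
    return {v for v in before[u] if after[u][v] != before[u][v]}


# Hypothetical toy graph: path 0 - 1 - 2 - 3 - 4, then insert edge (0, 4)
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
a, b = 0, 4
before = all_distances(adj)
after = all_distances(with_edge(adj, a, b))

unstable_a = unstable_set(before, after, a)
unstable_b = unstable_set(before, after, b)

# Pairs whose shortest distance actually changed
changed = {
    frozenset((u, v))
    for u, v in combinations(adj, 2)
    if before[u][v] != after[u][v]
}
# Pairs the theorem says we need to look at
candidates = {frozenset((v, u)) for v in unstable_a for u in unstable_b if v != u}
```

Here only the candidate pairs need their distances recomputed, which is where the saving comes from on a large graph; the all-pairs recomputation in this sketch exists purely to verify the claim.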

Remark
• Experiments were done with Hadoop (MapReduce) on the DBLP dataset
• With fast calculation of closeness centrality, shortest-path-preserving sparsification can be done efficiently by identifying those edges whose removal least affects CC.
• New algorithms are called for to efficiently calculate other key parameters in fast-changing social networks

Evolution of Activity Formation
• Extracted information has been shown to be helpful for activity formation in social networks
  ◦ Socio-Spatial Group Query [Yang et al., KDD-12]
  ◦ Considering time, social and spatial factors
• As more and more information can be mined from a social network, we can take user interest (i.e., willingness) into consideration when planning an activity [Shuai et al., VLDB-14]

[Table: a grid of time slots ts1 to ts6 against Mike Lee, Tony Wang, Peter Chen, Jack Lin, Jane Lee, Grace Yang, John Chen and Mary Fang, with cells marked "O"; the cell positions were lost in extraction]
2018/3/19

What Can be Done Further?
• Time+Social+Spatial (Heterogeneous SN)
  ◦ "Wow! I found a good restaurant with buy-2-get-2-free for lunch."
  ◦ "Let me ask some friends to come for this great deal!"

[Figure; recoverable labels: Alice, Bob, Cindy, John, Mary; Activity location; Total travel distance: 10 km]

Implementation of SSGQ
[Screenshot; recoverable labels: Group size, Activity Location, Familiarity Constraint]

Implementation of SSGQ (cont'd)
[Screenshot; recoverable labels: Selected Group, Attendees' current locations]
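The inputs and outputs named on these two slides (group size, activity location, familiarity constraint, selected group) can be tied together with a small exhaustive-search sketch. The formulation used here is an assumption for illustration: choose a group of the requested size that minimizes total straight-line travel distance to the activity location, subject to each attendee being unacquainted with at most `max_unfamiliar` other attendees. The function name, coordinates and friendship lists are hypothetical, and a real system would need something far better than brute force.

```python
from itertools import combinations
from math import dist


def ssgq_brute_force(locations, friends, activity_location, group_size, max_unfamiliar):
    """Try every group of the given size and keep the cheapest feasible one.

    Assumed familiarity constraint: each attendee may be unacquainted with
    at most max_unfamiliar of the other attendees.
    """
    best_group, best_cost = None, float("inf")
    for group in combinations(sorted(locations), group_size):
        feasible = all(
            sum(1 for other in group
                if other != person and other not in friends[person]) <= max_unfamiliar
            for person in group
        )
        if not feasible:
            continue
        cost = sum(dist(locations[p], activity_location) for p in group)
        if cost < best_cost:
            best_group, best_cost = group, cost
    return best_group, best_cost


# Hypothetical data: current locations (km grid) and friendship lists
locations = {
    "Alice": (0, 1), "Bob": (1, 0), "Cindy": (0, -1),
    "John": (3, 0), "Mary": (0, 4),
}
friends = {
    "Alice": {"Bob", "John"}, "Bob": {"Alice", "John"}, "Cindy": {"Mary"},
    "John": {"Alice", "Bob"}, "Mary": {"Cindy"},
}
activity = (0, 0)

strict = ssgq_brute_force(locations, friends, activity, group_size=3, max_unfamiliar=0)
loose = ssgq_brute_force(locations, friends, activity, group_size=3, max_unfamiliar=2)
```

With no strangers allowed, the search has to pick the three mutual friends even though one of them lives farther away; relaxing the constraint lets it pick the three closest people instead.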

Ongoing Experiments on Facebook (with willingness considered)

Diffusion Analysis in Social Networks
• Diffusion of information can be used to model the interaction among nodes in a network, e.g.,
  ◦ Viruses spreading over the internet.
  ◦ Disease spreading in the community.
  ◦ Rumors/news spreading among humans.

Example Diffusion
• Information diffusion can happen in social networks, such as Facebook and Twitter.
[Figure: nodes 0, 1, 2, 3; panels: Underlying network, Path of Infection]

The Network is Hidden
• In some situations, the underlying network is not known (due to cost or privacy issues).
• The network inference problem (NIP) is studied to discover the underlying network
  ◦ To infer the network from what happened.

Network Inference Problem
[Figure: nodes 0, 1, 2; the slide text was not recovered in extraction]

Clustering Cascades
• Traditionally, NIP assumes there is one underlying network, which may not always be true in reality
  ◦ e.g., sports news, political news, and entertainment news are likely to spread in different ways
• Hence, we would like to cluster cascades so that the cascades in each cluster spread in the same pattern
  ◦ An SN graph is hence decomposed into application-specific ones

Example Cascades
• Cascade a (Lakers news)
• Cascade b (49ers news)
• Cascade c (Redskins news)
• Cascade d (Heat news)
• Cascade e (Jets news)
• Cascade f (Celtics news)
[Figure: one small infection graph per cascade over nodes 0 to 3; the edges were lost in extraction]

To Model Inference Network (as before)
[Slide content was not recovered in extraction]

Possible Inference Network (obtained by traditional method)
[Figure: inferred network with edge weights, as extracted: 0.25, 0.5, 0.17, 0.5, 0.67, 0.25, 0.67, 0.17, 0.5, 0.25]

To Cluster Cascades by K-Means
[Slide content was not recovered in extraction]

Graph Decomposition
• By considering cascades {a, d, f} and cascades {b, c, e} independently (based on which nodes are infected), the original SN graph is decomposed in accordance with the information carried.
[Figure: two networks, Cascades {b, c, e} (NFL) and Cascades {a, d, f} (NBA), with edge weights 0.25, 0.5, 0.67, 0.33, 0.5, 0.17]
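One simple way to picture clustering cascades "based on which nodes are infected" is to encode each cascade as a 0/1 vector over the nodes and run k-means on those vectors. The sketch below does exactly that in plain Python. The infected-node sets are invented for illustration (the cascade figures did not survive extraction), the initial centroids are fixed so the run is deterministic, and the features actually used in the cited work may differ.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, init_centroids, iterations=10):
    """Plain Lloyd's algorithm with caller-supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid
        labels = [
            min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical infected-node sets over nodes 0..4 for cascades a..f
infected = {
    "a": {0, 1, 2}, "b": {3, 4}, "c": {2, 3, 4},
    "d": {0, 1}, "e": {0, 3, 4}, "f": {1, 2},
}
names = sorted(infected)
vectors = [[1 if node in infected[name] else 0 for node in range(5)] for name in names]

# Seed the two clusters with cascades a and b (an arbitrary, fixed choice)
labels = kmeans(vectors, [vectors[names.index("a")], vectors[names.index("b")]])

clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(label, []).append(name)
```

With this made-up data the two clusters come out as {a, d, f} and {b, c, e}; a separate network can then be inferred from each cluster instead of one dense network from all six cascades.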

Remark
• Traditionally, NIP results in a dense and complex network, from which it is difficult to capture knowledge.
• By properly clustering cascades, we can obtain a few concise networks that carry clearer information
• These resulting networks match the corresponding cascades better than a single dense network does.

Issues to Address
• Issues which either uniquely occur, or will become more prevalent, in social networks
  ◦ To discuss those from the perspective of (1) users, (2) events, (3) time, (4) platform, and (5) data

Issues to Address (1st, on Users)
• From collaborative filtering to social filtering
  ◦ Traditional collaborative filtering (CF) is used in recommendation systems.
  ◦ Recently, with the prosperity of social network sites, social filtering (SF) has become more prevalent.
• The social network services required will be very user-dependent and human-centric

Use CF for Recommendation
[Figure; recoverable labels: recommend, similar]

Use SF for Recommendation (i.e., letting your friends decide)
[Figure; recoverable labels: recommend, friends; speech bubble: "This cake is AWESOME!"]

Issues to Address (2nd, on Events)
• Bridging real and virtual lives
  ◦ e.g., construction of a weighted SR graph
  ◦ e.g., reading the same book (1 pt), having lunch together (2 pts), going to a movie together (3 pts), etc.
• Mismatch for confidence level
  ◦ The confidence level of the social relationship discovered might not be high (quite subjective and ad hoc)
  ◦ However, proper weighting may vary from one person to another
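The point scheme on this slide suggests a straightforward construction: add up points per pair of people over a log of shared events. The sketch below assumes "SR" stands for social relationship and that each event involves exactly two people; the event log, names and function are hypothetical. Because the slide notes that proper weighting may vary from one person to another, the point table is passed in as a parameter rather than fixed.

```python
from collections import defaultdict

# Point values taken from the slide; the activity keys are made-up labels
ACTIVITY_POINTS = {"same_book": 1, "lunch": 2, "movie": 3}


def build_sr_graph(events, points=ACTIVITY_POINTS):
    """Accumulate an undirected edge weight for every pair of people.

    events: iterable of (person_a, person_b, activity) tuples.
    """
    weights = defaultdict(int)
    for person_a, person_b, activity in events:
        edge = tuple(sorted((person_a, person_b)))  # undirected: order-free key
        weights[edge] += points[activity]
    return dict(weights)


# Hypothetical event log
events = [
    ("Tom", "Amy", "lunch"),
    ("Amy", "Tom", "movie"),
    ("Tom", "Bob", "same_book"),
]
graph = build_sr_graph(events)

# A person-specific table, e.g. someone for whom shared reading matters more
bookish = build_sr_graph(events, points={"same_book": 3, "lunch": 2, "movie": 1})
```

Swapping the table changes which relationship looks strongest, which is the subjectivity concern raised on the slide.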

Issues to Address (3rd, on Time)
• Streaming mining for real-time decisions (no single snapshot)
  ◦ "Of all the martial arts under heaven, only speed is unbeatable"
• Not only summarize the social information, but also find the trend of evolution (2nd-order mining)
  ◦ Mining on summarized data
  ◦ e.g., not just discovering what Tom's favorite song is, but learning that Tom changes his favorite every 3 months
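The favorite-song example can be made concrete with a tiny sketch. First-order mining would report the current favorite; the second-order step below looks at when the favorite changed and how far apart those changes are. The monthly observation format and the data are assumptions for illustration only.

```python
def change_intervals(observations):
    """Gaps (in months) between consecutive changes of the observed value.

    observations: (month_index, favorite) pairs in chronological order,
    processed one at a time as they would arrive from a stream.
    """
    change_months = []
    previous = None
    for month, favorite in observations:
        if previous is not None and favorite != previous:
            change_months.append(month)
        previous = favorite
    return [later - earlier for earlier, later in zip(change_months, change_months[1:])]


# Hypothetical monthly snapshots of Tom's favorite song
observations = [
    (0, "song A"), (1, "song A"), (2, "song A"),
    (3, "song B"), (4, "song B"), (5, "song B"),
    (6, "song C"), (7, "song C"), (8, "song C"),
    (9, "song D"),
]
current_favorite = observations[-1][1]       # first-order summary
intervals = change_intervals(observations)   # second-order summary
```

A constant interval list such as the one produced here is the kind of regularity the slide refers to: the pattern lies in how the summary evolves, not in any single snapshot.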

Issues to Address (4th, on Platform)
• With the availability of mobile devices and the paradigm shift to cloud computing, everyone will have 1 Gb for comm., unlimited storage, and access to data sources worldwide
  → leading to the new era of the "superman" (with different ways of thinking and doing things)
  → will have an even faster increase in the variety of social network activities, in particular those related to LBS

Issues to Address (5th, on Big Data)
• To process big data, i.e., a huge volume of fast-increasing (velocity) data of different types (variety) with unclear veracity and domain-dependent value
• To integrate different data sources
  ◦ e.g., locations where photos were shot, user purchase behavior, the SNs he/she is involved in
• Objective: Volume, Velocity
• Subjective: Variety, Veracity, Value

Other Important Issues
• Mining-assisted social media content management
  ◦ Service with more intelligence required
• Privacy preservation in social information processing
• …more

Conclusion
• Due to the paradigm shift to cloud computing and the fast increase in the availability of mobile devices, big data processing in social networks is having an unprecedented impact on our lives
• Key factors for the arrival of the big data era: mobile, social networks, and cloud

Thank you!