When small data is better data, Paul Francis

  • Number of slides: 34

When small data is better data
Paul Francis, MPI-SWS
Ruichuan Chen, Ekin Akkus, Johannes Gehrke

private When small data is better data
Paul Francis, MPI-SWS
Ruichuan Chen, Ekin Akkus, Johannes Gehrke

The user data “understanding”
  • There is a tacit understanding among users that if you send data to a company, the company is free to use it however it wishes
    – OK: Facebook “knowing” all kinds of personal information
    – LESS OK: Doubleclick monitoring your browsing behavior
    – NOT OK: Google gathering WLAN traffic during drive-by

Leads to data gathering services
  • Companies build (free) services designed to gather as much data about users as they can
    – And often secretly gather data about users when they can’t
  • Then try to monetize that data
    – Mainly through advertising
    – Though Jean Bolot had some interesting ideas

Leads to “big data” (mining)
  • Companies gather what they can, but don’t always get what they want
    – Google knows your searches, but not your relationship status
    – Facebook knows your relationship status, but not what you buy
    – Amazon knows what you buy, but not what you search for
  • So they use big data mining to infer what they don’t know

A new user data understanding
  • It is OK to monetize (or otherwise benefit) from user data if:
    – The user data is very expensive to collect in any identifiable form
    – Users can know what is going on, and users can opt out

Why is this interesting?
  • Keeping user data on the user device is the key to user privacy
  • Most user data is at, or has passed through, the user device
    – Search and browsing in browser history
    – Facebook user profile easily scraped
    – Amazon purchases easily scraped

Premise of “Private by Design”
  • If we can monetize user data, without collecting user data, then we have legitimate access to far more user data
    – Less need to deal with big data
    – Better monetization, less overhead

My group’s research agenda
  • “Private by Design” behavioral advertising
  • “Private by Design” aggregate analytics

My group’s research agenda
  • “Private by Design” behavioral advertising
  • “Private by Design” aggregate analytics

Aggregate Analytics
  • Web analytics: want to know demographics of user base, what other websites users visit, etc.
  • App analytics: want to know what other apps users run (competitors)
  • Mobile analytics, general analytics, …

Typical database privacy settings: trusted component sees database
[Diagram labels: Untrusted Analyst; query; Database’ (anonymize); Query Module (add noise); query; Trusted Database]

Our setting: nobody (except user) sees individual user data
[Diagram labels: Untrusted Analyst; ? ? ?; Untrusted; Data]

Previous work in our setting
  • Assumed differential privacy
  • Poor scaling characteristics, and/or
  • Could not tolerate user fraud
Our goal: Assume differential privacy, but fix scaling and user fraud problems.
[Diagram labels: Analyst; ? ? ?; Data]

Differential privacy
  • Differential privacy adds noise to the output of a computation (i.e., a query).
[Diagram labels: Database; DB 1; DB 2 (differs by one user); Query Module (add noise); Analyst]
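To make "adds noise to the output of a query" concrete, here is a minimal sketch of a noisy counting query. It assumes the textbook Laplace mechanism with sensitivity 1; the slides do not say which noise distribution a query module would use, and `noisy_count`, `db1`, `db2`, and `epsilon` are hypothetical names chosen for illustration.

```python
import random

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return true_count plus Laplace noise of scale 1/epsilon.

    Assumption: a counting query changes by at most 1 when one user is
    added or removed, so the sensitivity is 1.
    """
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

rng = random.Random(0)
db1 = [1, 0, 1, 1, 0]   # hypothetical per-user yes/no values
db2 = db1 + [1]         # differs from db1 by one user
print(noisy_count(sum(db1), epsilon=0.5, rng=rng))
print(noisy_count(sum(db2), epsilon=0.5, rng=rng))
```

With noise of this scale, the two printed values are hard to tell apart from a single answer, which is the intuition behind the DB 1 / DB 2 comparison on the slide.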

Components & assumptions
  • Analyst is potentially malicious (violating user privacy)
  • Proxy (adds DP noise blindly) is honest but curious
    1) Follows the specified protocol (does not collude)
    2) Tries to exploit additional info that can be learned in so doing
  • Clients are user devices. Clients are potentially malicious (distorting the final results)
[Diagram labels: Analyst; Proxy; Data]

Actually, two proxies!
  • Honest-but-curious proxy must not see user data
  • If one proxy, need expensive public key encryption between clients and analyst
  • If two proxies, can use much cheaper form of encryption (one-time pad)
[Diagram labels: Analyst; Blind Proxy; Data]

  • Sender: Message XOR Random_String = Result
  • Proxy 1 forwards Result; Proxy 2 forwards Random_String
  • Receiver: Result XOR Random_String = Message
[Diagram labels: Sender; Proxy 1; Proxy 2; Receiver]
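The XOR split on this slide can be sketched in a few lines. This is an illustration only: `split` and `combine` are hypothetical helper names, and the transport through the two proxies is left out.

```python
import secrets
from typing import Tuple

def split(message: bytes) -> Tuple[bytes, bytes]:
    """Sender side: Message XOR Random_String = Result."""
    random_string = secrets.token_bytes(len(message))  # one-time pad, used once
    result = bytes(m ^ r for m, r in zip(message, random_string))
    return result, random_string  # one part per proxy

def combine(result: bytes, random_string: bytes) -> bytes:
    """Receiver side: Result XOR Random_String = Message."""
    return bytes(a ^ b for a, b in zip(result, random_string))

result, random_string = split(b"yes")
# Proxy 1 would carry `result`, Proxy 2 would carry `random_string`;
# either part alone is a uniformly random byte string.
print(combine(result, random_string))  # prints b'yes'
```

The pad must be as long as the message and never reused, which is what keeps this cheaper scheme safe as long as the two proxies do not collude.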

Queries are counting queries:
  • Ex: How many users … are male and between ages of 10-20?
[Diagram labels: Analyst; Blind Proxy; Data]

Clients answer ‘yes’ or ‘no’ only
[Diagram labels: Analyst; Blind Proxy; Data]

Proxies add N additional random yes/no answers (coins)
  • N = 2σ²
  • But a proxy must not know how many yeses and nos it added!
[Diagram labels: Analyst; Blind Proxy; Data]

Each proxy independently adds N random coins
  • XOR at analyst will produce random result
  • But neither proxy knows what the result will be
[Diagram labels: Analyst; Blind Proxy; Data]
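A small sketch of why the coins stay blind: each proxy contributes only one random share per coin, and the coin value exists only after the analyst XORs the two shares. The function name `proxy_coin_shares` and the choice of N = 8 are illustrative assumptions.

```python
import secrets

def proxy_coin_shares(n: int) -> list:
    """One proxy's contribution: n independent random bits."""
    return [secrets.randbits(1) for _ in range(n)]

n = 8  # illustrative; the slides set N from the desired noise level
shares_1 = proxy_coin_shares(n)  # known only to proxy 1
shares_2 = proxy_coin_shares(n)  # known only to proxy 2

# Analyst side: the i-th coin is the XOR of the i-th shares.
coins = [a ^ b for a, b in zip(shares_1, shares_2)]
print(coins)
```

Because the other proxy's share is uniformly random and unknown, seeing one's own share says nothing about whether a given coin came out yes or no.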

Coins and answers
[Diagram labels: Analyst; Blind Proxy; Data]

Decrypt and tabulate
[Diagram labels: Analyst; Blind Proxy; Blind Proxy; Data]
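"Decrypt and tabulate" for a single bucket might look like the sketch below. The slides do not say how the analyst post-processes the count; subtracting the expected N/2 yeses contributed by fair coins is an assumption made here, and `tabulate` is a hypothetical name.

```python
def tabulate(shares_1, shares_2, n_coins):
    """XOR paired shares from the two proxies and estimate the 'yes' count."""
    bits = [a ^ b for a, b in zip(shares_1, shares_2)]  # decrypt each bit
    raw_yes = sum(bits)  # client yeses plus coin yeses, indistinguishable
    # Assumption: fair coins add n_coins / 2 yeses on average, so remove that.
    return raw_yes - n_coins / 2

# Four shuffled entries for one bucket (two client answers, two coins).
shares_1 = [1, 0, 1, 1]
shares_2 = [0, 0, 1, 0]
print(tabulate(shares_1, shares_2, n_coins=2))  # prints 1.0
```

The analyst sees only the decrypted bits, not which entries were coins, so the estimate is noisy by design.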

Buckets
  • Not “is your age between 10-20?”,
    – but “are you 1?”, “are you 2?”, “are you 3?” …
  • Query is generally a vector of yes/no questions
    – Answer is a vector of 1’s and 0’s
  • Vector can be big:
    – List of 20K websites
    – 185K combinations of 10 of 20 attributes
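The bucket idea can be illustrated with a short sketch: a client turns one private value into a vector of 1's and 0's, one entry per bucket. The helper name `bucket_vector` and the age range 1 to 100 are assumptions for the example. The last line checks the "185K" figure, since choosing 10 of 20 attributes gives C(20, 10) combinations.

```python
import math

def bucket_vector(value, buckets):
    """Answer 'are you b?' for every bucket b with 1 (yes) or 0 (no)."""
    return [1 if value == b else 0 for b in buckets]

ages = list(range(1, 101))        # hypothetical buckets: "are you 1?", "are you 2?", ...
answer = bucket_vector(17, ages)  # a 17-year-old client
print(sum(answer), answer.index(1))  # prints 1 16

print(math.comb(20, 10))  # prints 184756
```

Each entry of the vector is then a separate yes/no answer, which is what lets the proxies add coins and shuffle per bucket.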

Proxies add coins and shuffle user answers (per bucket)
Per-user answers received by each proxy:
  u1: b1, b2, b3, …
  u2: b1, b2, b3, …
  u3: b1, b2, b3, …
  …
Per-bucket lists after adding coins (c) and shuffling:
  b1: u4, u12, c2, …
  b2: u6, c3, u19, …
  b3: u12, c7, u6, …
[Diagram labels: Analyst; Blind Proxy; Blind Proxy; Data]

[Diagram: Analyst, two Blind Proxies, and Data; each proxy holds the per-user lists (u1: b1, b2, b3, …) and the shuffled per-bucket lists (b1: u4, u12, c2, …)]

The shuffling at each proxy must be identical (though random), because each bit must be paired with its XOR partner
[Diagram: Analyst, two Blind Proxies, and Data; each proxy holds the per-user lists (u1: b1, b2, b3, …) and the shuffled per-bucket lists (b1: u4, u12, c2, …)]

But the proxies may have a (slightly) different set of answers.
[Diagram: Analyst, two Blind Proxies, and Data; each proxy holds the per-user lists (u1: b1, b2, b3, …) and the shuffled per-bucket lists (b1: u4, u12, c2, …)]

Synchronize the list of answers. Share a random seed for a random number generator, and use it to shuffle.
[Diagram: Analyst, two Blind Proxies, and Data; each proxy holds the synchronized per-user lists (u1: b1, b2, b3, …)]
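The shared-seed shuffle can be sketched as follows. Both proxies hold different shares for the same synchronized list of entries; seeding the same generator gives both the same permutation, so XOR partners stay aligned. `shuffle_bucket` is a hypothetical name, and Python's `random.Random` is used only for illustration (it is not a cryptographically secure generator, which a real deployment would presumably need).

```python
import random

def shuffle_bucket(shares, seed):
    """Permute one bucket's list with a permutation derived from the shared seed."""
    order = list(range(len(shares)))
    random.Random(seed).shuffle(order)  # same seed and length give the same order
    return [shares[i] for i in order]

shared_seed = 42  # agreed between the two proxies (hypothetical value)
proxy_1 = ["u4:r", "u12:r", "c2:r"]  # Result halves, labelled for readability
proxy_2 = ["u4:s", "u12:s", "c2:s"]  # Random_String halves, same entry order

out_1 = shuffle_bucket(proxy_1, shared_seed)
out_2 = shuffle_bucket(proxy_2, shared_seed)
# Position i in out_1 and out_2 still refers to the same user or coin.
print([a.split(":")[0] == b.split(":")[0] for a, b in zip(out_1, out_2)])
```

This only works if both lists have the same length and entry order before shuffling, which is why the answer lists are synchronized first.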

Time
  • Queries (unfortunately) take time:
    – There is a period of time during which a query is active
  • Tens of minutes, hours, or days???
[Timeline labels: Start query; Synchronize and add coins; Clients pull in and answer queries]

Differential Privacy, good and bad
  • Good:
    – Adds noise
    – Lots of machinery being built
  • Bad:
    – Very pessimistic (measure of privacy loss is almost certainly way worse than actual privacy loss)
    – “Throwing away the database” not realistic

From INTIMATE workshop
  • Jean’s mobility a good application
  • Collaborative filtering (Bach, Aruna) looks hard to do
  • Serge’s social knowledge may be centered on user devices…
    – Query for people’s opinions…
  • Real-time analytics may be possible
    – Streamed coin addition???

Status and future
  • Building an application analytics tool
    – Initial focus is PC platforms
    – Hope to get real app developers to bundle our tool
  • Additional privacy mechanisms (beyond differential privacy)
  • Work on better understanding of privacy loss in a realistic setting