99b862778c4719db590bbc094d109540.ppt
- Number of slides: 34
When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke
private When small data is better data Paul Francis, MPI-SWS Ruichuan Chen, Ekin Akkus, Johannes Gehrke
The user data “understanding” • There is a tacit understanding among users that if you send data to a company, the company is free to use it however it wishes – OK: Facebook “knowing” all kinds of personal information – LESS OK: Doubleclick monitoring your browsing behavior – NOT OK: Google gathering WLAN traffic during drive-bys
Leads to data gathering services • Companies build (free) services designed to gather as much data about users as they can – And often secretly gather data about users when they can’t • Then try to monetize that data – Mainly through advertising – Though Jean Bolot had some interesting ideas
Leads to “big data” (mining) • Companies gather what they can, but don’t always get what they want – Google knows your searches, but not your relationship status – Facebook knows your relationship status, but not what you buy – Amazon knows what you buy, but not what you search for • So they use big data mining to infer what they don’t know
A new user data understanding • It is OK to monetize (or otherwise benefit from) user data if: – The user data is very expensive to collect in any identifiable form – Users can know what is going on, and users can opt out
Why is this interesting? • Keeping user data on the user device is the key to user privacy • Most user data is at, or has passed through, the user device – Search and browsing in browser history – Facebook user profile easily scraped – Amazon purchases easily scraped
Premise of “Private by Design” • If we can monetize user data, without collecting user data, then we have legitimate access to far more user data – Less need to deal with big data – Better monetization, less overhead
My group’s research agenda • “Private by Design” behavioral advertising • “Private by Design” aggregate analytics
Aggregate Analytics • Web analytics: want to know demographics of user base, what other websites users visit, etc. • App analytics: want to know what other apps user runs (competitors) • Mobile analytics, general analytics, ….
Typical database privacy settings: a trusted component sees the database [Diagram: an untrusted analyst sends queries either to an anonymized copy of the database (Database’) or to a trusted query module that adds noise to answers computed over the database.]
Our setting: nobody (except the user) sees individual user data [Diagram: untrusted analyst, an unspecified intermediary (“???”), untrusted data sources.]
Previous work in our setting • Assumed differential privacy • Poor scaling characteristics, and/or • Could not tolerate user fraud • Our goal: assume differential privacy, but fix the scaling and user fraud problems [Diagram: analyst, an unspecified intermediary (“???”), data.]
Differential privacy • Differential privacy adds noise to the output of a computation (i.e., a query) [Diagram: the analyst queries, through a query module that adds noise, databases DB1 and DB2, which differ by one user.]
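To make "adds noise to the output of a query" concrete, here is a minimal sketch of a noisy counting query using the textbook Laplace mechanism. This is an illustration only: the slides do not say which noise distribution a trusted query module would use, and the names `noisy_count`, `epsilon`, `db1` and `db2` are hypothetical.

```python
import random

def noisy_count(records, predicate, epsilon, rng=random):
    """Counting query plus Laplace noise of scale 1/epsilon.

    A count changes by at most 1 when one user is added or removed,
    so the sensitivity is 1 (standard Laplace mechanism, assumed here).
    """
    true_count = sum(1 for r in records if predicate(r))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Two hypothetical databases that differ by one user.
db1 = [{"age": 15}, {"age": 34}, {"age": 18}]
db2 = db1 + [{"age": 12}]

def is_10_to_20(record):
    return 10 <= record["age"] <= 20

print(noisy_count(db1, is_10_to_20, epsilon=0.5))
print(noisy_count(db2, is_10_to_20, epsilon=0.5))
```

Because of the noise, the analyst cannot reliably tell from a single answer whether it was computed over `db1` or `db2`.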
Components & assumptions • The analyst is potentially malicious (may violate user privacy) • The proxy adds DP noise blindly and is honest but curious: 1) it follows the specified protocol (and does not collude); 2) it tries to exploit any additional information it can learn in so doing • Clients are user devices and are potentially malicious (may distort the final results)
Actually, two proxies! • An honest-but-curious proxy must not see user data • With one proxy, we need expensive public-key encryption between clients and the analyst • With two proxies, we can use a much cheaper form of encryption (a one-time pad) [Diagram: analyst, blind proxy, data.]
Sender: Message XOR Random_String = Result • The sender sends Result via Proxy 1 and Random_String via Proxy 2 • Receiver: Result XOR Random_String = Message
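The XOR split above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation; `split` and `combine` are hypothetical names, and the pad is drawn from Python's `secrets` module.

```python
import secrets

def split(message: bytes):
    """Sender side: return (result, random_string), one share per proxy."""
    random_string = secrets.token_bytes(len(message))
    result = bytes(m ^ r for m, r in zip(message, random_string))
    return result, random_string

def combine(result: bytes, random_string: bytes) -> bytes:
    """Receiver side: XOR the two shares to recover the message."""
    return bytes(a ^ b for a, b in zip(result, random_string))

result, random_string = split(b"yes")
# Proxy 1 sees only `result`, Proxy 2 sees only `random_string`.
print(combine(result, random_string))
```

Each share on its own is uniformly random, so a single proxy learns nothing about the message as long as the two proxies do not collude.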
Queries are counting queries • Example: How many users … are male and between the ages of 10 and 20? [Diagram: analyst, blind proxy, data.]
Clients answer ‘yes’ or ‘no’ only [Diagram: analyst, blind proxy, data.]
The proxies add N additional random yes/no answers (coins), where N = 2σ² • But a proxy must not know how many yeses and noes it added! [Diagram: analyst, blind proxy, data.]
Each proxy independently adds N random coins • The XOR at the analyst will produce a random result • But neither proxy knows what the result will be [Diagram: analyst, blind proxy, data.]
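A small sketch of the blind coin idea: each proxy contributes N random bits, and only the pairwise XOR, computed at the analyst, determines the coin values. The function names are hypothetical and the sketch ignores networking and shuffling.

```python
import secrets

def proxy_coin_shares(n):
    """One proxy's contribution: n independent random bits."""
    return [secrets.randbits(1) for _ in range(n)]

def combine_shares(shares_1, shares_2):
    """Analyst side: XOR the two proxies' bits pairwise to get the coins."""
    return [a ^ b for a, b in zip(shares_1, shares_2)]

n = 8
from_proxy_1 = proxy_coin_shares(n)
from_proxy_2 = proxy_coin_shares(n)
coins = combine_shares(from_proxy_1, from_proxy_2)
print(coins, "yes-coins:", sum(coins))
```

Since each coin is the XOR of two independent random bits, knowing one proxy's bits says nothing about the coin, so neither proxy can tell how many yeses it helped add.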
[Diagram: coins and answers flow from the data, via the blind proxies, to the analyst.]
Decrypt and tabulate [Diagram: analyst, two blind proxies, data.]
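Putting the pieces together for a single yes/no question, a sketch of "decrypt and tabulate" might look as follows. Two assumptions are not in the slides: the analyst corrects the tally by subtracting the expected number of yes-coins, N/2, and the identical shuffle at the proxies is omitted for brevity. All names are hypothetical.

```python
import secrets

def client_shares(answer_bit):
    """Client splits its 1-bit answer into two XOR shares, one per proxy."""
    pad = secrets.randbits(1)
    return answer_bit ^ pad, pad

def run_query(answers, n_coins):
    proxy_1, proxy_2 = [], []
    for bit in answers:
        share_1, share_2 = client_shares(bit)
        proxy_1.append(share_1)
        proxy_2.append(share_2)
    # Each proxy independently appends n_coins random bits (blind coins).
    proxy_1 += [secrets.randbits(1) for _ in range(n_coins)]
    proxy_2 += [secrets.randbits(1) for _ in range(n_coins)]
    # Analyst: XOR paired bits to decrypt, then count the yeses.
    decrypted = [a ^ b for a, b in zip(proxy_1, proxy_2)]
    raw_yes = sum(decrypted)
    # Assumption: remove the expected contribution of the unbiased coins.
    estimate = raw_yes - n_coins / 2
    return raw_yes, estimate

answers = [1, 0, 1, 1, 0, 0, 1]  # hypothetical client answers
print(run_query(answers, n_coins=100))
```

The analyst sees only the noisy tally; the individual decrypted bits are a mix of real answers and coins that it cannot tell apart.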
Buckets • Not “is your age between 10 and 20?”, – but “are you 1?”, “are you 2?”, “are you 3?” … • A query is generally a vector of yes/no questions – The answer is a vector of 1’s and 0’s • The vector can be big: – List of 20K websites – 185K combinations of 10 of 20 attributes
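As an illustration of bucketed answers, the sketch below encodes an age as a vector with one yes/no bit per bucket and shows how a range count can be derived afterwards from per-bucket totals. The helper names and the upper bound of 100 buckets are assumptions for the example.

```python
import math

def age_buckets(age, max_age=100):
    """One bit per bucket: 'are you 1?', 'are you 2?', ..., 'are you max_age?'."""
    return [1 if age == a else 0 for a in range(1, max_age + 1)]

def range_count(bucket_totals, low, high):
    """Derive 'between low and high' from per-age totals (index 0 is age 1)."""
    return sum(bucket_totals[low - 1:high])

ages = [15, 34, 18, 12, 20]  # hypothetical users
totals = [sum(bits) for bits in zip(*(age_buckets(a) for a in ages))]
print(range_count(totals, 10, 20))

# Size of the "combinations of 10 of 20 attributes" vector from the slide.
print(math.comb(20, 10))  # 184756, i.e. roughly 185K
```

One bucketed query therefore answers many range questions at once, at the cost of a longer answer vector per client.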
Proxies add coins and shuffle user answers (per bucket) • Per-user answer vectors from the clients: u1: b1, b2, b3, …; u2: b1, b2, b3, …; u3: b1, b2, b3, … • Per-bucket lists at each proxy, mixing user answers (u) and coins (c): b1: u4, u12, c2, …; b2: u6, c3, u19, …; b3: u12, c7, u6, … [Diagram: analyst, two blind proxies, data.]
The shuffling at each proxy must be identical (though random), because each bit must be paired with its XOR partner [Diagram: both blind proxies hold the same per-user vectors (u1: b1, b2, b3, …) and produce per-bucket lists in the same order (b1: u4, u12, c2, …).]
But the proxies may have a (slightly) different set of answers [Diagram: analyst, two blind proxies with per-user and per-bucket lists, data.]
Synchronize the list of answers • Share a random seed for a random number generator, and use it to shuffle [Diagram: analyst, two blind proxies each holding per-user vectors (u1: b1, b2, b3, …), data.]
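A sketch of the synchronize-then-shuffle step for a single bucket. It assumes each proxy can key the shares it received by some client identifier (the slides do not say how answers are matched) and that both proxies already share `seed`; the function names are hypothetical.

```python
import random

def synchronize(shares_1, shares_2):
    """Keep only clients seen by both proxies, in one canonical order."""
    common = sorted(shares_1.keys() & shares_2.keys())
    return [shares_1[c] for c in common], [shares_2[c] for c in common]

def shared_shuffle(items, seed):
    """Seeded shuffle: the same seed and length give the same permutation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Hypothetical shares for one bucket; proxy 2 never heard from client "u4".
at_proxy_1 = {"u1": 1, "u2": 0, "u3": 1, "u4": 0}
at_proxy_2 = {"u1": 0, "u2": 0, "u3": 0}

list_1, list_2 = synchronize(at_proxy_1, at_proxy_2)
seed = 12345  # agreed between the proxies beforehand
out_1 = shared_shuffle(list_1, seed)
out_2 = shared_shuffle(list_2, seed)
# Position i at both proxies still refers to the same client.
print([a ^ b for a, b in zip(out_1, out_2)])
```

Because both proxies apply the same permutation, every bit stays aligned with its XOR partner while the analyst loses the original client order.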
Time • Queries (unfortunately) take time: – There is a period of time during which a query is active – Tens of minutes, hours, or days??? [Timeline: start query; clients pull in and answer queries; synchronize and add coins.]
Differential Privacy, good and bad • Good: – Adds noise – Lots of machinery being built • Bad: – Very pessimistic (measure of privacy loss is almost certainly way worse than actual privacy loss) – “Throwing away the database” not realistic
From INTIMATE workshop • Jean’s mobility a good application • Collaborative filtering (Bach, Aruna) looks hard to do • Serge’s social knowledge may be centered on user devices … – Query for people’s opinions … • Real-time analytics may be possible – Streamed coin addition???
Status and future • Building an application analytics tool – Initial focus is PC platforms – Hope to get real app developers to bundle our tool • Additional privacy mechanisms (beyond differential privacy) • Work on better understanding of privacy loss in a realistic setting