TDD Topics in Distributed Databases Data cleaning ü

TDD: Topics in Distributed Databases Data cleaning ü Discovering data quality rules (Chapter 3) ü Error detection (Chapter 3) ü Data repairing (Chapter 3) ü Taking record matching and data repairing together (Chapter 7) ü Certain fixes (Chapter 7) 1

A platform for improving data quality Business rules Master data profiling validating Validation error detecting dependencies data repairing automatically discover rules record matching certain fixes standardization Dirty data Clean Data data accuracy data enrichment A practical system is already in use data accuracy monitoring 2

TDD: Research Topics in Distributed Databases Data cleaning ü Discovering data quality rules ü Error detection ü Data repairing ü Record matching and its interaction with data repairing ü Certain fixes 3

Profiling: Discovering conditional dependencies Where do dependencies (data quality rules) come from? ü Manual design: domain knowledge analysis Expens ive ü Business rules Inadeq ü Discovery uate Ø Input: sample data D Ø Output: a cover of conditional dependencies that hold on D sample Profiling quality rules Several effective algorithms for discovering conditional dependencies, developed by researchers in the UK, Canada and the US (e. g. , AT&T) Automatic discovery of data quality rules 4

Assessing the quality of conditional dependencies Ø Input: a set of conditional functional dependencies Ø Output: a maximum satisfiable subset of dependencies MAXSC: are the dependencies Complexity: the MAXSC problem for CFDs is NP-complete discovered “dirty” themselves? Theorem: there is an -approximation algorithm for MAXSC ü there exist constant such that for the subset m found by the algorithm has a bound: card( m) > card(OPT( )) Approximation algorithms ü Efficient: low polynomial time ü Performance guarantee: provable within a small distance Automated methods for reasoning about data quality rules 5

TDD: Research Topics in Distributed Databases Data cleaning ü Discovering data quality rules ü Error detection ü Data repairing ü Record matching and its interaction with data repairing ü Certain fixes 6

Detecting violations of CFDs 1 [CC=44, zip] →[street] 2 [CC=44, AC=131] →[city=Edi] tid name title CC AC phn str city zip t 1 Mike MTS 44 131 1234567 Mayfield t 2 Rick DMTS 44 131 3456789 Chrichton NYC EH 4 8 LE t 3 Phil DMTS 44 131 2909229 Chrichton NYC EH 4 8 LE Edi EH 4 8 LE Input: A set of CFDs, and a database D Output: All tuples in D that violate at least one CFD in Automatically check whether the data is dirty or clean 7

Detecting CFD violations ü Input: a set of CFDs and a database DB ü Output: the set of tuples in DB that violate at least one CFD in Approach: automatically generate SQL queries to find violations Complication 1: consider (R: X Y, Tp), the pattern tableau may be large (recall that each tuple in Tp is in fact a constraint) Goal: the size of the SQL queries is independent of Tp Trick: treat Tp as a data table CINDs can be checked along the same lines 8

Single CFD: step 1 A pair of SQL queries, treating Tp as a data table – Single-tuple violation (pattern matching) – Multi-tuple violations (traditional FDs) (cust(country, area-code, phone street, city, zip), Tp) ü Single-tuple violation: Qc Testing pattern tuples select * from R t, Tp tp where t[country] tp[country] AND t[area-code] tp[area-code] AND t[phone] tp[phone] (t[street] <> tp[street] OR t[city] <> tp[city] OR t[zip] <> tp[zip])) – <>: not matching; – t[A 1] tp[A 1]: (t[A 1] = tp[A 1] OR tp[A 1] = _) 9

Single CFD: step 2 (cust(country, area-code, phone street, city, zip), Tp) ü Multi-tuple violations (the semantics of traditional FDs): Qv select distinct t. country, t. area-code, t. phone from R t, Tp tp where t[country] tp[country] AND t[area-code] tp[area-code] AND t[phone] tp[phone] group by t. country, t. area-code, t. phone having count(distinct street, city, zip) > 1 Tp is treated as a data table The semantics of FDs 10

Multiple CFDs Complication 2: if the set has n CFDs, do we use 2 n SQL queries, and thus 2 n passes of the database DB? Goal: ü 2 SQL queries no matter how many CFDs are in ü the size of the SQL queries is independent of Tp Trick: merge multiple CFDs into one ü Given (R: X 1 Y 1, Tp 1), (R: X 2 Y 2, Tp 2) ü Create a single pattern table: Tm = X 1 X 2 Y 1 Y 2, ü Introduce @, a don’t-care variable, to populate attributes of pattern tuples in X 1 – X 2, etc (tp[A] = @) ü Modify the pair of SQL queries by using Tm 11

Handling multiple CFDs CFD 1: (area state, T 1) CFD 2: (zip state, T 2) zip 07974 90291 01202 state NJ CA _ area _ 212 CFDM: (area, zip state, TM) state _ NY CFD 3: (area, zip state, T 3) area 480 310 zip 95120 90995 state CA CA CFD 2: CFD 1: CFD 3: area @ @ @ _ 212 480 310 zip 07974 90291 01202 @ @ 95120 90995 state NJ CA _ _ NY CA CA Qc: select * from R t, TM tp where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND t[state] <> tp[state] Qv: select distinct area, zip from Macro group by area, zip having count(distinct state) > 1 Don’t care Macro: select (case tp[area] when “@” then “@” else t[area] end) as area. . . from R t, TM tp 12 where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND tp[state] =_

Detecting errors in horizontally partitioned data distributed data 1 [CC=44, zip] →[street] 2 [CC=44, AC=131] →[city=Edi] tid name title CC AC phn str t 1 Mike MTS 44 131 1234567 Mayfield Partiti name t 2 t 3 title zip NYC EH 4 8 LE oned b Horizontal partition tid city y title CC AC phn str city Rick DMTS 44 131 3456789 Chrichton NYC EH 4 8 LE Phil DMTS 44 131 2909229 Chrichton Edi zip EH 4 8 LE To find violations of 1, it is necessary to ship data (part of t 1 or t 2) from one site to another 13

Detecting errors in vertically partitioned data 1 [CC=44, zip] →[street] 2 [CC=44, AC=131] →[city=Edi] tid name title CC AC phn str city zip t 1 Mike MTS 44 131 1234567 Mayfield t 2 Rick DMTS 44 131 3456789 Chrichton NYC EH 4 8 LE t 3 Phil DMTS 44 131 2909229 Chrichton NYC EH 4 8 LE Edi EH 4 8 LE vertical partition tid name title CC AC phn tid t 1 131 1234567 t 1 t 2 Mikephon MTS 44 e Rick DMTS 44 131 3456789 t 2 Mayfield NYC EH 4 8 LE add res Chrichton NYC EH 4 8 LE s t 3 Phil 131 2909229 t 3 Chrichton DMTS 44 str city Edi To find violations of 2, it is necessary to ship data zip EH 4 8 LE 14

Error detection in distributed data The error detection problem for CFDs is NP-complete for ü horizontally partitioned data, and ü vertically partitioned data, dist ribu ted data when either minimum data shipment or minimum response time is concerned. In contrast, error detection in centralized data is trivial Heuristic (approximation) algorithms Read: Detecting Inconsistencies in Distributed Data Error detection in distributed data is far more intriguing than its centralized counterpart 15

TDD: Research Topics in Distributed Databases Data cleaning ü Discovering data quality rules ü Error detection ü Data repairing ü Record matching and its interaction with data repairing ü Certain fixes 16

Data repairing: Fixing the errors identified Ø Input: a set of conditional dependencies, and a database DB Ø Output: a candidate repair DB’ such that DB’ satisfies , and cost(DB’, DB) is maximal How to define? repairing Dependencies Accuracy Model The most challenging part of data cleaning 17

Example: networking service provider ü Assume the billing and maintenance departments have separate databases. – Internally consistent, – yet containing errors. ü Goal: reconcile and improve data quality of, for example, integrated customer and billing data. Cust Equip Maintenance Equip Cust. Sites Devices Billing 18

Service provider example, continued. cust t 0 t 1 t 2 t 3 t 4 phno 949 -1212 555 -8145 name Alice Smith Bob Jones street 17 bridge 5 valley rd city midville centre state az ny zip 05211 10012 555 -8195 212 -6040 949 -1212 Bob Jones Carol Blake Ali Stith 5 valley rd 9 mountain 27 bridge centre davis midville nj ca az 10012 07912 05211 equip phno serno eqmfct t 5 949 -1212 AC 13006 AC t 6 555 -8145 L 55001 LU t 7 555 -8195 L 55011 LU t 8 555 -8195 AC 22350 AC t 9 949 -2212 L 32400 LU Billing Dept. eqmodel instdate XE 5000 Mar-02 ze 400 Jan-03 ze 400 Mar-03 XE 5000 Feb-99 ze 300 Oct-01 19 Maintenance Dept.

Constraints and Violations (1) ü Consider inclusion and functional dependencies, (so each x is a violation of one or the other). x xx x ü A functional dependency (FD) says that the values of some fields determine the values of others. t 1 and t 2 violate cust[zip] -> cust[state] cust[phno] -> cust[name, street, city, state, zip] cust t 0 t 1 t 2 t 3 t 4 phno 949 -1212 555 -8145 555 -8195 212 -6040 949 -1212 name Alice Smith Bob Jones Carol Blake Ali Stith street 17 bridge 5 valley rd 9 mountain 27 bridge city midville centre davis midville state az ny nj ca az zip 05211 10012 07912 05211 20

Constraints and Violations (2) An inclusion dependency (IND) says that the values of some fields should appear in some others. t 9 violates equip[phno] cust[phno] equip phno t 9 949 -2212 serno L 32400 eqmfct LU eqmodel ze 300 instdate Oct-01 cust phno t 0 949 -1212 t 1 555 -8145 t 2 555 -8195 t 3 212 -6040 t 4 949 -1212 name Alice Smith Bob Jones Carol Blake Ali Stith street 17 bridge 5 valley rd 9 mountain 27 bridge city midville centre davis midville state az ny nj ca az zip 05211 10012 07912 05211 21

Constraint Repair ü Find a “repair” to suggest ü A repair is a database which does not violate constraints – A good repair is also similar to the original database Constraint 1: FD: R[A] -> R[B, C] “Bad” repair: R: A B C a 1 a 2 b 1 c 2 c 1 a 2 b 1 c 2 A B C "Good" repair: A B C a 1 a 2 b 1 c 2 c 1 a 2 b 1 c 1 22

Problem Statement Input: a relational database D, and a set C of integrity constraints (FDs, INDs) Question: find a “good” repair D’ of D ü repair: D’ satisfies C ü “good”: D’ is “close” to the original data in D – changes are minimal: what metrics (referred to as cost) should we use? – changes: value modification, tuple insertion – to avoid loss of information (tuple deletion can be expressed via modification) We want to compute D’ efficiently, as suggested fix to the users 23

Repair Model ü For the duration of constraint repair, each input tuple is uniquely identified, t 1, t 2, … ü This value of attribute A of tuple t in the input database is D(t, A) ü The output of the algorithm is a repaired database, D', so D'(t, A) is the value of t. A in the current proposed repair. ü For example, D = D' below, except D(t 3, C) <> D'(t 3, C). D: A t 1 t 2 t 3 D': B C a 1 a 2 b 1 c 2 c 1 a 2 b 1 c 2 A t 1 t 2 t 3 B C a 1 a 2 b 1 c 2 c 1 a 2 b 1 c 1 24

Cost Model Components Distance Distant: Weight "Smith" "Jones" "a" "b" "949 -2212" "949 -1212" Close: Confidence placed by the user in the accuracy of a tuple or attribute. "1. 99" "2. 0" "GM" "General Motors" dist(v, v') : [0, 1) weight(t. A) 25

Intuition: Tuple/Attribute Weights ü Simple model of data reliability – Example: Billing department may be more reliable about address information, weight = 2, but less about equipment, weight = 1 – Example 2: Download table of zip codes from post-office web site, weight = 100 ü In the absence of other information, all weights default to 1. cust t 0 t 1 t 2 t 3 t 4 phno 949 -1212 555 -8145 555 -8195 212 -6040 949 -1212 name Alice Smith Bob Jones Carol Blake Ali Stith street 17 bridge 5 valley rd 9 mountain 27 bridge city midville centre davis midville state az ny nj ca az zip 05211 10012 07912 05211 2 2 1 1 1 26

Attribute-Level Cost Model ü If D(t. A) = v and D'(t. A) = v', then * Cost(t. A) = dist(v, v') * weight(t. A) Cost(D’): the sum of Cost(t. A) for all changed tuples t and attributes A ü Example: (if we model delete as dist(x, null)) R: A Repair: cost = 9 B C A B C a 1 a 2 b 1 c 2 c 1 +1 +1 +1 a 2 b 1 c 2 +1 +1 +1 Repair: cost = 1/2 A B C a 1 a 2 b 1 c 2 c 1 a 2 b 1 c 1 +1/2 27

Problem and complexity Input: a relational database D, and a set C of integrity constraints Question: find a repair D’ of D such that cost(D’) is minimal Complexity: Finding an optimal repair is NP-complete in size of database ü Intractable even for a small, constant number of FDs alone. ü Intractable even for a small, constant number of INDs alone ü By contrast, in delete-only model, repair with either INDs or FDs alone is in PTime, while repair with both is Co. NP hard Data complexity What should we do? 28

Heuristic approach to value-based repair ü In light of intractability, we turn to heuristic approaches. However, most have problems. ü Straightforward constraint-by-constraint repair algorithms fail to terminate (fixing individual constraints one by one) consider R 1(A, B), R 2(B, C), with – FD: R 1[A] R 1[B] – IND: R 2[B] R 1[B] – D(R 1): {(1, 2), (1, 3)}, D(R 2) = {(2, 1), (3, 4)} ü Also, user must be involved since any single decision can be wrong. ü One approach: Equivalence-Class-Based Repair. 29

Equivalence Classes cust t 0 t 1 t 2 t 3 t 4 phno 949 -1212 555 -8145 555 -8195 212 -6040 949 -1212 name Alice Smith Bob Jones Carol Blake Ali Stith street 17 bridge 5 valley rd 9 mountain 27 bridge city midville centre davis midville state az ny nj ca az zip 05211 10012 07912 05211 eq 1 ü An equivalence class is a set of "cells" defined by a tuple t and an attribute A. ü An equivalence class eq has a target value targ(eq) drawn from eq -- to be assigned after all the equivalence classes are found ü For example, targ(EQ 1) = "Alice Smith" or "Ali Stith" 30

Equivalence Classes Continued ü In the repair, give each member of the equivalence class the same target value: that is for all (t, A) in eq, D'(t, A) = targ(eq) ü Target value is chosen from the set of values associated with eq in the input: {D(ti, Ai)} ü A given eq class has an easily computed cost w. r. t. a particular target value. For example, if weights are all 1, we might have, – Cost({"A", "B"}, "A") = dist("B", "A") = 1 – Cost({"A", "B"}, "B") = 2*dist("A", "B") = 2 Separate ü The decision of which attribute values need to be equivalent ü The decision of exactly what value targ(eq) should be assigned 31

Resolving FD violations ü To resolve a tuple t violating FD R[A]->R[B], violations, we compute the set of tuples V from R that match t on A, and union equivalence classes of the attributes in B among all tuples in V – changing RHS – Changing the left-hand side to an existing value may not resolve the Why? violation – Changing left-hand side to a new value is arbitrary – loss of e. g. , keys FD: cust[phno] -> cust[name, street, city, state, zip] t 0 t 1 t 2 t 3 t 4 phno 949 -1212 555 -8145 555 -8195 212 -6040 949 -1212 name Alice Smith Bob Jones Carol Blake Ali Stith EQ 1 street 17 bridge 5 valley rd 9 mountain 27 bridge EQ 2 city midville centre davis midville . . . state az ny nj ca az zip 05211 10012 07912 05211 32

Resolving IND violations ü To resolve a tuple t violating IND R[A] S[B], we pick a tuple s from S and union equivalence classes between t. A and s. B for attributes A in A and B in B. equip[phno] cust[phno] equip EQ 1 phno t 9 949 -2212 serno L 32400 eqmfct LU eqmodel ze 300 instdate Oct-01 cust t 0 t 3 t 4 phno 949 -1212 212 -6040 949 -1212 name Alice Smith. . . Carol Blake Ali Stith street 17 bridge city midville state az zip 05211 9 mountain 27 bridge davis midville ca az 07912 05211 How to repair data using CFDs? CINDs? 33

TDD: Research Topics in Distributed Databases Data cleaning ü Discovering data quality rules ü Error detection ü Data repairing ü Record matching and its interaction with data repairing Chapter 6, Foundations of Data Quality Management ü Certain fixes 34

Record matching Ø Input: large, unreliable data sources Ø Output: tuples that refer to the same real-world entity Record matching Matching dependencies matching repairing matching Read: Interaction between Record Matching and Data Repairing 35

Error detection and data enrichment via matching 1. card[LN, address] = trans[LN, post] card[FN] trans[FN] card[X] trans[Y] 2. card[LN, tel] = trans[LN, phn] card[FN] trans[FN] card[X] trans[Y] 3. card[tel] = trans[phn] card[address] trans[post] FN Mark LN address tel Smith 10 Oak St, EDI, EH 8 9 LE FN LN M. Smith … … Max Smith 1 DOB gender 3256777 10/27/97 M post phn when 10 Oak St, EDI, EH 8 9 LE null 1 pm/7/7/09 Match … PO Box 25, EDI 2 … enrich … 3256777 2 pm/7/7/09 inconsistent where amount EDI $3, 500 … … NYC $6, 300 There is interaction between data repairing and record matching 36

Unifying repairing and matching Input data FN LN St City AC post phn item (tran) Bob Robert Brady 5 Wren St Edi 020 WC 1 H 9 SE 3887834 watch Ldn 3887644 Robert Brady Null 5 Wren St Ldn FN LN St 020 WC 1 E 7 HX 3887644 necklace City matching Robert Brady 5 Wren St Ldn AC Zip tel 020 WC 1 H 9 SE 3887644 Master data (card) repairing 1. tran([AC = 020] -> [city = Ldn]) repairing helps matching helps repairing 2. tran([FN = Bob] -> [FN = Robert]) 3. tran[LN, city, St, post] = card[LN, city, St, zip] ˄ tran[FN] ≈ card[FN] -> tran[FN, phn] card[FN, tel] 4. tran([city, phn] -> [St, AC, post]) Repairing and matching operations should be interleaved 37

TDD: Research Topics in Distributed Databases Data cleaning ü Data cleaning: An overview ü Error detection ü Data repairing ü Record matching and its interaction with data repairing ü Certain fixes 38

How to fix the errors detected? Heuristic methods may not ﬁx the 1. [AC=020] →[city=Ldn] 2. [AC=131] →[city=Edi] erroneous t[AC], and worse still, may mess up the correct attribute t[city] FN LN AC phn type str city zip item Bob Brady 020 079172485 2 2 501 Elm St. Ldn EH 7 4 AH CD Bob Brady 020 079172485 020 501 Elm St. Edi Edi EH 7 4 AH CD Bob Brady 131 020 079172485 2 501 Elm St. Edi EH 7 4 AH CD Dependencies can detect errors, but do not tell us how to fix them 39

The quest for certain fixes ü Certain fixes: 100% correct. The need for this is evident when repairing critical data – Every update guarantees to fix an error; – The repairing process does not introduce new errors. ü Editing rules instead of dependencies – Editing rules tell us which attributes to change and how to change them; dynamic semantics Dependencies have static semantics: violation or not – Editing rules are defined on a master relation and an input relation – correcting errors with master data values Dependencies are only defined on input relations Editing rules are a departure from data dependencies 40

1 – home phone Editing rules and master data 2 – mobile phone certain type=2 FN t 1 LN AC phn type str city zip item Robert Bob Brady 131 020 079172485 2 501 Elm Row 501 Elm St. Edi EH 7 4 AH CD Input relation R FN s 1 Robert s 2 Mark LN AC Hphn Mphn str city Brady 131 6884563 079172485 501 Elm Row Edi zip DOB gender EH 7 4 AH 11/11/55 M Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE 25/12/67 M Master relation Dm Editing rules: Master data is a single repository of high-quality data that φ1: ((zip, zip) → (AC, str, city), tp 1 = ( )) provides various applications with a synchronized, Certain Mphn) → (FN, LN), tp 2[type] = (2)) φ2: ((phn, regions: validated either by users or inference consistent view of its core business entities. Repairing: interact with users Applying editing rules does not introduce new errors 41

How do we find certain fixes? …… …… t Data Monitoring input stream far less costly to correct a tuple at the point of data entry than fixing it afterward. 42

How do we find certain fixes? …… Data Monitoring Master Data FN …… t LN AC Hphn Mphn str city Robert Brady 131 6884563 079172485 501 Elm Row Edi Mark zip DOB gender EH 7 4 AH 11/11/55 M Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE 25/12/67 M Master relation Dm 43

How do we find certain fixes? …… t Master Data …… Data Monitoring Editing Rules Σ 44

1 Editing rules – home phone 2 – mobile phone certain type=2 FN t 1 LN AC phn type str city zip item Robert Bob Brady 131 020 079172485 2 501 Elm St. 501 Elm Row Edi EH 7 4 AH CD Input relation R FN LN AC Hphn Mphn str city zip DOB s 1 Robert Brady 131 6884563 079172485 501 Elm Row Edi s 2 gender Smiths 020 6884563 075568485 20 Baker St. Ldn NW 1 6 XE 25/12/67 Mark EH 7 4 AH 11/11/55 M M • φ1: ((zip, zip) → (AC, str, city), tp 1 = ( )) Master relation • φ2: ((phn, Mphn) → Dm LN), tp 2[type] = (2)) (FN, Read: Towards certain fixes with editing rules and master data Certain fixes: editing rules, master data, certain region 45

Summary and review ü What is data cleaning? Why bother? ü What are the main approaches to cleaning data? Pros and cons? ü Given a database D and a set C of FDs and INDs, can you always find a repair D’ of D? ü What is the complexity of detecting errors in distributed data? ü What is a certain fix? Why ü What is the complexity for constraint-based data cleaning? ü How to repair the data when the problem is NP-hard 46

Projects (1) Develop an algorithm for error detection, when S consists of conditional functional dependencies, and when D is a relation that is either • vertically partitioned; or • horizontally partitioned; and is distributed. ü Prove the correctness of your algorithms, give their complexity analyses, and show that they are scalable with both D and S ü Develop optimization techniques to minimize data shipment ü Experimentally evaluate your algorithms, by randomly generating dependencies in S and large datasets D ü Development projects 47

Projects (2) Develop an incremental algorithm for error detection, in the same setting as projects (2) described earlier. For incremental error detection, read ü W. Fan, J. LI, N. Tang, and W. Yu. Incremental detection of inconsistencies in distributed data, TKDE 26(6). 2014 Prove the correctness of your algorithms, give their complexity analyses, and show that they are scalable with both D and S ü Develop optimization techniques to minimize data shipment ü Experimentally evaluate your algorithms, by randomly generating dependencies in S and large datasets D ü Development projects 48

Projects (3) Write a survey on any of the following topics: • Deducing true values of entities. Data cleaning systems beyond ETL. • Conflict resolution. Pick one of these • XML data cleaning. Identify 5 -6 representative techniques on the topic you pick Develop a set of criteria for evaluating techniques in the line of research you pick. Critically evaluate each of these techniques based on your criteria Propose possible extensions to the techniques as data for big data. • ü ü Survey projects 49