1 Introduction

In the framework of daap, git enables joint versioning of data and code. Git also provides deep coverage for tracking changes in code. When it comes to data, however, our ability to track changes is limited to a very high, metadata level. While git can track changes in text, text-based change tracking is not suitable for tracking changes in data. This leaves a gap in our ability to diff data.

3 Method:

The illustration below shows a possible stepwise flow for implementing diff functionality.

[Figure: ddiff approach]

The output of a diff call generated by this or a similar workflow combines a data-validation-type comparison with an information-content-type comparison to provide a more informative, interpretable, and ultimately actionable summary of the differences.

4 Results (preliminary)

There are two components in the outlined workflow:

  1. Form or base content diff
  2. Information content diff

The form or base content diff is mechanical and quite straightforward to implement. In contrast, the information content diff ventures into the realm of unsupervised learning and is hence more complex.

4.1 Form and base content diff

Here, I implemented the data-validation-type diff and added a minimal venture into the information content diff, using a simple PCA approach for demonstration purposes.

First, we simulate some mock data. This lets us test a few simple conditions:

  1. When the data tables being compared are identical
  2. When the order of columns and rows is changed
  3. When attributes are the same but content changes
  4. When attributes change but colnames are the same
  5. When attributes change and colnames are different too
  6. When new records are added to a table

d <- d_test()
data.tree::FromListSimple(d)
##               levelName
## 1  Root                
## 2   ¦--order           
## 3   ¦   ¦--old         
## 4   ¦   °--new         
## 5   ¦--attrs           
## 6   ¦   ¦--same        
## 7   ¦   ¦   ¦--old     
## 8   ¦   ¦   °--new     
## 9   ¦   °--different   
## 10  ¦       ¦--col_same
## 11  ¦       ¦   ¦--old 
## 12  ¦       ¦   °--new 
## 13  ¦       °--col_diff
## 14  ¦           ¦--old 
## 15  ¦           °--new 
## 16  ¦--identical       
## 17  ¦   ¦--old         
## 18  ¦   °--new         
## 19  °--records_added   
## 20      ¦--old         
## 21      °--new
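
The source of `d_test()` is not shown here. As a rough sketch only, a few of the conditions listed above could be simulated along the following lines. The helper `make_mock()` and the list `mock` are hypothetical names introduced for illustration; they are not part of ddiff, and the distributions are assumptions that merely mirror the column layout seen in the outputs below.

```r
# Hypothetical stand-in for d_test(): NOT the function used in this report.
# It only mimics the column layout (nm_1..nm_3, cat_1, bin_1, id).
make_mock <- function(n = 20, seed = 1) {
  set.seed(seed)
  data.frame(
    nm_1  = rnorm(n),
    nm_2  = rnorm(n),
    nm_3  = rnorm(n),
    cat_1 = sample(c("a", "b", "c"), n, replace = TRUE),
    bin_1 = sample(c("Y", "N"), n, replace = TRUE),
    id    = paste0("id_", seq_len(n)),
    stringsAsFactors = FALSE
  )
}

old <- make_mock()

mock <- list(
  # Condition 1: identical tables
  identical = list(old = old, new = old),
  # Condition 2: rows and columns shuffled, values untouched
  order = list(old = old,
               new = old[sample(nrow(old)), sample(ncol(old))]),
  # Condition 6: 10 extra records whose nm_1 is shifted by +5 (assumed shift)
  records_added = list(
    old = old,
    new = rbind(old,
                transform(make_mock(10, seed = 2),
                          nm_1 = nm_1 + 5,
                          id   = paste0("id_", 21:30)))
  )
)

str(mock, max.level = 2)
```

Keeping each condition as an `old`/`new` pair in a nested list matches the tree printed above and makes each scenario addressable as `mock$<condition>$old` and `mock$<condition>$new`.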

4.1.1 Identical

ddiff_rpt <- ddiff(d_new = d$identical$new, d_old = d$identical$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> TRUE
## $ d_match         <lgl> TRUE
## $ attrnames_match <lgl> TRUE
## $ equivalent      <lgl> TRUE
## $ identical       <lgl> TRUE
## $ message         <chr> "Data tables are identical"

message: Data tables are identical

4.1.2 Change in order

Here, the row and column orders are simply changed at random. The resulting tables, while not identical, are equivalent from a data perspective.

d$order$old %>%
  head() %>%
  knitr::kable(.)
nm_1        nm_2        nm_3        cat_1  bin_1  id
-1.1894537  1.4234235   0.1872768   c      Y      id_1
0.3885812   -1.0426581  1.9137194   b      Y      id_2
-0.3443333  0.0011692   -0.6226594  c      Y      id_3
-0.5478961  1.0904552   -1.0641839  c      N      id_4
0.9806622   -0.9987152  -0.3422707  b      N      id_5
-0.2366460  0.5348300   -0.1013222  b      N      id_6
d$order$new %>%
  head() %>%
  knitr::kable(.)
    nm_2        cat_1  bin_1  id     nm_3        nm_1
20  -0.9688690  c      Y      id_20  0.2379574   -0.7002809
10  0.2551949   a      N      id_10  0.1783597   -0.1830838
16  -1.8266932  b      N      id_16  -0.2162799  -0.2605846
11  1.5047414   c      Y      id_11  1.7036697   0.5186300
19  -0.5903992  c      N      id_19  -1.1511721  0.5042705
18  -2.3685212  a      N      id_18  0.7847898   1.4065584
ddiff_rpt <- ddiff(d_new = d$order$new, d_old = d$order$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> TRUE
## $ d_match         <lgl> TRUE
## $ attrnames_match <lgl> TRUE
## $ equivalent      <lgl> TRUE
## $ identical       <lgl> FALSE
## $ message         <chr> "Data table contents and attr names match: differences…

message: Data table contents and attr names match: differences are limited to row or col orders or other attribute values
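
How ddiff decides that two tables are equivalent is not shown in this report. One possible way to sketch the idea is to bring both tables into a canonical row and column order and then compare them. The helpers `normalize_order()` and `is_equivalent()` below are hypothetical and use base R only; they are an illustration, not ddiff's implementation.

```r
# Hypothetical helper: sort columns alphabetically and rows by all columns,
# so that layout differences no longer affect the comparison.
normalize_order <- function(d) {
  d <- d[, sort(names(d)), drop = FALSE]
  # unname() prevents column names from matching order()'s own arguments
  d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
  rownames(d) <- NULL
  d
}

# Hypothetical helper: TRUE when contents match up to row/column order
is_equivalent <- function(d_new, d_old) {
  setequal(names(d_new), names(d_old)) &&
    nrow(d_new) == nrow(d_old) &&
    isTRUE(all.equal(normalize_order(d_new), normalize_order(d_old)))
}

old <- data.frame(id    = c("id_1", "id_2", "id_3"),
                  nm_1  = c(0.5, -1.2, 2.0),
                  cat_1 = c("a", "b", "a"))
new <- old[c(3, 1, 2), c("cat_1", "id", "nm_1")]  # shuffled layout

identical(new, old)      # FALSE: the layout differs
is_equivalent(new, old)  # TRUE: same content once the order is normalized
```

Sorting by every column is a simple choice that needs no key column; with a known unique key such as `id`, sorting by that key alone would be cheaper.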

4.1.3 Attributes the same but content change

Here, all table attributes are the same; only the row content has changed.

ddiff_rpt <- ddiff(d_new = d$attrs$same$new, d_old = d$attrs$same$old)
## Warning in compareDF::compare_df(df_new = d_new0, df_old = d_old0): Missing
## grouping columns. Adding rownames to use as the default
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> TRUE
## $ d_match         <lgl> FALSE
## $ attrnames_match <lgl> TRUE
## $ equivalent      <lgl> FALSE
## $ identical       <lgl> FALSE
## $ message         <chr> "See comp_obj for detail comparison\n"
compareDF::create_output_table(ddiff_rpt$comp_obj)
rowname  chng_type  nm_1    nm_2    nm_3    cat_1  bin_1  id
1        +          2.458   1.236   0.141   b      N      id_01
1        -          -1.189  1.423   0.187   c      Y      id_1

message: See comp_obj for detail comparison

4.1.4 Attributes different colnames same

Here, a single attribute is added, but otherwise the contents are the same.

ddiff_rpt <- ddiff(d_new = d$attrs$different$col_same$new, d_old = d$attrs$different$col_same$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> TRUE
## $ d_match         <lgl> TRUE
## $ attrnames_match <lgl> FALSE
## $ equivalent      <lgl> FALSE
## $ identical       <lgl> FALSE
## $ message         <chr> "Data table contents match but attr names may differ\n…

message: Data table contents match but attr names may differ Attributes added key_1
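
Assuming that "attribute" here refers to R object attributes as returned by `attributes()` (an assumption, since the ddiff source is not shown), the added and removed attribute names could be listed with a small helper such as the hypothetical `attr_names_diff()` below. The attribute value `"id"` is made up for the example.

```r
old <- data.frame(id = c("id_1", "id_2"), nm_1 = c(0.1, 0.2))
new <- old
attr(new, "key_1") <- "id"  # illustrative attribute; the value is invented

# Hypothetical helper: compare attribute names of two objects
attr_names_diff <- function(d_new, d_old) {
  nm_new <- names(attributes(d_new))
  nm_old <- names(attributes(d_old))
  list(added = setdiff(nm_new, nm_old), removed = setdiff(nm_old, nm_new))
}

res <- attr_names_diff(new, old)
res$added    # "key_1"
res$removed  # character(0)
```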

4.1.5 Attributes different colnames different

Here, a new column is added. Because comparability assumes that we are comparing the same set of features, ddiff alerts the developer to the issue, leaving it to her to take the right course of action.

ddiff_rpt <- ddiff(d_new = d$attrs$different$col_diff$new, d_old = d$attrs$different$col_diff$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> FALSE
## $ d_match         <lgl> FALSE
## $ attrnames_match <lgl> FALSE
## $ equivalent      <lgl> FALSE
## $ identical       <lgl> FALSE
## $ message         <chr> "Colnames are different. Ensure d_new and d_old being …

message: Colnames are different. Ensure d_new and d_old being compared have the same colnames
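
A comparability check of this kind can be sketched with set operations on the column names. The function `check_comparable()` below is a hypothetical illustration in base R, not the check that ddiff actually runs; the column `nm_4` is invented for the example.

```r
# Hypothetical helper: two tables are treated as comparable only when
# they have exactly the same set of column names (order is ignored).
check_comparable <- function(d_new, d_old) {
  only_new <- setdiff(colnames(d_new), colnames(d_old))
  only_old <- setdiff(colnames(d_old), colnames(d_new))
  list(
    comparable  = length(only_new) == 0 && length(only_old) == 0,
    only_in_new = only_new,
    only_in_old = only_old
  )
}

old <- data.frame(id = "id_1", nm_1 = 0.1)
new <- data.frame(id = "id_1", nm_1 = 0.1, nm_4 = 3)

res <- check_comparable(new, old)
res$comparable   # FALSE
res$only_in_new  # "nm_4"
```

Returning the offending column names alongside the flag gives the developer something concrete to act on.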

4.1.6 New records added

Here, 20 new records are added. All other aspects of the data table attributes are the same. compareDF does provide a git-like comparison, showing the added records. However, as the number of added rows and columns grows, the value of a visual +/- display becomes minimal. Additionally, trends cannot be spotted. Can you see a difference between rows 21-30 and rows 31-40?

ddiff_rpt <- ddiff(d_new = d$records_added$new, d_old = d$records_added$old)
## Warning in compareDF::compare_df(df_new = d_new0, df_old = d_old0): Missing
## grouping columns. Adding rownames to use as the default
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable      <lgl> TRUE
## $ d_match         <lgl> FALSE
## $ attrnames_match <lgl> TRUE
## $ equivalent      <lgl> FALSE
## $ identical       <lgl> FALSE
## $ message         <chr> "See comp_obj for detail comparison\n"
compareDF::create_output_table(ddiff_rpt$comp_obj)
rowname  chng_type  nm_1    nm_2    nm_3    cat_1  bin_1  id
21       +          0.615   -0.858  -0.156  a      Y      id_21
22       +          -0.6    0.908   -0.877  c      Y      id_22
23       +          -1.369  1.423   -0.34   a      Y      id_23
24       +          -0.195  -1.487  -0.711  b      Y      id_24
25       +          1.459   -0.191  0.799   b      Y      id_25
26       +          -0.983  -0.939  -2.284  c      N      id_26
27       +          -0.779  0.86    -0.1    c      N      id_27
28       +          1.776   -1.516  0.711   b      N      id_28
29       +          -1.266  0.788   0.014   a      N      id_29
30       +          1.302   0.071   -0.721  c      N      id_30
31       +          5.342   3.155   4.448   b      Y      id_31
32       +          5.672   5.122   6.219   a      Y      id_32
33       +          5.671   5.951   5.21    b      N      id_33
34       +          2.906   5.784   4.47    a      Y      id_34
35       +          6.542   6.426   4.623   b      Y      id_35
36       +          4.881   4.425   6.066   a      Y      id_36
37       +          3.891   2.062   4.676   b      Y      id_37
38       +          4.703   5.766   6.112   a      N      id_38
39       +          3.049   4.464   4.925   b      N      id_39
40       +          4.229   6.971   3.687   b      N      id_40

message: See comp_obj for detail comparison

4.2 Information content diff

4.2.1 Example of info content type diff

Here we look at the same 20 added records using a simple PCA.

library(ggplot2)

num_cols <- c("nm_1", "nm_2", "nm_3")

# Fit the PCA on the old records only (centered and scaled)
pc_old <- prcomp(x = d$records_added$old[, num_cols], retx = TRUE,
                 center = TRUE, scale. = TRUE)

# Standardize the 20 new records with the old data's center and scale
newrecords_sc <- scale(d$records_added$new[-(1:20), num_cols],
                       center = pc_old$center, scale = pc_old$scale)

# Project the new records onto the old principal components
new_records_hat <- newrecords_sc %*% pc_old$rotation
all_d <- rbind(
  data.frame(source = "old", pc_old$x),
  data.frame(source = "new", new_records_hat)
)

# Old and new records in the space of the first two old PCs
ggplot(all_d, aes(x = PC1, y = PC2)) +
  geom_point(aes(colour = source)) +
  ggsci::scale_color_jama() +
  geom_density_2d() +
  theme_bw() +
  ggtitle("New samples projected on old PC 1,2 space")

5 Further development

5.1 Univariate information content diff

A univariate approach can be considered the “lowest-hanging fruit” in implementing an information diff. For binary and categorical variables, one could test conformity to the existing categories. For continuous variables, both rank-based and parametric approaches can easily be leveraged.
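
As a minimal sketch of these univariate checks, the base R example below flags categories that did not exist in the old data and compares a continuous variable with one rank-based test (Wilcoxon rank-sum) and one parametric test (Welch t-test). The simulated data, the extra category `"d"`, and the mean shift of 5 are assumptions chosen for illustration.

```r
set.seed(42)
old <- data.frame(nm_1  = rnorm(50),
                  cat_1 = sample(c("a", "b", "c"), 50, replace = TRUE))
new <- data.frame(nm_1  = rnorm(50, mean = 5),  # assumed shift
                  cat_1 = sample(c("a", "b", "d"), 50, replace = TRUE))

# Categorical: values in the new data that the old data never contained
unseen <- setdiff(unique(new$cat_1), unique(old$cat_1))

# Continuous: rank-based and parametric two-sample comparisons
wt <- wilcox.test(new$nm_1, old$nm_1)  # Wilcoxon rank-sum test
tt <- t.test(new$nm_1, old$nm_1)       # Welch two-sample t-test

unseen
c(wilcoxon = wt$p.value, welch_t = tt$p.value)
```

Small p-values here indicate that the new values of `nm_1` are unlikely to come from the same distribution as the old ones; in practice, a correction for testing many columns would also be needed.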

5.2 Multivariate information content diff

A multivariate approach to information content comparison has the most potential for enabling data developers to make actionable comparisons between data sets. However, that potential may not be easily or fully realized given the degree of complexity involved. Multivariate approaches can be divided into two groups: those for plain continuous data, where the generative model can be reasonably approximated by a multivariate distribution (i.e. data that are not censored and not repeated measures such as time series), and those for everything else.

5.2.1 Multivariate diff for plain continuous data

Approaches such as matrix factorization or distance-based anomaly detection could be leveraged. An example would be the MSD (Modified Stahel-Donoho) estimator [1].
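
To illustrate distance-based anomaly detection in its simplest form, the sketch below scores new records by their squared Mahalanobis distance from the old data. This is deliberately not the MSD estimator: it uses the classical (non-robust) mean and covariance, and the chi-squared cutoff assumes roughly multivariate normal old data. The simulated data and the 0.975 quantile are assumptions for illustration.

```r
set.seed(1)
cols <- c("nm_1", "nm_2", "nm_3")
old <- matrix(rnorm(200 * 3), ncol = 3, dimnames = list(NULL, cols))
new <- matrix(rnorm(20 * 3, mean = 5), ncol = 3, dimnames = list(NULL, cols))

# Location and scatter estimated from the old data only
center <- colMeans(old)
covmat <- cov(old)

# Squared Mahalanobis distance of each new record from the old data
d2_new <- mahalanobis(new, center, covmat)

# Under approximate normality, d2 follows a chi-squared distribution
cutoff  <- qchisq(0.975, df = ncol(old))
flagged <- d2_new > cutoff

mean(flagged)  # share of new records flagged as unlike the old data
```

A robust estimator such as MSD would replace `colMeans()` and `cov()` so that outliers already present in the old data do not distort the reference distribution.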

5.2.2 Multivariate diff for anything but plain continuous data

Here, a diff is likely to require the user to supply information to the algorithm for optimal results. For example, if the time course structure is known, one may be able to project or interpolate from the old data and evaluate how likely or unlikely the new data are in comparison.
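
As one hedged illustration of the time course case, suppose the user states that the outcome follows a linear trend in time. A model fitted to the old data can then be projected forward, and new observations outside the prediction interval can be flagged. The linear form, the simulated data, and the 95% level are all assumptions made for this sketch.

```r
set.seed(7)
# Old data: assumed linear trend in time plus noise
old <- data.frame(t = 1:30)
old$y <- 2 + 0.5 * old$t + rnorm(30, sd = 0.5)

# New data: the fourth value is deliberately far off the trend
new <- data.frame(t = 31:35, y = c(17.4, 18.1, 18.6, 25.0, 19.4))

# Fit on the old data, then project to the new time points
fit  <- lm(y ~ t, data = old)
pred <- predict(fit, newdata = new, interval = "prediction", level = 0.95)

# Flag new observations that fall outside the 95% prediction interval
new$outside <- new$y < pred[, "lwr"] | new$y > pred[, "upr"]
new
```

The structural knowledge (here, linearity in `t`) comes from the user; without it, the algorithm has no basis for deciding whether a new value continues the old pattern.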

6 References:

[1] Wada, Kazumi, and Hiroe Tsubaki. “Parallel computation of modified Stahel-Donoho estimators for multivariate outlier detection.” 2013 International Conference on Cloud Computing and Big Data. IEEE, 2013.