In the framework of daap, git enables joint versioning of data and code. Additionally, git provides deep coverage for tracking changes in code. However, when it comes to data, our ability to track change is limited to a very high (metadata) level. While git can track changes in text, text change tracking is not suitable for tracking data changes. This leaves a gap in our ability to diff.
The illustration below shows a possible step-wise flow for implementing a diff functionality. The output of a diff call generated by this or a similar workflow combines a data-validation type comparison with an information-content type comparison to provide a more informative, interpretable, and ultimately actionable summary of the differences.
There are two components in the outlined workflow: a form (or base) content diff and an information content diff.
The form or base content diff is mechanical and quite straightforward to implement. In contrast, the information content diff ventures into the realm of unsupervised learning and is hence more complex.
Here, I implemented the data validation type diff and added a minimalist venture into the info content diff using a simple PCA approach for demonstration purposes.
First we simulate some mock data. This is to help us test a few different simple conditions:
d <- d_test()
data.tree::FromListSimple(d)
##                  levelName
## 1  Root
## 2   ¦--order
## 3   ¦   ¦--old
## 4   ¦   °--new
## 5   ¦--attrs
## 6   ¦   ¦--same
## 7   ¦   ¦   ¦--old
## 8   ¦   ¦   °--new
## 9   ¦   °--different
## 10  ¦       ¦--col_same
## 11  ¦       ¦   ¦--old
## 12  ¦       ¦   °--new
## 13  ¦       °--col_diff
## 14  ¦           ¦--old
## 15  ¦           °--new
## 16  ¦--identical
## 17  ¦   ¦--old
## 18  ¦   °--new
## 19  °--records_added
## 20      ¦--old
## 21      °--new
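The internals of d_test() are not shown here. As a rough illustration only, one branch of such a list (the records_added pair) could be simulated along these lines. The helper `make_records()` is hypothetical, and the shift of 5 for the last ten records is an assumption read off the values printed later in this post.

```r
set.seed(123)

# Hypothetical stand-in for one branch of d_test(): build records with three
# numeric columns, one categorical column, one binary column and an id
make_records <- function(ids, shift = 0) {
  n <- length(ids)
  data.frame(
    nm_1  = rnorm(n, mean = shift),
    nm_2  = rnorm(n, mean = shift),
    nm_3  = rnorm(n, mean = shift),
    cat_1 = sample(c("a", "b", "c"), n, replace = TRUE),
    bin_1 = sample(c("Y", "N"), n, replace = TRUE),
    id    = paste0("id_", ids)
  )
}

old <- make_records(1:20)
# "new" keeps the old records and appends 20 more; the last 10 are shifted
new <- rbind(old, make_records(21:30), make_records(31:40, shift = 5))
records_added <- list(old = old, new = new)
str(records_added, max.level = 1)
```

Keeping each condition as an old/new pair in a nested list mirrors the tree printed above and makes every test case addressable as d$<condition>$old and d$<condition>$new.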
ddiff_rpt <- ddiff(d_new = d$identical$new, d_old = d$identical$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> TRUE
## $ d_match <lgl> TRUE
## $ attrnames_match <lgl> TRUE
## $ equivalent <lgl> TRUE
## $ identical <lgl> TRUE
## $ message <chr> "Data tables are identical"
message: Data tables are identical
Here, only the row and column orders are randomly changed. The resulting tables, while not identical, are equivalent from a data perspective.
d$order$old %>%
  head() %>%
  knitr::kable(.)
nm_1 | nm_2 | nm_3 | cat_1 | bin_1 | id |
---|---|---|---|---|---|
-1.1894537 | 1.4234235 | 0.1872768 | c | Y | id_1 |
0.3885812 | -1.0426581 | 1.9137194 | b | Y | id_2 |
-0.3443333 | 0.0011692 | -0.6226594 | c | Y | id_3 |
-0.5478961 | 1.0904552 | -1.0641839 | c | N | id_4 |
0.9806622 | -0.9987152 | -0.3422707 | b | N | id_5 |
-0.2366460 | 0.5348300 | -0.1013222 | b | N | id_6 |
d$order$new %>%
  head() %>%
  knitr::kable(.)
 | nm_2 | cat_1 | bin_1 | id | nm_3 | nm_1 |
---|---|---|---|---|---|---|
20 | -0.9688690 | c | Y | id_20 | 0.2379574 | -0.7002809 |
10 | 0.2551949 | a | N | id_10 | 0.1783597 | -0.1830838 |
16 | -1.8266932 | b | N | id_16 | -0.2162799 | -0.2605846 |
11 | 1.5047414 | c | Y | id_11 | 1.7036697 | 0.5186300 |
19 | -0.5903992 | c | N | id_19 | -1.1511721 | 0.5042705 |
18 | -2.3685212 | a | N | id_18 | 0.7847898 | 1.4065584 |
ddiff_rpt <- ddiff(d_new = d$order$new, d_old = d$order$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> TRUE
## $ d_match <lgl> TRUE
## $ attrnames_match <lgl> TRUE
## $ equivalent <lgl> TRUE
## $ identical <lgl> FALSE
## $ message <chr> "Data table contents and attr names match: differences…
message: Data table contents and attr names match: differences are limited to row or col orders or other attribute values
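The "equivalent but not identical" outcome can be pictured as a check that ignores row and column order. The sketch below is not the ddiff implementation; `is_equivalent()` and the toy tables are hypothetical, and it assumes that an `id` column uniquely identifies each record.

```r
# Hypothetical helper: TRUE when two data frames hold the same records
# once column order and row order are ignored. Assumes a unique key column.
is_equivalent <- function(d_new, d_old, key = "id") {
  # Tables with different column sets or row counts cannot be equivalent
  if (!setequal(names(d_new), names(d_old))) return(FALSE)
  if (nrow(d_new) != nrow(d_old)) return(FALSE)
  cols <- sort(names(d_old))
  # Put a table into a canonical column order and row order
  canon <- function(d) {
    d <- d[order(d[[key]]), cols, drop = FALSE]
    rownames(d) <- NULL
    d
  }
  isTRUE(all.equal(canon(d_new), canon(d_old)))
}

old <- data.frame(
  id    = paste0("id_", 1:5),
  nm_1  = c(-1.19, 0.39, -0.34, -0.55, 0.98),
  cat_1 = c("c", "b", "c", "c", "b")
)
# Reverse both the row order and the column order
new <- old[rev(seq_len(nrow(old))), rev(names(old))]

identical(new, old)      # order differs, so the tables are not identical
is_equivalent(new, old)  # same records and columns once order is ignored
```

Sorting on a key is a design choice that keeps the check cheap; without a reliable key, one would have to sort on all columns or compare row hashes instead.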
Here, all table attributes are the same; only the row content is changed.
ddiff_rpt <- ddiff(d_new = d$attrs$same$new, d_old = d$attrs$same$old)
## Warning in compareDF::compare_df(df_new = d_new0, df_old = d_old0): Missing
## grouping columns. Adding rownames to use as the default
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> TRUE
## $ d_match <lgl> FALSE
## $ attrnames_match <lgl> TRUE
## $ equivalent <lgl> FALSE
## $ identical <lgl> FALSE
## $ message <chr> "See comp_obj for detail comparison\n"
compareDF::create_output_table(ddiff_rpt$comp_obj)
rowname | chng_type | nm_1 | nm_2 | nm_3 | cat_1 | bin_1 | id |
---|---|---|---|---|---|---|---|
1 | + | 2.458 | 1.236 | 0.141 | b | N | id_01 |
1 | - | -1.189 | 1.423 | 0.187 | c | Y | id_1 |
message: See comp_obj for detail comparison
Here, there is a single attribute added, but otherwise the contents are the same.
ddiff_rpt <- ddiff(d_new = d$attrs$different$col_same$new, d_old = d$attrs$different$col_same$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> TRUE
## $ d_match <lgl> TRUE
## $ attrnames_match <lgl> FALSE
## $ equivalent <lgl> FALSE
## $ identical <lgl> FALSE
## $ message <chr> "Data table contents match but attr names may differ\n…
message: Data table contents match but attr names may differ Attributes added key_1
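To make this case concrete, here is a small base R sketch. It assumes that "attribute" in the message refers to an R object attribute set with attr(), which is my reading of the output rather than something confirmed by the ddiff source. The toy table and the attribute value are hypothetical.

```r
old <- data.frame(id = c("id_1", "id_2", "id_3"), nm_1 = c(0.1, 0.2, 0.3))
new <- old
# Hypothetical metadata attached as an R attribute named key_1
attr(new, "key_1") <- "some metadata"

# Attribute names present in new but not in old
attrs_added <- setdiff(names(attributes(new)), names(attributes(old)))
attrs_added

# The cell contents still match when attributes are ignored
contents_match <- isTRUE(all.equal(new, old, check.attributes = FALSE))
contents_match
```

Separating "contents match" from "attribute names match" is what lets the report say the tables are comparable while still flagging that they are not equivalent.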
Here, a new column is added. As comparability assumes we are comparing the same set of features, the diff alerts the developer to the issue, leaving it to her to take the right course of action.
ddiff_rpt <- ddiff(d_new = d$attrs$different$col_diff$new, d_old = d$attrs$different$col_diff$old)
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> FALSE
## $ d_match <lgl> FALSE
## $ attrnames_match <lgl> FALSE
## $ equivalent <lgl> FALSE
## $ identical <lgl> FALSE
## $ message <chr> "Colnames are different. Ensure d_new and d_old being …
message: Colnames are different. Ensure d_new and d_old being compared have the same colnames
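A comparability gate of this kind can be sketched with set operations on the column names. `compare_colnames()` and the column `new_col` below are hypothetical and only illustrate the idea; they are not taken from ddiff.

```r
# Hypothetical helper: report which column names were added or dropped
compare_colnames <- function(d_new, d_old) {
  list(
    added   = setdiff(names(d_new), names(d_old)),
    dropped = setdiff(names(d_old), names(d_new)),
    match   = setequal(names(d_new), names(d_old))
  )
}

old <- data.frame(id = 1:3, nm_1 = c(0.1, 0.2, 0.3))
new <- cbind(old, new_col = c("a", "b", "c"))

res <- compare_colnames(new, old)
res$added    # columns only in new
res$dropped  # columns only in old
res$match    # FALSE here, so a cell-level comparison would be skipped
```

Reporting the added and dropped names, rather than only a TRUE/FALSE flag, gives the developer enough detail to decide whether to drop the extra column or update the baseline.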
Here, 20 new records are added. All other aspects of the data table attributes are the same. compareDF does provide a git-like comparison, showing the added records. However, with a larger number of added rows and columns, the value of a visual +/- display is minimal. Additionally, trends can’t be spotted. Can you see a difference between rows 21-30 vs. 31-40?
ddiff_rpt <- ddiff(d_new = d$records_added$new, d_old = d$records_added$old)
## Warning in compareDF::compare_df(df_new = d_new0, df_old = d_old0): Missing
## grouping columns. Adding rownames to use as the default
ddiff_rpt[-7] %>%
  as.data.frame() %>%
  dplyr::glimpse()
## Rows: 1
## Columns: 6
## $ comparable <lgl> TRUE
## $ d_match <lgl> FALSE
## $ attrnames_match <lgl> TRUE
## $ equivalent <lgl> FALSE
## $ identical <lgl> FALSE
## $ message <chr> "See comp_obj for detail comparison\n"
compareDF::create_output_table(ddiff_rpt$comp_obj)
rowname | chng_type | nm_1 | nm_2 | nm_3 | cat_1 | bin_1 | id |
---|---|---|---|---|---|---|---|
21 | + | 0.615 | -0.858 | -0.156 | a | Y | id_21 |
22 | + | -0.6 | 0.908 | -0.877 | c | Y | id_22 |
23 | + | -1.369 | 1.423 | -0.34 | a | Y | id_23 |
24 | + | -0.195 | -1.487 | -0.711 | b | Y | id_24 |
25 | + | 1.459 | -0.191 | 0.799 | b | Y | id_25 |
26 | + | -0.983 | -0.939 | -2.284 | c | N | id_26 |
27 | + | -0.779 | 0.86 | -0.1 | c | N | id_27 |
28 | + | 1.776 | -1.516 | 0.711 | b | N | id_28 |
29 | + | -1.266 | 0.788 | 0.014 | a | N | id_29 |
30 | + | 1.302 | 0.071 | -0.721 | c | N | id_30 |
31 | + | 5.342 | 3.155 | 4.448 | b | Y | id_31 |
32 | + | 5.672 | 5.122 | 6.219 | a | Y | id_32 |
33 | + | 5.671 | 5.951 | 5.21 | b | N | id_33 |
34 | + | 2.906 | 5.784 | 4.47 | a | Y | id_34 |
35 | + | 6.542 | 6.426 | 4.623 | b | Y | id_35 |
36 | + | 4.881 | 4.425 | 6.066 | a | Y | id_36 |
37 | + | 3.891 | 2.062 | 4.676 | b | Y | id_37 |
38 | + | 4.703 | 5.766 | 6.112 | a | N | id_38 |
39 | + | 3.049 | 4.464 | 4.925 | b | N | id_39 |
40 | + | 4.229 | 6.971 | 3.687 | b | N | id_40 |
message: See comp_obj for detail comparison
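A simple numeric summary already reveals what the +/- listing hides. The sketch below copies the nm_1 values of the 20 added rows from the table above and compares the two blocks of ten; splitting at row 30 is an assumption taken from the question posed earlier, not something the diff itself would know.

```r
# nm_1 values of the added rows 21-40, copied from the table above
nm_1 <- c(0.615, -0.6, -1.369, -0.195, 1.459, -0.983, -0.779, 1.776, -1.266, 1.302,
          5.342, 5.672, 5.671, 2.906, 6.542, 4.881, 3.891, 4.703, 3.049, 4.229)
block <- rep(c("rows 21-30", "rows 31-40"), each = 10)

# Mean of nm_1 within each block of added rows
block_means <- sapply(split(nm_1, block), mean)
round(block_means, 2)
```

The first block averages close to zero while the second averages well above four, a shift that is easy to miss when scanning twenty rows by eye.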
Information content diff
Here we look at the same 20 added records using a simple PCA.
library(ggplot2)

# Fit the PCA on the old records only (centered and scaled)
pc_old <- prcomp(x = d$records_added$old[, c("nm_1", "nm_2", "nm_3")], retx = TRUE,
                 center = TRUE, scale. = TRUE)
# Scale the 20 added records with the old center and scale
newrecords_sc <- scale(d$records_added$new[-(1:20), c("nm_1", "nm_2", "nm_3")],
                       center = pc_old$center, scale = pc_old$scale)
# Project the added records onto the old principal components
new_records_hat <- newrecords_sc %*% pc_old$rotation
all_d <- rbind(
  data.frame(source = "old", pc_old$x),
  data.frame(source = "new", new_records_hat)
)
ggplot(all_d, aes(x = PC1, y = PC2)) +
  geom_point(aes(colour = source)) +
  ggsci::scale_color_jama() +
  geom_density_2d() +
  theme_bw() +
  ggtitle("New samples projected on old PC 1,2 space")
Univariate diff
A univariate approach can be considered the “lowest hanging fruit” in implementing an information diff. On the binary and categorical front, one could test conformity to the existing categories. For continuous variables, both rank-based and parametric approaches can easily be leveraged.
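As a minimal sketch of these univariate checks, using base R only: the toy data, the choice of the Wilcoxon rank-sum test as the rank-based option and Welch's t-test as the parametric option are my assumptions for illustration.

```r
set.seed(42)
old <- data.frame(
  nm_1  = rnorm(50),
  cat_1 = rep(c("a", "b", "c"), length.out = 50)
)
new <- data.frame(
  nm_1  = rnorm(50, mean = 5),                     # shifted on purpose
  cat_1 = rep(c("a", "b", "d"), length.out = 50)   # "d" is unseen in old
)

# Categorical conformity: levels in new that never appear in old
unseen_levels <- setdiff(unique(new$cat_1), unique(old$cat_1))
unseen_levels

# Continuous: rank-based (Wilcoxon rank-sum) and parametric (Welch t-test)
p_rank  <- wilcox.test(new$nm_1, old$nm_1)$p.value
p_param <- t.test(new$nm_1, old$nm_1)$p.value
c(rank_based = p_rank, parametric = p_param)
```

In a real diff these tests would run per column, so some correction for multiple comparisons would be worth considering before flagging a column as changed.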
Multivariate diff
A multivariate approach to information content comparison has the most potential for enabling data developers to make actionable comparisons between data sets. However, that potential may not be easily or fully realized given the degree of complexity involved. Among the multivariate approaches, one may dichotomize the problem into cases involving plain continuous data, where the generative model can be reasonably approximated by a multivariate distribution (i.e. not censored data or repeated measures such as time series), and everything else.
diff for plain continuous data
Approaches such as matrix factorization or distance-based anomaly detection could be leveraged. An example would be the MSD (Modified Stahel-Donoho) estimator [1].
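To illustrate the distance-based idea, the sketch below scores new records by their classical Mahalanobis distance from the old data. This is a deliberately simpler, non-robust stand-in and not the MSD estimator; the simulated data and the 99.9% chi-square cutoff are assumptions chosen for the example.

```r
set.seed(7)
# 100 old records with three continuous columns
old <- matrix(rnorm(300), ncol = 3,
              dimnames = list(NULL, c("nm_1", "nm_2", "nm_3")))
new <- rbind(
  matrix(rnorm(30), ncol = 3),            # 10 records drawn like the old data
  matrix(rnorm(30, mean = 5), ncol = 3)   # 10 records shifted by 5
)

# Squared Mahalanobis distance of each new record from the old center,
# using the old covariance matrix
d2 <- mahalanobis(new, center = colMeans(old), cov = cov(old))

# Flag records beyond the 99.9% chi-square quantile (df = number of columns)
cutoff  <- qchisq(0.999, df = ncol(old))
flagged <- d2 > cutoff
which(flagged)
```

Because the center and covariance are estimated from the old data only, the score answers the diff question directly: how unusual is each new record relative to what was there before? A robust estimator such as MSD would matter when the old data themselves contain outliers.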
diff for anything but plain continuous data
Here, diff is likely to require information to be supplied by the user to the algorithm for optimal results. For example, if a time-course structure is known, one may be able to make projections or interpolations and evaluate how likely or unlikely the new data are compared with the old data.
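As one hedged illustration of the time-course case, suppose the user tells the algorithm that the old data follow a linear trend over time. A model fitted to the old data can then project forward, and new observations outside the prediction interval can be flagged. The linear trend, the simulated values and the 99% level are all assumptions made for this sketch.

```r
set.seed(3)
# Old data: a linear trend in time plus noise
old <- data.frame(t = 1:30)
old$y <- 2 + 0.5 * old$t + rnorm(30, sd = 0.5)

# New data: four later time points, the third far off the trend
new <- data.frame(t = 31:34, y = c(17.6, 18.1, 25.0, 19.0))

# Fit on old data only, then project onto the new time points
fit  <- lm(y ~ t, data = old)
pred <- predict(fit, newdata = new, interval = "prediction", level = 0.99)

# A new observation is unlikely if it falls outside the prediction interval
new$outside <- new$y < pred[, "lwr"] | new$y > pred[, "upr"]
new
```

The same pattern applies to richer structures: the user supplies the model family, the old data supply the fit, and the diff reports how surprising the new data are under that fit.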
[1] Wada, Kazumi, and Hiroe Tsubaki. “Parallel computation of modified Stahel-Donoho estimators for multivariate outlier detection.” 2013 International Conference on Cloud Computing and Big Data. IEEE, 2013.