Version SHAs: Data SHA vs. Metadata SHA
Source: vignettes/design-version-shas.Rmd
Companion to: Month 2 Arrives. Read this when you’ve noticed that
datom_history() shows a version column and a data_sha column, and you want to know why there are two.
datom uses two SHAs to identify a table version:
- data_sha – a hash of the parquet bytes. Identifies the data.
- metadata_sha – a hash of the metadata describing the data. Identifies the version.
This article explains why one SHA isn’t enough, and the small set of implementation details that keep both stable across machines, sessions, and JSON round-trips.
Why two SHAs
Imagine you write the same dm data frame on two
different days, with two different commit messages and two different
parents lists (because you regenerated it from different
upstream sources). The bytes are identical. Are these “the same
version”?
datom’s answer:
- data_sha is the same – the parquet content is byte-identical.
- metadata_sha is different – the surrounding context (parents, message, table type) is different.
So the same data can appear under multiple versions, but each version is unique. This matters because:
- Deduplication. Two versions with the same data_sha share one parquet file in the store. You don’t pay storage cost twice.
- Provenance. A read pinned by version (metadata_sha) gets you not just the bytes but the lineage that produced them. Two analyses that used the “same data via different upstream paths” can be told apart.
- Change detection. datom_write() checks both. If data_sha and metadata_sha are unchanged, it’s a no-op. If only metadata_sha changed (e.g., new parents), datom records a new version without re-uploading the parquet.
When you call datom_read(conn, "dm", version = ...), the
version argument is the metadata_sha. The
lookup goes metadata_sha -> metadata entry ->
data_sha -> parquet file.
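As a usage sketch (hedged: beyond what this article already shows, the exact signature of datom_history() is an assumption, and conn stands for a connection created elsewhere), pinning a read to a version and spotting shared data looks roughly like this:

```r
# Hedged usage sketch, not canonical API documentation.
hist <- datom_history(conn, "dm")          # assumed to return version / data_sha columns

v  <- hist$version[1]                      # a metadata_sha
dm <- datom_read(conn, "dm", version = v)  # the bytes plus the lineage that produced them

# Versions that share bytes show up with the same data_sha:
hist[hist$data_sha == hist$data_sha[1], c("version", "data_sha")]
```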
Computing data_sha
The data SHA hashes the parquet representation, not the in-memory data frame. Concretely (from R/utils-sha.R):
```r
arrow::write_parquet(prepared, tmp)
digest::digest(file = tmp, algo = "sha256")
```

A few properties this gives us:
- Cross-language stability. Any language that produces the same parquet bytes produces the same SHA. This is what makes datom usable from R today and Python tomorrow without coordination.
- Column and row order matter by default. A reordered data frame is a different table. There are sort_columns and sort_rows arguments for callers who want order-independent identity, but the default is “what you wrote is what you get” (see the sketch after this list).
- Empty data frames are rejected. Computing a SHA over zero rows or zero columns is almost always a bug; datom errors instead.
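A minimal sketch of the same recipe, with the empty-frame guard and sort options described above; the helper name compute_data_sha and the preparation steps are illustrative, not the package’s internals:

```r
# Illustrative sketch: mirrors the write_parquet + digest recipe above, plus
# the empty-frame guard and optional order normalisation.
compute_data_sha <- function(df, sort_columns = FALSE, sort_rows = FALSE) {
  if (nrow(df) == 0 || ncol(df) == 0) {
    stop("refusing to hash an empty data frame")
  }

  prepared <- df
  if (sort_columns) prepared <- prepared[, sort(names(prepared)), drop = FALSE]
  if (sort_rows)    prepared <- prepared[do.call(order, prepared), , drop = FALSE]
  rownames(prepared) <- NULL                 # hash content, not row labels

  tmp <- tempfile(fileext = ".parquet")
  on.exit(unlink(tmp), add = TRUE)
  arrow::write_parquet(prepared, tmp)
  digest::digest(file = tmp, algo = "sha256")
}

df <- data.frame(id = 1:3, value = c("a", "b", "c"))
compute_data_sha(df) == compute_data_sha(df[3:1, ])  # FALSE: order matters by default
```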
Computing metadata_sha
The metadata SHA is more careful. It has to satisfy three properties that don’t fall out automatically:
- JSON round-trip stability. A reader pulling metadata from S3 gets a JSON-parsed list, not the original R object. R’s serialize() is type-sensitive (10L and 10 serialize differently); JSON-parsed data drops that distinction. If we hashed the R object directly, the same metadata would have different SHAs in-memory vs. round-tripped.
- Field-order independence. Two writers building metadata in different orders should produce the same SHA.
- Volatile-field exclusion. Two writes of identical content at different times must hash to the same value, so version dedup works.
The implementation handles all three:
```r
.datom_compute_metadata_sha <- function(metadata) {
  # 1. Strip volatile fields that don't define content identity
  volatile <- c("created_at", "datom_version")
  semantic <- metadata[setdiff(names(metadata), volatile)]

  # 2. Sort field names for order independence
  sorted_metadata <- semantic[sort(names(semantic))]

  # 3. Hash the JSON canonical form, not the R object
  canonical <- jsonlite::toJSON(sorted_metadata, auto_unbox = TRUE)
  digest::digest(canonical, algo = "sha256", serialize = FALSE)
}
```

The serialize = FALSE is the load-bearing detail. It
tells digest() to hash the bytes of the JSON string, not an
R-serialized representation. That’s what makes the SHA portable across
the git -> S3 -> reader path.
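To see the round-trip property concretely, here is a small demonstration reusing the function above (the field names and SHA values are hypothetical, not a guaranteed schema): hashing the writer’s in-memory metadata and the reader’s JSON-parsed copy gives the same result.

```r
meta <- list(
  table_name = "dm",
  data_sha   = "9f8e7d...",                  # hypothetical placeholder
  parents    = c("a1b2c3...", "d4e5f6..."),  # hypothetical placeholders
  message    = "regenerated from upstream",
  created_at = "2024-05-01T12:00:00Z"        # volatile: excluded from the SHA
)

sha_writer <- .datom_compute_metadata_sha(meta)

# Simulate the reader: the metadata travels as JSON (e.g., via S3) and is parsed back.
json   <- jsonlite::toJSON(meta, auto_unbox = TRUE)
parsed <- jsonlite::fromJSON(json)

sha_reader <- .datom_compute_metadata_sha(parsed)
identical(sha_writer, sha_reader)  # TRUE: the JSON round-trip does not change the SHA
```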
The volatile-field list
Two fields are explicitly excluded from the SHA:
- created_at – a wall-clock timestamp.
- datom_version – the version of the datom package that wrote the metadata.
If either of these participated in the SHA, every re-write of
identical data would record a “new” version, and
version_history.json would grow forever. The volatile list
is short and tightly scoped on purpose; adding a field to it is the kind
of change that needs explicit justification.
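A quick illustration (values are hypothetical): two metadata lists that differ only in created_at hash identically, which is exactly what lets identical re-writes deduplicate.

```r
m1 <- list(table_name = "dm", data_sha = "9f8e7d...", created_at = "2024-05-01T09:00:00Z")
m2 <- list(table_name = "dm", data_sha = "9f8e7d...", created_at = "2024-06-01T17:30:00Z")

identical(.datom_compute_metadata_sha(m1),
          .datom_compute_metadata_sha(m2))  # TRUE: the timestamp does not participate
```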
The dedup guard
Even with the volatile list correct,
.datom_write_metadata_local() adds a belt-and-suspenders
check: if the most recent entry in version_history.json
already has the same metadata_sha, the new entry is
not appended. This protects against:
- Bugs in metadata construction that accidentally produce identical SHAs.
- Concurrent writers racing on the same git clone (rare, but the cost of the check is trivial).
- Re-runs of a partially-failed write where some steps succeeded and others didn’t.
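In code, the guard amounts to something like the following sketch (not the package’s implementation; the helper name append_version_if_new and the assumption that version_history.json is a plain JSON array of entries are illustrative):

```r
append_version_if_new <- function(history_path, metadata) {
  history <- if (file.exists(history_path)) {
    jsonlite::fromJSON(history_path, simplifyVector = FALSE)
  } else {
    list()
  }

  new_sha <- .datom_compute_metadata_sha(metadata)
  latest  <- if (length(history) > 0) history[[length(history)]] else NULL

  # Belt-and-suspenders: the most recent entry already records this metadata_sha -> no-op.
  if (!is.null(latest) && identical(latest$metadata_sha, new_sha)) {
    return(invisible(FALSE))
  }

  metadata$metadata_sha <- new_sha
  history[[length(history) + 1]] <- metadata
  jsonlite::write_json(history, history_path, auto_unbox = TRUE)
  invisible(TRUE)
}
```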
The guard relies on metadata_sha correctness. If you
ever add a new field to metadata that affects content identity, make
sure it’s not in the volatile list – otherwise dedup will
incorrectly mark new versions as duplicates.
What you can rely on
- A version is uniquely identified by its metadata_sha.
- The same data under different lineage produces a different metadata_sha but the same data_sha.
- Identical writes (same data, same metadata) are no-ops at every layer: same parquet filename in the store, no new history entry, no new git commit.
- A reader on a different machine pulling metadata from S3 computes the same metadata_sha as the writer.
These four properties are the foundation that everything in Auditing & Reproducibility eventually depends on.