datom is version control for tabular data. It treats every write as a content-addressed snapshot, keeps the audit trail in git, and stores the bytes wherever you tell it to live – your laptop, S3, or any backend datom learns to speak. Whatever you write today is readable, by SHA, forever.

It was built for clinical and scientific workflows where reproducing a historical analysis is not optional, but nothing in the design is domain-specific.

Why datom

  • Versioned tables out of the box. Every datom_write() is a new, immutable, hash-named version. Yesterday’s data is still there.
  • Free duplicate detection. Re-writing a table you have already written costs nothing – datom recognizes the content and skips.
  • Code in git, data wherever. Metadata is diffable and reviewable in a normal git workflow. Parquet bytes go to S3, a local directory, or (eventually) other cloud backends.
  • Two roles, one model. Data engineers write; analysts and downstream pipelines read. Both work against the same versioned source of truth.

Installation

# install.packages("pak")
pak::pak("amashadihossein/datom")

A two-minute tour

What this tour builds: a shared, versioned data space with reproducible reads for multiple engineers and analysts – coordinated through a single git history. Every datom_write() is a commit; every datom_read() resolves to an exact content SHA. No one can silently overwrite history.

Before you start: you need a GitHub personal access token (PAT) scoped to repo. The token does not have to live in your OS keychain, but keyring is recommended for security and convenience.

keyring::key_set("GITHUB_PAT")          # one-time setup
nzchar(keyring::key_get("GITHUB_PAT"))  # verify -- should return TRUE

See the Credentials in Practice article for a step-by-step walkthrough of PAT creation and keyring setup.
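
If you prefer not to use keyring, a plain environment variable works too. A minimal alternative, assuming GITHUB_PAT is exported in your shell or set in ~/.Renviron:

# Read the PAT from an environment variable instead of the keychain.
# Assumes GITHUB_PAT is set in your shell or in ~/.Renviron.
pat <- Sys.getenv("GITHUB_PAT")
stopifnot(nzchar(pat))  # fail fast if the variable is unset
# Pass `pat` as github_pat when you build the store below.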

library(datom)
library(fs)

# Two paths, two roles:
#   dev_dir  -- your local workspace for this project (stays on your machine)
#   data_dir -- where the actual data lives; point this at a shared location
#               (network drive, S3, etc.) for team access. Temp dir used here
#               for demonstration -- replace with a real path when you're ready.
dev_dir  <- path(tempdir(), "study_001_dev")
data_dir <- path(tempdir(), "study_001_data")
dir_create(data_dir)

# Build a store (the only credential needed is your GitHub PAT).
data_component <- datom_store_local(path = data_dir)

store <- datom_store(
  governance = NULL,
  data       = data_component,
  github_pat = keyring::key_get("GITHUB_PAT")
)

# Initialize a project: registers it on GitHub and sets up your dev workspace.
datom_init_repo(
  path         = dev_dir,
  project_name = "STUDY_001",
  store        = store,
  create_repo  = TRUE,
  repo_name    = "study-001-data"
)

conn <- datom_get_conn(path = dev_dir, store = store)

# Explore what was created before writing anything.
fs::dir_tree(dev_dir)   # your local workspace
fs::dir_tree(data_dir)  # data storage (empty until first write)

Now write a table – twice – and watch datom do the right thing:

dm <- datom_example_data("dm", cutoff_date = "2026-01-28")

datom_write(conn, data = dm, name = "dm", message = "Initial DM extract")
#> v Wrote "dm" (full): "a8ee7a31"

# Same data again. datom recognizes it and skips.
datom_write(conn, data = dm, name = "dm")
#> i No changes detected for "dm". Skipping write.

datom_list(conn)
#>   name current_version current_data_sha last_updated
#> 1   dm         a8ee7a31         4b6d0a7e 2026-01-28T...

# datom reads back data as a tibble. Use tibble::as_tibble() on the
# original for a clean round-trip comparison.
identical(datom_read(conn, "dm"), tibble::as_tibble(dm))
#> [1] TRUE

Three things just happened that are worth pausing on:

  1. Reproducibility is now built in – every write minted a SHA tied to the data itself. That SHA is how you read back, list, diff, and audit. Same data on any machine returns the same SHA; same SHA always returns the same bytes. datom_read(), datom_list(), and datom_history() are all just different views into the same versioned record – see the sketch after this list.
  2. Idempotent writes – re-writing the same data was a free no-op. Pipelines are safe to re-run without polluting history.
  3. Data as code – the version history is in git: diffable, reviewable, and shareable like any other code asset. The data bytes stayed in data_dir – nothing sensitive went to GitHub. Before you tear down, open https://github.com/<your-username>/study-001-data and look at the commits: you will see the full audit trail with no data bytes in sight.
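
A quick sketch of those different views, using only the objects created above. The call shapes are assumptions based on the function names in point 1 – the version argument in particular is hypothetical, so check the reference docs for exact signatures:

# Different views into the same versioned record. Argument names below are
# assumptions -- consult the reference docs for the exact signatures.
datom_history(conn, "dm")                     # the full version trail for "dm"
datom_read(conn, "dm")                        # latest version
datom_read(conn, "dm", version = "a8ee7a31")  # hypothetical: pin a read to a SHA

# The audit trail itself is ordinary git history. Assuming dev_dir is the git
# clone that datom_init_repo() set up, gert is one way to inspect it locally.
gert::git_log(repo = dev_dir)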

Teardown

Pick one:

# Option A -- full scripted teardown (deletes local files AND the GitHub repo).
# Do this BEFORE unlink().
datom_decommission(conn, confirm = "STUDY_001")

# Option B -- local only (GitHub repo stays; delete it manually from the UI).
unlink(c(dev_dir, data_dir), recursive = TRUE)

Do not call unlink() before datom_decommission(): removing the local clone first strips the reference to the GitHub remote, so the remote repo will not be deleted.
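
If you script teardown often, a small wrapper makes the ordering mistake impossible. teardown_study() below is a hypothetical helper assembled from the calls above, not part of datom:

# Hypothetical helper: enforce the safe order -- decommission first, unlink last.
teardown_study <- function(conn, project_name, dirs) {
  datom_decommission(conn, confirm = project_name)  # remote teardown first
  unlink(dirs, recursive = TRUE)                    # local cleanup last
}
teardown_study(conn, "STUDY_001", c(dev_dir, data_dir))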

Where to go next

The Get Started articles follow a single user journey through six months of a clinical study, from a first extract on a laptop through a full multi-engineer, multi-project governance workflow.

When you are ready for                            Article
A second monthly extract – versioning in action   Month 2 Arrives
Importing a folder of extracts at once            A Folder of Extracts
Moving from local storage to S3                   Promoting to S3
Sharing data with statisticians                   Handing Off to a Statistician
Governing a portfolio of studies                  Governing a Portfolio

For the design rationale – why two repos, what ref.json does, how SHAs are computed – see the Design articles on the same site.
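
As a rough intuition for content addressing (an illustration only, not datom's actual scheme – that lives in the Design articles): the identifier is a hash of the content itself, so equal content always yields an equal identifier.

library(digest)

# Illustration only -- not datom's hashing scheme. Hashing the table gives an
# identifier that depends on the content alone, not on where or when it was
# written.
dm <- datom_example_data("dm", cutoff_date = "2026-01-28")
digest(dm, algo = "sha1")                        # a content-derived identifier
identical(digest(dm, algo = "sha1"),
          digest(dm, algo = "sha1"))             # same content, same SHA
#> [1] TRUE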

Where datom fits

datom is the foundational layer for the daapr ecosystem.

Package    Purpose
datom      Version-controlled table storage (this package)
dpbuild    Data product construction
dpdeploy   Deployment orchestration
dpi        Data product access

See dev/datom_specification.md for the full technical specification.