datom is version control for tabular data. It treats every write as a content-addressed snapshot, keeps the audit trail in git, and stores the bytes wherever you tell it to live – your laptop, S3, or any backend datom learns to speak. Whatever you write today is readable, by SHA, forever.
It was built for clinical and scientific workflows where reproducing a historical analysis is not optional, but nothing in the design is domain-specific.
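The "content-addressed snapshot" idea is simple enough to show in a toy model. The sketch below is an illustration only, not datom's actual implementation (datom stores Parquet bytes and keeps its audit trail in git): the version id is just a hash of the content, so identical data always maps to the same id, and re-writing it is a free no-op.

```python
import hashlib

class ContentStore:
    """Toy content-addressed store: the version id IS the hash of the bytes."""

    def __init__(self):
        self.objects = {}   # sha -> bytes (immutable snapshots)
        self.history = []   # append-only audit trail of (name, sha)

    def write(self, name, data):
        sha = hashlib.sha256(data).hexdigest()[:8]
        if sha in self.objects:           # duplicate content: skip, free no-op
            return sha
        self.objects[sha] = data          # a new immutable version
        self.history.append((name, sha))
        return sha

    def read(self, sha):
        return self.objects[sha]          # same SHA always returns the same bytes

store = ContentStore()
v1 = store.write("dm", b"subject,arm\n001,A\n")
v2 = store.write("dm", b"subject,arm\n001,A\n")  # re-write of identical bytes
print(v1 == v2, len(store.history))              # -> True 1
```

Because the id is derived from the content rather than from a counter or timestamp, the same data produces the same id on any machine, which is what makes historical reads reproducible.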
## Why datom

- Versioned tables out of the box. Every `datom_write()` is a new, immutable, hash-named version. Yesterday's data is still there.
- Free duplicate detection. Re-writing a table you have already written costs nothing – datom recognizes the content and skips.
- Code in git, data wherever. Metadata is diff-able and reviewable in a normal git workflow. Parquet bytes go to S3, a local directory, or (eventually) other cloud backends.
- Two roles, one model. Data engineers write; analysts and downstream pipelines read. Both work against the same versioned source of truth.
## Installation

```r
# install.packages("pak")
pak::pak("amashadihossein/datom")
```

## A two-minute tour
What this tour builds: a shared, versioned data space with reproducible reads for multiple engineers and analysts – coordinated through a single git history. Every `datom_write()` is a commit; every `datom_read()` resolves to an exact content SHA. No one can silently overwrite history.
Before you start: you need a GitHub personal access token (PAT) with `repo` scope. The token doesn't have to live in your OS keychain, but using keyring is recommended for security and ease of use.
```r
keyring::key_set("GITHUB_PAT")         # one-time setup
nzchar(keyring::key_get("GITHUB_PAT")) # verify -- should return TRUE
```

See the Credentials in Practice article for a step-by-step walkthrough of PAT creation and keyring setup.
```r
library(datom)
library(fs)

# Two paths, two roles:
#   dev_dir  -- your local workspace for this project (stays on your machine)
#   data_dir -- where the actual data lives; point this at a shared location
#               (network drive, S3, etc.) for team access. Temp dir used here
#               for demonstration -- replace with a real path when you're ready.
dev_dir <- path(tempdir(), "study_001_dev")
data_dir <- path(tempdir(), "study_001_data")
dir_create(data_dir)

# Build a store (the only credential needed is your GitHub PAT).
data_component <- datom_store_local(path = data_dir)
store <- datom_store(
  governance = NULL,
  data = data_component,
  github_pat = keyring::key_get("GITHUB_PAT")
)

# Initialize a project: registers it on GitHub and sets up your dev workspace.
datom_init_repo(
  path = dev_dir,
  project_name = "STUDY_001",
  store = store,
  create_repo = TRUE,
  repo_name = "study-001-data"
)

conn <- datom_get_conn(path = dev_dir, store = store)

# Explore what was created before writing anything.
fs::dir_tree(dev_dir)  # your local workspace
fs::dir_tree(data_dir) # data storage (empty until first write)
```

Now write a table – twice – and watch datom do the right thing:
```r
dm <- datom_example_data("dm", cutoff_date = "2026-01-28")

datom_write(conn, data = dm, name = "dm", message = "Initial DM extract")
#> v Wrote "dm" (full): "a8ee7a31"

# Same data again. datom recognizes it and skips.
datom_write(conn, data = dm, name = "dm")
#> i No changes detected for "dm". Skipping write.

datom_list(conn)
#>   name current_version current_data_sha last_updated
#> 1   dm        a8ee7a31         4b6d0a7e 2026-01-28T...

# datom reads back data as a tibble. Use tibble::as_tibble() on the
# original for a clean round-trip comparison.
identical(datom_read(conn, "dm"), tibble::as_tibble(dm))
#> [1] TRUE
```

Three things just happened that are worth pausing on:
- Reproducibility is now built in – every write minted a SHA tied to the data itself. That SHA is how you read back, list, diff, and audit. Same data on any machine returns the same SHA; the same SHA always returns the same bytes. `datom_read()`, `datom_list()`, and `datom_history()` are all just different views into the same versioned record.
- Idempotent writes – re-writing the same data was a free no-op. Pipelines are safe to re-run without polluting history.
- Data as code – the version history is in git: diffable, reviewable, and shareable like any other code asset. The data bytes stayed in `data_dir` – nothing sensitive went to GitHub. Before you tear down, open `https://github.com/<your-username>/study-001-data` and look at the commits: you will see the full audit trail with no data bytes in sight.
## Teardown

Pick one:

```r
# Option A -- full scripted teardown (deletes local files AND the GitHub repo).
# Do this BEFORE unlink().
datom_decommission(conn, confirm = "STUDY_001")

# Option B -- local only (GitHub repo stays; delete it manually from the UI).
unlink(c(dev_dir, data_dir), recursive = TRUE)
```

Do not call `unlink()` before `datom_decommission()` – removing the local clone first strips the GitHub remote reference, and the remote repo will not be deleted.
## Where to go next
The Get Started articles follow a single user journey through six months of a clinical study, from a first extract on a laptop through a full multi-engineer, multi-project governance workflow.
| When you are ready for | Article |
|---|---|
| A second monthly extract – versioning in action | Month 2 Arrives |
| Importing a folder of extracts at once | A Folder of Extracts |
| Moving from local storage to S3 | Promoting to S3 |
| Sharing data with statisticians | Handing Off to a Statistician |
| Governing a portfolio of studies | Governing a Portfolio |
For the design rationale – why two repos, what `ref.json` does, how SHAs are computed – see the Design articles on the same site.
## Where datom fits
datom is the foundational layer for the daapr ecosystem.
| Package | Purpose |
|---|---|
| datom | Version-controlled table storage (this package) |
| dpbuild | Data product construction |
| dpdeploy | Deployment orchestration |
| dpi | Data product access |
See `dev/datom_specification.md` for the full technical specification.