Where we left off: STUDY-001 is on S3. Four tables, version 1 each.
A statistician on your team has been asked to run an interim analysis. She needs the latest dm/ex/lb/ae snapshots, today, and she needs to be able to re-create the same snapshot three months from now when the safety review asks where the numbers came from.
This article is a role-switch article. You stop
being the engineer for a moment; you become the statistician on a
different laptop, in a different R session, with read-only credentials.
The capabilities introduced are not new functions – they’re a new way of
using datom_get_conn() and datom_read().
What the engineer sends
You (the engineer) message the statistician three pieces of information:
- Governance repo URL – https://github.com/acme/datom-governance.git.
- Data bucket / prefix / region – study-001-datom, "" (empty prefix), us-east-1.
- Project name – STUDY_001.
Plus the credentials she’ll need to set up herself:
- A GitHub PAT with read access to datom-governance and the data repo. (Personal account, organization member; no special scope beyond repo read.)
- An AWS profile with read access to the data bucket.
You do not send her any data files. The point of datom is that there’s nothing to send – she pulls bytes herself.
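One convenient way to hand these over is as a block of R the statistician can paste straight into her analysis script. This is just a sketch for bookkeeping; the variable and field names below are illustrative, not part of the datom API:
# Handoff from the engineer: three pieces of information, no data files.
# (Names here are illustrative; nothing in this list is a datom function.)
handoff <- list(
  governance_repo = "https://github.com/acme/datom-governance.git",
  bucket          = "study-001-datom",
  prefix          = "",            # empty prefix: tables live at the bucket root
  region          = "us-east-1",
  project         = "STUDY_001"
)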
What the statistician does
The remainder of this article is from the statistician’s side.
Open a different R session (or a fresh
tempdir()) to follow along.
Set up credentials
# One-time per machine
keyring::key_set("GITHUB_PAT")
keyring::key_set("AWS_ACCESS_KEY_ID")
keyring::key_set("AWS_SECRET_ACCESS_KEY")Resume the prior state
Resume the prior state
The resume script for article 5 does the work of building a reader conn against the S3 store. Unlike resume scripts 2-4, this one needs network access (S3 + GitHub).
state <- source(
system.file("vignette-setup", "resume_article_5.R", package = "datom")
)$value
reader_conn <- state$conn
print(reader_conn)
#> -- datom connection
#> * Project: "STUDY_001"
#> * Role: "reader"
#> * Backend: "s3"
#> * Root: "study-001-datom"
#> * Prefix: ""The conn’s role is "reader". Reader
connections do not have a local data clone. The statistician never ran
datom_init_repo() and never wrote anything; she just
constructed a store and asked datom for a connection.
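For readers curious what the resume script is roughly doing, here is a sketch. Only datom_get_conn() is named in this article; the store constructor datom_store_s3() and the exact argument names are hypothetical stand-ins for whatever the earlier engineering articles used:
# A sketch of the resume script's job: describe the S3 store, then ask
# datom for a read-only connection. datom_store_s3() and the argument
# names are hypothetical; datom_get_conn() is the entry point named above.
store <- datom_store_s3(
  bucket = "study-001-datom",
  prefix = "",
  region = "us-east-1"
)
reader_conn <- datom_get_conn(store, project = "STUDY_001", role = "reader")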
Read the latest of each table
library(datom)
dm <- datom_read(reader_conn, "dm")
ex <- datom_read(reader_conn, "ex")
lb <- datom_read(reader_conn, "lb")
ae <- datom_read(reader_conn, "ae")
nrow(dm)
#> [1] 14
Same data the engineer wrote. Same parquet bytes. No CSV transfer, no “version 3 with the patches we applied.” One source of truth.
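With four tables, one call per table reads fine; if a study carries more domains, the same reads batch naturally. A small sketch in plain base R, using nothing datom-specific beyond datom_read():
# Read every domain in one pass; setNames() keeps the result addressable
# as tables$dm, tables$ex, and so on.
domains <- c("dm", "ex", "lb", "ae")
tables  <- setNames(lapply(domains, function(d) datom_read(reader_conn, d)), domains)
nrow(tables$dm)
#> [1] 14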
Pin the analysis to a version
The interim analysis report needs a paragraph that says “data was
extracted from STUDY_001 at version X.” That version is the metadata SHA
from datom_history():
hist_dm <- datom_history(reader_conn, "dm")
hist_dm$version[1L]
#> [1] "8a3b21cc9f..."The statistician records 8a3b21cc9f... (and the
equivalent for ex/lb/ae) in her analysis script. Three months from now,
when the auditor asks “where did this number come from,” she runs:
dm_at_analysis <- datom_read(reader_conn, "dm", version = "8a3b21cc9f...")
…and gets back the exact bytes the analysis used, even if STUDY_001 has moved on through versions 4, 5, 6.
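A tidy way to record all four pins is a named vector near the top of the analysis script. This is a style suggestion, not datom machinery; the strings below are placeholders standing in for the real $version values from datom_history():
# Versions recorded at analysis time. Each string is a placeholder for
# the $version[1L] that datom_history() reported for that table.
pins <- c(
  dm = "8a3b21cc9f...",
  ex = "<ex sha>",
  lb = "<lb sha>",
  ae = "<ae sha>"
)
# Re-create the exact analysis snapshot, table by table.
snapshot <- Map(
  function(tbl, v) datom_read(reader_conn, tbl, version = v),
  names(pins), pins
)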
Why this matters
Three things just happened that don’t happen with shared CSVs:
- No copy was made. The statistician’s analysis pulls directly from the canonical store. No “is your CSV the same as my CSV?” conversation ever happens.
- The version is unambiguous. The same SHA references the same bytes on every machine, forever. This is the audit story regulators want.
- The handoff is one-way. The reader role has no write capability – the statistician cannot accidentally create a fork by saving a new version. The engineer remains the data steward.