
S3 is where datom turns into shared, governed infrastructure. By the end of this article you will be able to:

  • Give a teammate read-only access to the project – your cloud admin generates scoped credentials, the teammate passes them to datom_store_s3(), and datom’s reader role works out of the box.
  • Register the project in a shared governance portfolio, discoverable by name across your organization.

Moving to S3 also lays the foundation for capabilities datom will integrate over time: access logs, automated retention rules, cross-account replication. None of that requires any code changes on your part – it follows from choosing object storage as the backend.

From your code’s perspective, nothing changes: the same datom_write(), datom_read(), and datom_history() calls you’ve used so far keep working.

Two ways to read this article

  • Starting fresh on S3 – skip the promotion sections (labeled below); jump straight to Set up AWS credentials.
  • Promoting an existing local-filesystem project to S3 – follow in order. You’ll snapshot the current data, retire the local project, then re-establish it on S3.

Both paths converge at Build the S3 store.

Where we left off (promotion path): STUDY_001 has four tables (dm, ex, lb, ae), all in a local filesystem store. The data git repo is on GitHub. No governance layer is attached yet – articles 1-3 stayed deliberately local-only.

What promotion looks like today (promotion path only)

Starting fresh on S3? Skip to Set up AWS credentials.

A built-in, history-preserving migration (datom_migrate_data()) is planned but not yet shipped. Today, promoting a project means:

  1. Snapshot the current version of each table.
  2. Retire the local project (datom_decommission()).
  3. Initialize a new project on S3 with the same name.
  4. Re-write the snapshotted tables as version 1 on S3.

The trade-off is that per-table version history from the local era is not carried forward – only the latest version of each table is. For a study with a few months of extracts this is cheap; the git commit log preserves the narrative even when the data history restarts.

If preserving full per-version history across the move matters to you right now, the cleaner path is to start a new project directly on S3 (fresh-start path) and write your data there going forward.

Set up AWS credentials

datom_store_s3() takes access_key and secret_key as plain strings. How you supply them is up to you:

# Option A: inline (fine for interactive sessions; don't commit to git)
access_key <- "AKIAIOSFODNN7EXAMPLE"
secret_key <- "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"

# Option B: environment variables (CI/CD, Docker)
access_key <- Sys.getenv("AWS_ACCESS_KEY_ID")
secret_key <- Sys.getenv("AWS_SECRET_ACCESS_KEY")

# Option C: keyring (recommended for interactive developer machines)
access_key <- keyring::key_get("AWS_ACCESS_KEY_ID")
secret_key <- keyring::key_get("AWS_SECRET_ACCESS_KEY")

The rest of this article uses the keyring form as a placeholder – substitute whichever pattern fits your environment.

You will also need:

  • A bucket you can read and write to. datom does not create buckets – bucket lifecycle (encryption, versioning, retention) is your organization’s policy domain, not datom’s.
  • A prefix within the bucket. For raw clinical data we recommend one bucket per study with an empty prefix at the bucket root. Derived data products (ADaM, TLF) then live under named prefixes (adam/, tlf/) in the same bucket. See Buckets and Prefixes for the full convention and alternatives.
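
For the derived-data case mentioned above, only the prefix changes – a sketch, using the study bucket name that appears later in this article and the same credential pattern (the adam/ store itself is hypothetical here):

# Hypothetical derived-data store: same bucket as the raw data, named prefix
adam_data <- datom_store_s3(
  bucket     = "study-001-datom",
  prefix     = "adam/",
  region     = "us-east-1",
  access_key = keyring::key_get("AWS_ACCESS_KEY_ID"),
  secret_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
)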

The full credential reference – including scoped reader credentials and how to handle assume-role flows – is in Credentials in Practice.

Resume the prior state (promotion path only)

Starting fresh on S3? Skip to Build the S3 store.

state <- source(
  system.file("vignette-setup", "resume_article_4.R", package = "datom")
)$value

old_conn <- state$conn
dev_dir  <- state$dev_dir

old_conn is the local-backend conn from article 3. We’ll use it to read the four current tables, then decommission it.

Snapshot the current data (promotion path only)

Before tearing anything down, capture the latest version of each table in memory:

snapshot <- list(
  dm = datom_read(old_conn, "dm"),
  ex = datom_read(old_conn, "ex"),
  lb = datom_read(old_conn, "lb"),
  ae = datom_read(old_conn, "ae")
)
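
A quick sanity check before the destructive step below costs nothing – plain base R, assuming datom_read() returns data frames as in earlier articles:

# Confirm every snapshotted table actually came back with rows
n_rows <- vapply(snapshot, nrow, integer(1))
n_rows
stopifnot(all(n_rows > 0))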

Decommission the local project (promotion path only)

datom_decommission() deletes the stored data objects, the data GitHub repo, and the local clone. It is destructive and requires you to pass the project name as confirm to proceed.

datom_decommission(old_conn, confirm = "STUDY_001")
#> i Deleting data storage objects...
#> v Data storage objects deleted.
#> i Deleting GitHub repo "your-org/study-001-data"...
#> v Deleted GitHub repo "your-org/study-001-data".
#> i Removing local clone /tmp/.../study_001_dev...
#> v Removed local clone.
#> v Decommissioned "STUDY_001".

Because no governance layer was attached, decommission is data-side only: nothing to unregister from gov.

Build the S3 store

Both paths resume here.

library(datom)
library(fs)

dev_dir <- path(tempdir(), "study_001_dev")  # fresh local clone target

aws_data <- datom_store_s3(
  bucket     = "study-001-datom",        # <-- one bucket per study (Pattern A)
  prefix     = "",                       # raw data at the bucket root
  region     = "us-east-1",
  access_key = keyring::key_get("AWS_ACCESS_KEY_ID"),
  secret_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
)

store <- datom_store(
  governance = NULL,          # no gov yet; attached below
  data       = aws_data,
  github_pat = keyring::key_get("GITHUB_PAT")
)

Initialize STUDY_001 on S3

datom_init_repo(
  path         = dev_dir,
  project_name = "STUDY_001",
  store        = store,
  create_repo  = TRUE,
  repo_name    = "study-001-data"
)

conn <- datom_get_conn(path = dev_dir, store = store)
print(conn)
#> -- datom connection
#> * Project: "STUDY_001"
#> * Role: "developer"
#> * Backend: "s3"
#> * Root: "study-001-datom"
#> * Prefix: ""
#> * Governance: not attached

The data backend is now "s3". Every datom_write() will upload parquet to S3 and every datom_read() will stream it back from S3.

Write your first tables

Promotion path: re-write the snapshotted tables as version 1 on S3.

# Promotion path only -- snapshot was captured above
datom_write(conn, snapshot$dm, "dm",
            message = "Re-establish dm on S3 (was local through 2026-03-28)")
datom_write(conn, snapshot$ex, "ex",
            message = "Re-establish ex on S3 (was local through 2026-03-28)")
datom_write(conn, snapshot$lb, "lb",
            message = "Re-establish lb on S3 (was local through 2026-03-28)")
datom_write(conn, snapshot$ae, "ae",
            message = "Re-establish ae on S3 (was local through 2026-03-28)")

Per-table version history from the local era is not carried forward – the commit messages above are your audit trail.


Fresh-start path: write the first extract directly.

# Fresh-start path only -- use the built-in example data
cutoff <- "2026-01-28"
datom_write(conn, datom_example_data("dm", cutoff_date = cutoff), "dm",
            message = paste("dm: first extract, cutoff", cutoff))
datom_write(conn, datom_example_data("ex", cutoff_date = cutoff), "ex",
            message = paste("ex: first extract, cutoff", cutoff))
datom_write(conn, datom_example_data("lb", cutoff_date = cutoff), "lb",
            message = paste("lb: first extract, cutoff", cutoff))
datom_write(conn, datom_example_data("ae", cutoff_date = cutoff), "ae",
            message = paste("ae: first extract, cutoff", cutoff))

Attach the governance layer

Now that STUDY_001 lives somewhere a teammate can reach, register it in a shared governance repo. Governance is a two-step setup:

  1. Once per organization: datom_init_gov() creates the gov GitHub repo and seeds the skeleton (projects/ directory, README, etc.). Run this the very first time anyone in your org adopts governance.
  2. Once per project: datom_attach_gov() records STUDY_001’s data location in projects/STUDY_001/ref.json and updates project.yaml so any future conn from this clone knows where gov lives.

gov_store <- datom_store_s3(
  bucket     = "acme-datom-gov",         # <-- one dedicated gov bucket per organization
  prefix     = "",                       # dedicated bucket -> empty prefix
  region     = "us-east-1",
  access_key = keyring::key_get("AWS_ACCESS_KEY_ID"),
  secret_key = keyring::key_get("AWS_SECRET_ACCESS_KEY")
)

gov_dir <- path(tempdir(), "datom-governance")  # explicit local path

# Step 1: seed the gov repo (once per organization)
gov_repo_url <- datom_init_gov(
  gov_store      = gov_store,
  gov_local_path = gov_dir,
  create_repo    = TRUE,
  repo_name      = "datom-governance",
  github_pat     = keyring::key_get("GITHUB_PAT")
)
#> v Created gov GitHub repo `datom-governance`
#> v Seeded skeleton (projects/, README.md)
#> v Pushed initial commit

# Step 2: attach this project to gov (once per project)
conn <- datom_attach_gov(
  conn           = conn,
  gov_store      = gov_store,
  gov_repo_url   = gov_repo_url,
  gov_local_path = gov_dir
)
#> v Registered STUDY_001 in governance
#> v Updated project.yaml with governance pointer

print(conn)
#> -- datom connection
#> * Project: "STUDY_001"
#> * Role: "developer"
#> * Backend: "s3"
#> * Root: "study-001-datom"
#> * Prefix: ""
#> * Gov backend: "s3"
#> * Gov root: "acme-datom-gov"

Once attached, gov cannot be detached – project.yaml’s storage.governance block is permanent. Subsequent projects in the same organization reuse the same gov repo and bucket; you only run datom_init_gov() with create_repo = TRUE once, as sketched below.
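
For a hypothetical second project, the per-project attach is all that is needed – a sketch, assuming study_002_conn is a connection built the same way as conn above, and reusing gov_store, gov_repo_url, and gov_dir from this article:

# Hypothetical second project: the gov repo and bucket already exist,
# so datom_init_gov() is NOT run again -- only the per-project attach step.
study_002_conn <- datom_attach_gov(
  conn           = study_002_conn,
  gov_store      = gov_store,       # same gov bucket as above
  gov_repo_url   = gov_repo_url,    # returned by datom_init_gov() earlier
  gov_local_path = gov_dir
)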

Confirm

datom_list(conn)
#>   name current_version current_data_sha last_updated
#> 1   ae    19f44e3a       e91d04ff         2026-04-29T...
#> 2   dm    8a3b21cc       c2e80a14         2026-04-29T...
#> 3   ex    5d72e0f1       88a73e02         2026-04-29T...
#> 4   lb    c1ffea90       4c3812dd         2026-04-29T...
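
If you followed the promotion path, you can also confirm that the per-table history really did restart – a quick check, assuming datom_history() keeps the signature used in earlier articles:

# Expect a single version per table: the S3 re-establishment write
datom_history(conn, "dm")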

Where you are

  • STUDY_001 lives on S3. Your local clone is just a working copy of the git metadata.
  • Governance is attached: STUDY_001 is registered in a shared gov repo and gov bucket. Future projects in the same organization reuse both.
  • ref.json in the gov repo points at the S3 bucket; any teammate who clones the gov repo and has S3 read credentials can discover and read the data.

In the next article, you hand the project off to a statistician who needs to read the data without write access – the canonical reader role.

Teardown

Skip this if you plan to continue to the next article – the S3 project and gov registration are the starting state for Article 5: Handing Off.

If you want to clean up:

# Remove all datom artefacts for this project (S3 data, GitHub data repo,
# local clone, gov storage prefix, gov git registration).
datom_decommission(conn, confirm = "STUDY_001")

After decommission, the only remaining artefact is the governance infrastructure itself (gov GitHub repo and gov S3 bucket root). These are shared across projects and datom does not destroy them automatically. Delete them manually once you are done with all gov-dependent articles:

# Delete the gov GitHub repo
system2("gh", c("repo", "delete", "your-org/datom-governance", "--yes"))

The gov S3 bucket content was already removed by datom_decommission(). If the bucket is otherwise empty, delete it via your AWS console or S3 management tooling.
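
If you would rather stay in R and have the AWS CLI installed, removing the now-empty bucket can be scripted the same way as the gh call above – an illustration, not a datom function:

# Remove the empty gov bucket with the AWS CLI; this fails if any objects remain
system2("aws", c("s3", "rb", "s3://acme-datom-gov"))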
