Goal:
Build, deploy and access a(n overly) simple data product to get familiarized with the concepts and functions.
User story
We are interested in a data product that simply provides distances in
cars dataset in metric unit. See ?cars
for additional
detail about the dataset.
Step 1: Initialize the project
As this is a new project, we initialize the project using
dpbuild::dp_init
. See the getting started vignette
for details of what the initialization does.
library(daapr)
board_params_set_dried <- fn_dry(board_params_set_s3(
bucket_name = "<BUCKET>",
region = "<REIGION>"
))
# Dry function call to setting credentials
creds_set_dried <- fn_dry(creds_set_aws(
key = Sys.getenv("AWS_KEY"),
secret = Sys.getenv("AWS_SECRET")
))
# Initialize dp repo
dp_repo <- dp_init(
project_path = "dp_cars",
project_description = "Cars data product",
branch_name = "us001",
branch_description = "User story 1",
readme_general_note = "Data product to explore cars stopping distance",
board_params_set_dried = board_params_set_dried,
creds_set_dried = creds_set_dried,
github_repo_url = "<GIT PATH/dp_cars.git>"
)
Step 2: Set up the working environment
At this point your project has all the basic components to provide you with a sandbox where you can do your development. It is not necessary, but it may be instructional to clean and restart your R session before this next step. Then, activate and set up the sandbox for this project.
# Switch to project directory
setwd(dp_repo)
# only necessary if you re-started your R session
if (!"daapr" %in% (.packages())) {
library("daapr")
}
# Set up "promised" env variables for remote data repository
Sys.setenv("AWS_KEY" = "<BUCKETS AWS KEY>")
Sys.setenv("AWS_SECRET" = "<BUCKETS AWS SECRET>")
# Set up env variables for remote code repository
Sys.setenv("GITHUB_PAT" = "<YOUR GITHUB PAT>")
# Retrieve configuration
config <- dpconf_get(project_path = ".")
Step 3: Add input data and sync to remote
In this step you go from whatever content you have in the
input_files
folder to metadata representation of the read
datasets. Here we only have one dataset: cars.csv
# Upload data into input_files folder
readr::write_csv(x = cars, file = "./input_files/cars.csv")
# Map all input_files content and clean file labels in the map
input_map <- dpinput_map(project_path = ".")
input_map <- inputmap_clean(input_map = input_map)
# Sync each read files to remote data repo
synced_map <- dpinput_sync(conf = config, input_map = input_map, verbose = T)
# For each sync'd dataset, record info that will help you retrieve as needed
dpinput_write(project_path = ".", input_d = synced_map)
Step 4: Build the data product
Here is where the main logic of the data product is implement and the data product is built.
# read in the input data from what is recorded by dpinput_write
data_files_read <- dpinput_read()
# build your output data
output <- data_files_read$cars(config = config) %>%
dplyr::mutate(dist_m = 0.3048 * dist)
# Structure the input, output, metadata ... you wish to have in your data product
data_object <- dp_structure(
data_files_read = data_files_read,
output = output, config = config
)
# save and log the data product built
dp_write(data_object = data_object, project_path = ".")
Why so many steps if the above chunk is the main logic? The pay off
for all you have done is this: you have built a portable recipe where
your metadata, package dependencies, data and logic are all code! Now,
by simply saving the above chunk as an R-script, let’s say named
dp_make.R
, you can have it reproduced from all the
configurations recorded without having to provide input data. Everything
now is code and can be tracked by git
.
So to make your project reproducible save the above chunk as
dp_make.R
in the main directory. Sourcing this file after
closing should be all that is needed to reproduce the data product.
Step 5: Commit and push
At this point, you can commit and push your code. NOTE: for your push to work
- You should have created the empty repo on the git remote (e.g. github)
-
Sys.getenv("GITHUB_PAT")
returns the corresponding “GITHUB_PAT”
dp_commit(project_path = ".", commit_description = "First dp build")
dp_push(project_path = ".")
Step 7: Data access
Typical access pattern starts with setting up the env vars, but for brevity here we can just use the existing config to connect to the board, get the data and list what else is on the board.
board_object <- dp_connect(board_params = config$board_params, creds = config$creds)
dp <- dp_get(board_object = board_object, data_name = "dp-cars-us001")
dp_list(board_object = board_object)