Background
A data product with a broader scope may encompass one or more other
data products. In such cases, the other data products can be imported
directly within the data processing logic, and they are not considered
input_data.
User Story
We need a single data product that combines the information we previously
derived and encoded in dp_cars-us001 with the information in dp_mtcars-us001
(a toy data product we built from the mtcars dataset).
Step 1: Initialize the project
As this is a new project, we initialize it using dpbuild::dp_init. See the
vignette on new project workflow for details of what the initialization does.
library(daapr)
# Dry function call to set the board parameters
board_params_set_dried <- fn_dry(board_params_set_s3(
  bucket_name = "<BUCKET>",
  region = "<REGION>"
))
# Dry function call to set the credentials
creds_set_dried <- fn_dry(creds_set_aws(
  key = Sys.getenv("AWS_KEY"),
  secret = Sys.getenv("AWS_SECRET")
))
# Initialize dp repo
dp_repo <- dp_init(
  project_path = "dp_xcars",
  project_description = "Cars and mtcars combined data product",
  branch_name = "us001",
  branch_description = "User story 1",
  readme_general_note = "Data product combining cars and mtcars",
  board_params_set_dried = board_params_set_dried,
  creds_set_dried = creds_set_dried
)
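The "dry" calls above are stored at initialization and only evaluated later, which is why the credentials can be supplied in the next step. The following is a minimal base R sketch of that idea, assuming fn_dry defers evaluation in roughly this way; dry_call and DEMO_KEY are hypothetical names used for illustration and are not part of daapr.

```r
# Hypothetical helper, not part of daapr: capture a call without running it.
dry_call <- function(expr) substitute(expr)

# DEMO_KEY is an illustrative variable name; it is empty at capture time.
Sys.setenv(DEMO_KEY = "")
dried <- dry_call(list(key = Sys.getenv("DEMO_KEY")))

# Nothing has been evaluated yet; `dried` is just a call object.
stopifnot(is.call(dried))

# The environment variable is read only when the call is finally evaluated.
Sys.setenv(DEMO_KEY = "abc123")
creds <- eval(dried)
stopifnot(identical(creds$key, "abc123"))
```

Because the value is looked up at evaluation time, secrets never need to be written into the stored configuration.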
Step 2: Set up the working environment
At this point, your project has all the basic components needed to provide a sandbox for development. It is not necessary, but it may be instructive to clean and restart your R session before the next step. Then activate and set up the sandbox for this project.
setwd(dp_repo)
# only necessary if you re-started your R session
if (!"daapr" %in% (.packages())) {
library("daapr")
}
# Set up "promised" env variables
Sys.setenv("AWS_KEY" = "<BUCKETS AWS KEY>")
Sys.setenv("AWS_SECRET" = "<BUCKETS AWS SECRET>")
# Retrieve configuration
config <- dpconf_get(project_path = ".")
# Set up the remote git env variable so you can push your code directly to the
# git remote. It is only used when pushing the code, but we can set it now
# while setting up the working env. This is optional: you can skip it and
# still go quite far in the workflow.
Sys.setenv("GITHUB_PAT" = "<YOUR GITHUB PAT>")
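Before connecting to the board, it can help to confirm that the "promised" variables are actually set in the current session. The following is a small base R sketch; missing_env_vars is a hypothetical helper written for this example and is not part of daapr.

```r
# Hypothetical helper, not part of daapr: return the names of environment
# variables that are unset or empty in the current session.
missing_env_vars <- function(vars) {
  values <- Sys.getenv(vars, unset = "")
  vars[!nzchar(values)]
}

# The credentials promised at initialization time.
not_set <- missing_env_vars(c("AWS_KEY", "AWS_SECRET"))
if (length(not_set) > 0) {
  warning("Not set: ", paste(not_set, collapse = ", "))
}
```

An empty result means both variables are available when the dried credential call is evaluated.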
Step 3: Build the data product
Here is where the main logic of the data product is implemented and the
data product is built. Note that if we had additional data that needed to be
included in the data product, we would map and sync it, as done when
building dp_cars, prior to building the dp_xcars logic.
board_object <- dp_connect(board_params = config$board_params, creds = config$creds)
# specify which versions of data products to use
version1 <- "<version1>"
version2 <- "<version2>"
dp_cars <- dp_get(
  board_object = board_object,
  data_name = "dp-cars-us001", version = version1
)
dp_mtcars <- dp_get(
  board_object = board_object,
  data_name = "dp-mtcars-us001", version = version2
)
# build input data by converting input data products to links
input <- list(
  dp_cars = dp_tolink(
    data_name = "dp-cars-us001",
    data_version = version1
  ),
  dp_mtcars = dp_tolink(
    data_name = "dp-mtcars-us001",
    data_version = version2
  )
)
# build your output data
output <- list(cars = dp_cars$input$cars, mt_cars = dp_mtcars$input$mtcars)
# structure the input, output, metadata ... you wish to have in your data product
data_object <- dp_structure(
  data_files_read = input,
  output = output, config = config
)
# save and log the data product built
dp_write(data_object = data_object, project_path = ".")
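Since the output list is assembled by hand, a quick sanity check before structuring it can catch a missing or misnamed element. The sketch below uses the datasets shipped with R as stand-ins, on the assumption that dp_cars$input$cars and dp_mtcars$input$mtcars are plain data frames of the same kind; output_demo is an illustrative name only.

```r
# Stand-ins for dp_cars$input$cars and dp_mtcars$input$mtcars (assumption:
# both are plain data frames like the datasets shipped with R).
output_demo <- list(cars = datasets::cars, mt_cars = datasets::mtcars)

# Every element should be a named, non-empty data frame.
stopifnot(
  !is.null(names(output_demo)),
  all(nzchar(names(output_demo))),
  all(vapply(output_demo, is.data.frame, logical(1))),
  all(vapply(output_demo, nrow, integer(1)) > 0)
)

# Quick overview of what would go into the data product.
str(output_demo, max.level = 1)
```

The same checks can be run on the real output list before it is passed on to be structured and written.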
Step 5: Commit and push
At this point, you can commit and push your code. NOTE: for your push to work
- You should have created the empty repo on the git remote (e.g. GitHub)
- Sys.getenv("GITHUB_PAT") returns the corresponding GITHUB_PAT
dp_commit(project_path = ".", commit_description = "First dp build")
dp_push(project_path = ".")
Step 7: Data access
A typical access pattern starts with setting up the env vars, but for brevity we reuse the existing config here to connect to the board, get the data, and list what else is on the board.
board_object <- dp_connect(board_params = config$board_params, creds = config$creds)
dp <- dp_get(board_object = board_object, data_name = "dp-xcars-us001")
dp_list(board_object = board_object)