library(dpi)
Background
This package aims to provide a minimalist and uniform interface for retrieving data products and their associated metadata, no matter where the data product is housed. Currently, three types of repositories are supported for housing data products: AWS s3, LabKey, and local folders.
This package leverages the pins package for accessing data. As such, it uses the concept of boards for organizing data into data spaces. As an example, a team working on multiple clinical studies may consider defining a data space, or board, for each study, each of which is defined by a governance policy. In this example, a board is similar to a directory or folder. A board can be instantiated on different data platforms (e.g. AWS s3) supported by pins or its extensions.
Because different data platforms (e.g. s3 vs. LabKey) have their own URLs and authentication methods, different parameters need to be set up to establish the connection to each remote data repository. Once the connection is established, however, all other interactions (e.g. list, get, version, etc.) use the same set of functions. With that in mind, the steps outlined below are divided into:
- Setting up the connection
- Interacting with the data product
Set up connection
Setting up a connection to data products involves setting two kinds of parameters:
- Board parameters: parameters relating to the location of the data product
- Credential parameters: parameters that the remote location checks when you try to access the data
NOTE: Before proceeding, verify that you have working credentials for the repository you need to access.
Board parameters
- For s3, you need to know the bucket name and the region.
- For LabKey, you need to know the URL of the platform and the specific folder.
- For local boards, you only need to provide the path of the local folder.
Format board parameters
For s3, provide the bucket name and region:
board_params <- board_params_set_s3(
  bucket_name = "bucket_name",
  region = "us-east-1"
)
For a local board, provide the absolute path of the folder:
board_params <- board_params_set_local(
  folder = "/project_folder/subfolder"
)
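Since the local board expects an absolute path, a relative path can be resolved first with base R. The following is a minimal sketch: the relative path "." is only a placeholder, and the call to board_params_set_local is left commented out so the snippet runs without the package installed.

```r
# Placeholder relative path (an assumption for illustration);
# replace it with your own project folder.
relative_folder <- "."

# Resolve to an absolute path. mustWork = TRUE raises an error
# if the folder does not exist, which catches typos early.
absolute_folder <- normalizePath(relative_folder, winslash = "/", mustWork = TRUE)

# The resolved path can then be passed on, as in the example above:
# board_params <- board_params_set_local(folder = absolute_folder)
```

Resolving the path up front means a missing folder is reported before any connection is attempted.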
Format your credentials
This can be done with the creds_set_aws or creds_set_labkey functions. The creds_set function signatures indicate the parameters each one needs.
NOTE: For security reasons, avoid putting your credentials inside your script. It is highly recommended to establish a habit of using environment variables in conjunction with secret management packages like keyring. The example below follows this pattern, where the environment variables are set either with Sys.setenv("AWS_KEY" = "xxxx") in the console or, using keyring, with Sys.setenv("AWS_KEY" = keyring::key_get(service = "AWS_KEY")). Similarly, "AWS_SECRET" can be set as an environment variable.
The credentials function for AWS s3 takes two parameters, key and secret:
aws_creds <- creds_set_aws(
  key = Sys.getenv("AWS_KEY"),
  secret = Sys.getenv("AWS_SECRET")
)
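Sys.getenv returns an empty string when a variable is not set, so a missing credential may only surface later as an authentication failure. The following is a minimal sketch of a guard in base R; require_env is a hypothetical helper and not part of this package.

```r
# Hypothetical helper (not part of dpi): return the value of an
# environment variable, or stop with a clear message if it is
# missing or empty.
require_env <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) {
    stop("Environment variable '", name, "' is not set.", call. = FALSE)
  }
  value
}
```

It could be used in place of the bare Sys.getenv calls, e.g. key = require_env("AWS_KEY") and secret = require_env("AWS_SECRET"), so that an unset variable is reported before any connection is attempted.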
Connect and interact with data product
With the connection parameters established, interaction with the data product uses the same set of functions and the same format no matter where the data is housed.
Connect to the board
Then we can connect to the board.
board_object <- dp_connect(board_params = board_params, creds = aws_creds)
Work with the board
List the data products available on the board we just connected to. Note that the board object must be passed in order to list the available data products.
dp_list(board_object = board_object)
Retrieve data
Load data into the working environment, e.g. as the object dp. Specifying version is optional.
dp <- dp_get(
  board_object = board_object,
  data_name = "dp-studyid_branchid",
  version = "c5a51c3"
)