12 Package data

Some of the data we work with may be very large. If the final datasets we want to produce are too large, we can’t include them directly in the package, because packages are subject to size limits and recommendations. In that case, we export a function that generates the dataset instead. The package itself then doesn’t contain the output dataset; users generate it by running the function, so it is stored on their own computer. An example of this is our function get_wide_cbs(). Unless the function builds the dataset from scratch, it will probably also read a very large input dataset and process it. For reading large datasets, see the Reading large files section.
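As a rough illustration of the "export a function instead of the data" idea, here is a minimal sketch. The function name get_my_large_dataset(), the cache_dir argument, and the caching step are all hypothetical and only meant to show the pattern; this is not how get_wide_cbs() is implemented.

```r
# Hypothetical exported function: builds the dataset on the user's machine
# instead of shipping it inside the package.
get_my_large_dataset <- function(cache_dir = tempdir()) {
  cache_file <- file.path(cache_dir, "my_large_dataset.rds")

  # Reuse the result of a previous call if it is already on disk.
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }

  # Placeholder for the real work (reading a large input, reshaping, etc.).
  result <- data.frame(
    id = seq_len(1000),
    value = sqrt(seq_len(1000))
  )

  # Store the generated dataset locally so it is not rebuilt every time.
  saveRDS(result, cache_file)
  result
}

big <- get_my_large_dataset()
nrow(big)
```

The package then only ships the code, and the generated file lives wherever the user points cache_dir.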

From now on, in this section we assume that you have some small datasets (maybe up to a couple of megabytes) as inputs. We will go through both public and private ones. The public ones are those that you want to export to users of your package, and the private ones are only intended for being used in your own code.

Let’s start with the exported datasets. The whole aim of this is to allow users (or even our own code) to access them easily by writing my_pkg::my_dataset. In order to achieve this, you must follow these steps:

Create a file in data-raw/my_dataset.R. Scripts inside the data-raw folder aren’t included as part of the package. They will just be helpers for us to generate the actual data. Inside this file we do some processing and, assuming we have a variable called my_dataset, we end the script by calling usethis::use_data(my_dataset). This will automatically create an .rda file in data/my_dataset.rda. We have to manually run this script. After that, we can now refer to the dataset as my_pkg::my_dataset.

You can directly create your data in the script data-raw/my_dataset.R, or you can make it rather short by just importing some data from another raw file. In this case, I recommend having the raw file as a CSV in the inst/extdata folder, say, inst/extdata/my_raw_dataset.csv. This is for accessibility, so that everyone can see where this data comes from regardless of whether they know how to read an .rda file or not. A data-raw/my_dataset.R script could then look like:

my_dataset <- here::here("inst", "extdata", "my_raw_dataset.csv") |>
  readr::read_csv()

usethis::use_data(my_dataset, overwrite = TRUE)

Every time you introduce some change in the raw CSV file, you would have to run this script again. The overwrite = TRUE is exactly for this purpose, so that the my_dataset.rda file is overwritten with the updated data.
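If you are curious about what ends up in the .rda file, the following base R sketch mimics the round trip from a raw CSV to an .rda file and back. It uses temporary files as stand-ins for inst/extdata/my_raw_dataset.csv and data/my_dataset.rda, and plain save() as an approximation of what usethis::use_data() does for us, so treat it as an illustration rather than as the exact internals.

```r
# Stand-in for inst/extdata/my_raw_dataset.csv, written to a temp folder.
csv_path <- file.path(tempdir(), "my_raw_dataset.csv")
writeLines(c("column_1,column_2", "a,1", "b,2"), csv_path)

# Stand-in for the data-raw script: read the raw CSV into `my_dataset`.
my_dataset <- utils::read.csv(csv_path)

# Approximation of usethis::use_data(my_dataset): one object in one .rda file.
rda_path <- file.path(tempdir(), "my_dataset.rda")
save(my_dataset, file = rda_path)

# load() restores the object under its original name and returns that name.
restored <- new.env()
loaded_names <- load(rda_path, envir = restored)
loaded_names
identical(restored$my_dataset, my_dataset)
```

Note that the name stored in the file is the name of the variable, which is why the variable passed to usethis::use_data() determines how the dataset is later accessed.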

Document your dataset. In the previous section we learned how to document functions. Datasets aren’t functions, but they’re documented very similarly. We start by creating a file R/my_dataset.R. Note that its name matches that of data-raw/my_dataset.R. Neither file name has to match the name of the variable passed to usethis::use_data(), but the two file names should match each other so that the script and its documentation are easy to pair up. You can document more than one dataset in the same file if you think they’re related, in which case you can give the file a more general name. This is how you would document your dataset, again using roxygen2 comments:

#' Title of my dataset
#'
#' My description of my dataset
#'
#' @format
#' What my dataset is. I would ideally make it a tibble and explain all
#' columns. My dataset contains the following columns:
#' - `column_1`: My explanation of column 1.
#' - `column_2`: My explanation of column 2.
#' - `column_3`: My explanation of column 3.
#'
#' @source Where my data comes from. Maybe an external link if you have one.
"my_dataset"

As you can see, we use roxygen2 style comments right before a line containing a string with the name of our dataset, in this case "my_dataset". This string must be the name of the dataset object itself. Your dataset will be correctly documented after running devtools::document() and pkgdown::build_site()/pkgdown::build_reference().
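Writing one bullet per column by hand gets tedious for wide datasets. The following sketch prints a skeleton for the column list that you can paste under @format and then fill in. The helper roxygen_column_bullets() is hypothetical (it is not part of roxygen2 or usethis), and example_data is just a stand-in for your own dataset.

```r
# Hypothetical helper: build one roxygen bullet per column of a data frame,
# including the column's class as a starting point for the description.
roxygen_column_bullets <- function(data) {
  column_classes <- vapply(data, function(col) class(col)[1], character(1))
  paste0("#' - `", names(data), "`: TODO describe (", column_classes, ").")
}

# Stand-in for the real dataset.
example_data <- data.frame(
  column_1 = 1:2,
  column_2 = c("a", "b")
)

bullets <- roxygen_column_bullets(example_data)
writeLines(bullets)
```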

Now we should talk about internal data. This is data that only the developers of the package use throughout the code. It can be either actual tibble datasets or just bare constants. Any value that doesn’t change and that you would like to share throughout the whole package code qualifies. Creating internal data is quite similar to creating exported data:

Create a file data-raw/constants.R if it doesn’t already exist. All internal data should be defined in this one file. The file could look like this:

my_constant_number <- 0.65
my_constant_name <- "name"
my_constant_tibble <- tibble::tribble(
  ~col_1, ~col_2,
  1,      2,
  3,      4
)

usethis::use_data(
  my_constant_number,
  my_constant_name,
  my_constant_tibble,
  internal = TRUE,
  overwrite = TRUE
)

As you can see, you can pass more than one variable to usethis::use_data(). We should include all our constants in the same call to this function, because all internal data is written to a single file that is replaced on each call. In addition, the call must include the internal = TRUE option to mark the data as internal.

Manually run the previous file. This will create a single file, R/sysdata.rda, which contains all your internal data. Inside the package’s code you can now refer to these objects directly by their bare names, e.g., my_constant_tibble or my_constant_number. Unlike exported datasets, they are not exported to the package users, so my_pkg::my_constant_tibble will not work for them. Again, any time you want to add new internal data or modify the existing entries, you will have to manually run that script again.
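To make the mechanics more concrete, here is a small base R sketch of the same idea outside a package. A temporary file stands in for R/sysdata.rda, plain save() approximates what usethis::use_data(..., internal = TRUE) does, and scale_value() is a hypothetical package function that uses a constant by its bare name.

```r
# Constants as they would be defined in data-raw/constants.R.
my_constant_number <- 0.65
my_constant_name <- "name"

# Stand-in for R/sysdata.rda: several objects stored in one single file.
sysdata_path <- file.path(tempdir(), "sysdata.rda")
save(my_constant_number, my_constant_name, file = sysdata_path)

# Loading the file brings back every object under its original name.
internal <- new.env()
load(sysdata_path, envir = internal)
internal_names <- sort(ls(internal))
internal_names

# Hypothetical package function: internal data is used by its bare name,
# with no my_pkg:: prefix.
scale_value <- function(x) {
  x * my_constant_number
}

scale_value(100)
```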

Whether some data is worth exporting as part of the package or should just be used internally is your decision, but now you know how to implement both. This section was heavily inspired by the Data chapter in the R Packages book, which I recommend reading if you want to dive deeper.