Docker and Packrat
- Combining docker and
packratallows for better control over package versioning in R.
- I made an example repo that shows how.
The rocker set of docker images has made it much easier to do reproducible data analysis with R. By using one of the rocker images, we can ensure that the computing software environment, R version, and package versions are always the same, no matter where the code is being run.
However, one tricky point with this setup is managing R package versions. The rocker project takes the following approach: images are tagged with the R version installed, e.g., 3.5.1, 3.5.2., etc. When you install packages to a tagged image, they are installed from MRAN, which keeps daily snapshots of CRAN going back to 2014. The package version installed is the one from the last day that version of R was most current. So, if you are using the
rocker/tidyverse:3.3.1 image and you install a package, it will be the package version from 2016-10-31 (they day before 3.3.2 was released). If you use the default
latest tag (or equivalently, the current R version, as of writing 3.5.3), the most current packages will be installed.
This is fine if you aren’t too concerned about specific package versions, and just need to keep things consistent: you could use a version-tagged image, and be done with it. But what if we want to run, e.g., R 3.5.3 with a specific package version? Simply using the
latest tag isn’t a good option, because this will change day-to-day. You could rebuild the image using the exact same
Dockerfile and end up with different package versions.
This is where
packrat comes in1. At first I was dubious of using
packrat with docker because I thought it was overkill – as described above, the rocker images almost take care of that for us. But it is perfect for when you need to have finer control over versioning.
Part 1: Take a 📸
packrat is a package for managing R packages. The basic idea is that instead of using the default location to install packages that are shared across all R code, each project gets its own private library of packages. That way, package versions are independent from project to project.
We will rely on just two
snapshot() writes a text file to
packrat/packrat.lock documenting all package versions, where they came from (i.e., which repository they were installed from), and the version of R in use.
restore() installs the packages exactly as listed in
The first step is to write the
packrat.lock file with
snapshot(). I will do this from the docker container so it tracks the correct version of R. You could use any rocker image, but I prefer the
rstudio image because it allows us to run RStudio server to edit code in the container.
docker run -it -e DISABLE_AUTH=true rocker/rstudio:3.5.3 R
Inside the container, install
install.packages("packrat", repos = "https://cran.rstudio.com/")
Switch on “packrat mode” with
packrat::init(). Now, packages will be installed to the project-specific library instead of the system-wide library.
packrat::init( infer.dependencies = FALSE, enter = TRUE, restart = FALSE)
We are using
packrat in a very bare-bones fashion and therefore changing several of the defaults, which bear explaining:
packratto try and find all of the packages being used in this project by scanning all of the R scripts and looking for package names in calls to
library()and such. My approach here (which I picked up from Miles McBain’s blogpost) is to avoid this behavior and be explicit about which packages to track. This allows us to have full control over versioning.
enter = TRUEtells packrat to that we want to enter “packrat mode” now, not after restarting R.
restart = FALSEtells packrat not to restart R after the call to
Next, specify which repository(ies) to use to download packages. You would need to add others here if you were also using e.g., Bioconductor. This allows
packrat to track package origins.
my_repos <- c(CRAN = "https://cran.rstudio.com/") options(repos = my_repos)
Install packages to your heart’s content. For example purposes, I’m just going to install one package with no dependencies from CRAN.
cran_packages <- c("glue") install.packages(cran_packages)
Once packages installations are complete (this can take a while if you have more than a handful of packages, especially if they have a lot of dependencies), save the current state with
packrat::snapshot( snapshot.sources = FALSE, ignore.stale = TRUE, infer.dependencies = FALSE)
Again, we are using several non-default settings for
snapshot.sources = FALSEtells
packratnot to save the entire source code of every package. We are fine with just installing from CRAN again when we run
ignore.stale = TRUE. This is a kind of weird setting where
packratwould give a warning if a package had been installed by
packratbut differed from the last snapshotted version. We don’t care about this, and just want to snapshot everything as it is.
infer.dependecies = FALSEas above.
You should now have a
packrat folder in your working directory, which contains a file called
packrat.lock with all package version information inside.
Note that there are bunch of other files in the
packrat folder that we don’t need to track. If you are using
git for version control, I recommend adding the following lines to your
Part 2: Build the image
Really the whole point of Part 1 was just to obtain the
packrat.lock file. Now we will put that to use to install packages to the docker image.
I hardly ever build docker images interactively. Instead, it is much better from a reproducibility standpoint to use
Dockerfile for an image using
packrat looks like this:
FROM rocker/rstudio:3.5.3 RUN apt-get update COPY ./packrat/packrat.lock packrat/ RUN install2.r packrat RUN Rscript -e 'packrat::restore()' # Modify Rprofile.site RUN echo '.libPaths("/packrat/lib/x86_64-pc-linux-gnu/3.5.3")' >> /usr/local/lib/R/etc/Rprofile.site
This should be saved as the
Dockerfile in the root of the project directory. So the project directory tree might look like:
├── code ├── data ├── Dockerfile ├── install_packages.R ├── packrat │ └── packrat.lock └── results
You would then build the image and run containers the usual way:
docker build . -t mycontainer docker run -it mycontainer R
Notice the last line of the
RUN echo '.libPaths("/packrat/lib/x86_64-pc-linux-gnu/3.5.3")' >> /usr/local/lib/R/etc/Rprofile.site. This is needed because for some reason Rstudio server doesn’t parse
Rprofile normally, but we need to tell R to use the
packrat library instead of the default library for loading packages. I found a workaround by just modifying the
Rprofile.site file used by Rstudio server.
I don’t recommend doing all of the above by hand. Rather, I have a script called
install_packages.R that I use to create the
packrat.lock file. I’ve posted this as an example repo that installs one packge each from CRAN, Bioconductor, and GitHub.
Note that you could use the tips from my last blog post to install private packages from GitHub without inadvertantly writing your credentials into the docker image.