R in Production

posit::conf(2024)

Author

Hadley Wickham (Primary), Ekaterina Khvatkova (Secondary)

Published

August 18, 2024

Modified

August 18, 2024

DISCLAIMER: The content of this document contains ideas, code, and text developed and distributed by Wickham during his R in Production workshop at the Posit 2024 Conference in Seattle, WA. Wickham has shared this content via his public GitHub repository (follow link for source) and during his workshop presentation (which occurred on August 12th, 2024 from 9:00 AM - 5:00 PM PDT). Wickham’s original content is licensed under a Creative Commons Attribution 4.0 International License. Thus, primary authorship of this document belongs to Wickham. Secondary authorship for Khvatkova, an attendee of this workshop, comes through the creation of this Quarto document (which in its essence is notes taken during and after the workshop), the elaboration of tutorial steps, and the synthesis of Wickham’s content relative to her work as a statistical geneticist in an academic research setting. All of the ideas disseminated by Wickham have been subject to Khvatkova’s interpretation, and this document has not been approved or endorsed by Wickham or Posit PBC.

§1: Introduction

For more notes on making a Quarto document see this Quarto Page Layout Example Guide

§1.1: What is R in Production?

  • Code is run on another machine (EX: Linux server)
  • Code is run repeatedly
  • Code (and data) is a shared responsibility

Ideally, if it’s your production job, someone else should be able to run it if you’re gone without asking you any questions. Also, if you know you’re eventually going to have to scale up production (EX: move from internal development to publishing research code and content on GitHub or to a journal), you should try to learn to develop code that

  1. Follows the best and most current practices in data science for creating projects within your discipline
  2. Allows for streamlined scale-up for publication

While statistical genetics often involves multilingual coding, the use of R and its intersection with data science means we should try to implement some of the techniques learned in this workshop to write better code more efficiently.

Note that not all of the code we develop is written with the intention of published scale-up (EX: developing code to learn how an R package works using a small-scale testing dataset is not the same thing as the code we would publish in a research publication). Also, not everything discussed in this workshop would be the right fit for academic research code and data production (some things are more oriented to business/marketing, and medical data might have security restrictions on storage and publication). However, there are still plenty of things we can learn from the data science community about how to be better at production.

§1.2: Batch and Interactive Jobs: Two Main Categories of Production Jobs

Batch Jobs: These are the things that run every day and often sequentially; they are run in the background, so they can take however long they need to run (EX: R scripts, Rmd, qmd)

Interactive Jobs: A human is interacting with it and expecting results in real time; someone is waiting for the result; more than one person can be interacting with it at once (EX: Shiny app, plumber API)

The verbs used to describe these processes are a little different: a batch job is executed (if it’s a script) or rendered (if it’s a document) in the production environment. It’s then published somewhere you can see the results. An interactive job is just deployed, and is automatically executed when someone uses it.

Production is important for everything; however, the details of deploying code vary tremendously based on the project, organization, and industry. This workshop mainly focused on the following two resources:

  • Github Actions (mostly focusing on this)
  • Posit Connect Cloud

§1.3: Publishing Tools vs. Production Tools

Publishing tools are different from production tools (publishing tools are hosts for your content).

Github Actions: “GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline.”

Source of Quote (Accessed 08/16/2024): GitHub Docs GitHub Actions Webpage

  • Free for public repos
  • GitHub Actions mainly supports batch jobs (not interactive jobs)
  • It’s a bit more manual (you have to tell it to install R packages)

Posit Connect Cloud

  • Only supports interactive jobs
  • It’s high-level: you basically tell it to run an R script and it pulls in all of the details for you

Commonalities

  • Your git repo has everything you need to run your job
  • Really, git is the best tool for collaboration

§1.4: Challenge in Production: Having a “Source of Truth”

One of the biggest challenges in collaborative development is establishing a clear “source of truth” in our code and data. For example, two people working on the same project on a shared network drive might develop code to do the same task and create similar output files; establishing which resulting dataset or script is the “true” version can become challenging and confusing. Someone might also notice and correct a mistake in another person’s code and produce a new dataset, and it can be hard to tell when and where the mistake was corrected.

To try to minimize duplicated work and communicate clear version control, you need to establish a place that you keep as your source of truth for every project and communicate the line of truth through comments and version control. A great way to do this is with a git repository (because you can see a working log of how everyone else has contributed to that “truth” before you commit changes). Otherwise, coming up with your own version control annotation schema that is standardized for a given project or organization is a worthwhile task that saves time in the long run.

Git is by far the most popular means of doing this; if you’re not already using it, you should consider it.

§2: Git and GitHub

§2.1: Turning an R Project into a GitHub Project

Step 1: Make a GitHub account

  • We don’t *need* a GitHub Pro account for this, but if you want to make a private repository, you need GitHub Pro
  • As of August 16th, 2024, individuals at academic institutions can get a GitHub Pro account for free

Step 2: Download git software to your computer

Step 3: Create your R Project

Using the RStudio or Positron IDE, you can create a project by selecting File > New Project > New R Project.

Alternatively, you can install the R package usethis and run the command create_project("~/…") in the R console with the full path to where you want to store your project:

# Install and load package
install.packages("usethis")
library(usethis)

# This is where I stored this project on my Mac computer
create_project("~/desktop/R-in-Production-Workshop-positconf2024") 

Step 4: Configure git on your computer to be linked to your GitHub account and your R project

In a terminal, run the following commands with your GitHub username and email:

git config --global user.name "Mona Lisa"
git config --global user.email "mona.lisa@gmail.com"

Then, you will need to log into your GitHub profile in a web browser and request a personal access token. For more instructions, you can use the command usethis::gh_token_help() or check out this link. Once you have copied your token, run the commands in your R console:

usethis::use_git()
usethis::create_github_token() # This opens up the web browser to get the token
gitcreds::gitcreds_set() # This prompts you to enter token in R console

Output:

> usethis::create_github_token()
 Call gitcreds::gitcreds_set() to register this token in the local Git
  credential store.
 It is also a great idea to store this token in any password-management
  software that you use.
 Opening URL
  <https://github.com/settings/tokens/new?scopes=repo,user,gist,workflow&description=DESCRIBE
  THE TOKEN’S USE CASE>.
> gitcreds::gitcreds_set()


? Enter password or token: ghp_RbBhrjov0LUP095ogHJTbMv1z7u5d5h0Ku7ZT
-> Adding new credentials…
-> Removing credentials from cache…
-> Done.

Step 5: Upload your project to GitHub

usethis::use_github(private = TRUE) # This command turns the R project on your local computer into a GitHub project accessible in your web browser; private = TRUE makes the repository private, the default is FALSE

Output:

> usethis::use_github(private=TRUE)
✔ Creating GitHub repository “my_github_user_name/R-in-Production-Workshop-positconf2024”.
✔ Setting remote “origin” to
  “my_github_user_name/R-in-Production-Workshop-positconf2024”.
✔ Pushing “main” branch to GitHub and setting “origin/main” as upstream
  branch.
✔ Opening URL <https://github.com/my_github_user_name/R-in-Production-Workshop-positconf2024>.

§2.2: Overview of the Basic Workflow

  • create_project("PATH_TO_FILE")
  • use_git()
  • use_github()

§3: .Rprofile Files, DESCRIPTION Files, and .json Manifests

§3.1: Setting up your .Rprofile

“.Rprofile files are user-controllable files to set options and environment variables.”

Source of Quote (Accessed 08/17/2024): Posit Support Guide “Managing R with .Rprofile, .Renviron, Rprofile.site, Renviron.site, rsession.conf, and repos.conf”

A .Rprofile can be created at the user or project level. You can use the command usethis::use_usethis() to get a suggested code chunk to put in your .Rprofile file (I believe it automatically opens your user-level profile; search online for how a project-level .Rprofile can be set up using usethis and source()). If you add the chunk it suggests to your .Rprofile, usethis is loaded in every interactive session.

usethis::use_usethis() # COMMAND TO RUN

Output:

usethis::use_usethis()
 Include this code in .Rprofile to make usethis available in all interactive
  sessions:
  if (interactive()) {
    suppressMessages(require(usethis))
  }
  [Copied to clipboard]
 Modify /Users/username/.Rprofile.
 Restart R for changes to take effect.

Another good command:

# Tell RStudio not to save/reload sessions
usethis::use_blank_slate()
# (Positron does this by default)

§3.2: Setting up your DESCRIPTION

“The DESCRIPTION file provides overall metadata about the package, such as the package name and which other packages it depends on.”

Source of Quote (Accessed 08/17/2024): R Package (2e) Metadata Webpage

This file type is a special thing about R: the DESCRIPTION file is important for reproducibility and record-keeping. Understanding how to read these files is a good skill (an example is sketched below, after the commands).

usethis::use_description()
use_package("rmarkdown") # Adds rmarkdown to Imports field in DESCRIPTION

§3.3: Manifest files

“Using manifest.json, you specify basic metadata about your extension such as the name and version, and can also specify aspects of your extension’s functionality (such as background scripts, content scripts, and browser actions)”

Source of Quote (Accessed 08/16/2024): MDN Web Docs Webpage

  • rsconnect::writeManifest() writes the manifest file (see the sketch below)
  • A .json file is similar to a YAML file
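A minimal sketch of writing a manifest (the document name is just an example); the resulting manifest.json records your R version and the exact package versions in use:

# Write manifest.json for the current project directory
rsconnect::writeManifest()

# Or point it at a specific document to deploy (file name is illustrative)
rsconnect::writeManifest(appPrimaryDoc = "report.qmd")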

§4: Publication

§4.1: Posit Connect Cloud

“Posit Connect Cloud is a new online platform to simplify the deployment of data applications and documents.”

Source of Quote (Accessed 08/18/2024): Connect Cloud User Guides: Posit Docs

“Posit Cloud lets you access Posit’s powerful set of data science tools right in your browser–no installation or complex configuration required.”

Source of Quote (Accessed 08/18/2024): Friction free data science - Posit Cloud

  • Right now, this is free and easy to set up using your GitHub account
  • The main use of this is publishing Shiny apps
  • With GitHub Actions, you have to set up a lot of your package installation (and version control) manually, but Posit Connect Cloud basically does all of this for you, with an easy UI experience
  • The downside is that it can be tricky to troubleshoot errors if you get them

§4.2: GitHub Pages

Here are some additional commands that can aid with website publication using GitHub Pages:

  • use_github_pages()
  • use_github_action(url = "https://github.com/posit-conf-2024/r-in-production/blob/main/render-rmd.yaml")

§5: Locating and Logging Errors

Common occurrence in data science: “The solution to your programming issue is not proportional to the amount of time you spend debugging it.”

In this section we will talk about tools to efficiently track and log errors.

§5.1: rlang Package

“A toolbox for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation.”

Source of Quote (Accessed 08/18/2024): CRAN Package: rlang

  • My understanding is that a lot of tidyverse functions already use rlang to report errors, while base R reports errors differently (you have probably noticed this variation when getting errors from different packages)
  • rlang::trace_back() prints the sequence of calls when you have nested function(){} definitions; this is better in some ways than base R’s traceback() because its output is more in line with the way R actually works (EX: it is hierarchical for nested functions)
  • rlang::global_entrace() gives you the same display of errors everywhere; Posit packages like dplyr use this style automatically
  • print(rlang::trace_back()) is a way to help you see exactly where you are in the call stack; adding a few of these throughout a script with nested function(){} definitions can help you find where an error occurred (see the sketch below)
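A minimal sketch, run interactively (the nested functions are made up just to trigger an error):

# Make base R errors use rlang's richer tracebacks everywhere
# (a good candidate for your .Rprofile)
rlang::global_entrace()

f <- function() g()
g <- function() h()
h <- function() stop("something went wrong")

f()                  # errors
rlang::last_trace()  # shows a hierarchical backtrace: f() -> g() -> h()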

§5.2: Logging to stderr()

“In programming, logging refers to the process of recording or storing information or messages that help in understanding the behavior and execution of a program. It is a way to track the flow of the program and identify any issues or bugs.”

“stdin(), stdout() and stderr() are standard connections corresponding to input, output and error on the console respectively (and not necessarily to file streams). They are text-mode connections of class “terminal” which cannot be opened or closed, and are read-only, write-only and write-only respectively.”

Source of Quote (Accessed 08/18/2024): R Documentation for Display Connections

Here is a great quote from Reddit user u/aioeu on the r/bash subreddit:

“Programs are run with three standard streams.

Standard input (file descriptor 0) is the stream from which the program normally reads input. Most of the time that is connected to your terminal, so yes, most of the time what you type will be received by the program over this stream. But you can feed other things into standard input. For instance, you can redirect it to a file (so the program reads from that file instead of the terminal), or you can connect a pipe to it (so the program reads the output of some other program).

Standard output (file descriptor 1) and standard error (file descriptor 2) are two streams over which the program can produce output. Most of the time they are both connected to your terminal, so most of the time the output will be displayed there, but again you can change this by redirecting them to a file or to a pipe. By convention, programs should produce warning and error messages on standard error, and 'all other output' on standard output. This means you can redirect standard output to a file, say, yet leave standard error connected to the terminal so you can continue to read the error messages produced by the program.

< and > are redirection operators for your shell. You use this to redirect standard input and standard output respectively.”

In R you can use commands like the following to write log messages into your output files:

# Writing to stderr() sends text to the standard error stream, which your
# deployment system captures in its log; use it to report how a process is going as it runs
cat("\n\n\nTEXT\n", file = stderr())
# sprintf() builds a formatted string that you can then pass to cat() for logging
cat(sprintf("Processed %d rows\n", 100), file = stderr())

Logging on a Linux HPC That Uses SLURM Workload Manager:

  • The bash command R CMD BATCH in a SLURM script creates an .Rout file which contains only the output from the commands run in the script.
  • R CMD BATCH also saves the workspace into an .RData file unless you end your R script with quit(save = "no")
  • The bash command Rscript will write to the standard output (.o) and standard error (.e) files
  • A .log file contains broader information about usage patterns, activities, and operations within the OS and applications, as well as the program itself (Source)

What are some things I should log?

  • Adding emojis can add color that helps you spot things, because .log files can get long; however, this might not work on Linux HPCs because they tend to have limited font libraries
  • The deployment system you use (GitHub Actions or Posit Connect Cloud) will capture the messages you write to stderr()
  • One nice thing to add is a “Done!” statement so you know when a task has finished executing (see the sketch after this list)
  • Suppressing messages you don’t need is worthwhile because it keeps them out of the log file
  • options(warn = 1) makes warnings appear immediately
  • options(warn = 2) turns warnings into errors
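A minimal sketch of what this can look like in a batch script (the messages are just examples):

# Surface warnings as soon as they happen
options(warn = 1)

# Timestamped progress messages written to standard error
cat(format(Sys.time()), "- starting data import\n", file = stderr())

# ... the actual work happens here ...

cat(format(Sys.time()), "- Done!\n", file = stderr())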

There are many libraries you can use for logging. These are just a few options:

library(logger):

  • log_threshold(WARN) sets the minimum level that gets logged (see the sketch below)
  • log_info()
  • log_warn()
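A minimal sketch of the logger workflow (the threshold and messages are illustrative):

library(logger)

# Only record messages at WARN level or above
log_threshold(WARN)

log_info("this is below the threshold, so it is not written")

n_dropped <- 3
log_warn("{n_dropped} rows were dropped during cleaning")  # glue-style interpolation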

library(debugme):

  • debugme::debugme() is better if you care about performance
  • You don’t want 90% of your run time to be spent printing log messages; it’s a trade-off, but used efficiently, logging can help you troubleshoot code you were developing at one point

library(logrx): Here is a tutorial on how to log using it

Additionally,

  • GitHub Actions has expandable sections in its logs
  • Most Posit products have a long, log-style display
  • You can use parseable variables in your logs

§6: Authentication

Authentication is important to ensure data and products remain secure. There are two main categories of authentication that often come up in data science and in academic work:

Encrypted Environment Variables: This essentially means storing randomized, personalized strings that work as “keys” to access information

Federated Authentication: Uses trusted identity providers

  • IT oftentimes develops these or institutions pay for providers
  • For example, Posit Connect has options for federated authentication

We will focus briefly on encrypted environment variables:

  • On GitHub, you can go to Settings > Secrets and Variables > Actions
  • There you can add secret Repository or Environment variables
  • WARNING: usethis::edit_r_environ(): you can run it on your own laptop, but be aware that someone who gets access to that file can use the stored secrets to get into your account, so it is NOT recommended for sensitive credentials
  • WARNING: Sys.setenv(): if you set secrets this way interactively, the call is saved to your .Rhistory file, which is easily accessible to other people
  • The goal is for these secret values to never appear in plain text in your code or history (see the sketch below)
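A minimal sketch of reading a secret at run time (the variable name MY_API_KEY is just an example); the value itself is set by the deployment system (EX: a GitHub Actions repository secret exposed as an environment variable), never hard-coded in the script:

# Read the secret from the environment; fail loudly if it is missing
api_key <- Sys.getenv("MY_API_KEY")
if (!nzchar(api_key)) {
  stop("MY_API_KEY is not set; check your repository secrets or environment configuration")
}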

Key Takeaway:

Understanding the structure of authentication at your institution can help you troubleshoot authentication errors efficiently. Oftentimes, academic institutions have multiple sectors of IT dedicated to different aspects of authentication - knowing who to contact about what issues is an important skill to develop.

§7: Running Code Repeatedly

The challenge is that you have to think about all the things that can go wrong and all the things that will change over time.

Some General Categories of What Could Change:

  • The data itself can change
  • The schema of the data can change (EX: changing coordinate system from 0-based to 1-based, reporting positive strands to any strand)
  • Variable names can change
  • The content of the variables can change
  • Units can change (EX: Fahrenheit to Celsius)
  • Packages might update or change
  • Chip architecture can change, OS can change, or python/R might change
  • Also, the “universe” changes; every model you fit is intended to describe the universe… and sometimes there are things that the data does not capture (EX: We might start to view things differently in genetics that affects the fundamentals of how we collect and analyze genetic data)

In this workshop, we focused on

  • Platform
  • Packages
  • Schema

§7.1: Platform

  • As a data scientist, you should know what a container is
  • You don’t need to know how to write them yourself, but the basic solution to platform changes is to use containers

§7.2: Packages

There are 4 mitigation strategies here:

  • Make sure you have the right versions
  • Capture versions as a part of deployment
  • Use a project-specific library
  • The fewer dependencies you have, the stronger your code is (but more dependencies are typically okay in a dev phase)

Posit Package Manager

Posit Package Manager currently has a free version and a paid version; this service provides ways to document and efficiently install packages.
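For example (the URL below is the public Posit Package Manager CRAN mirror; your organization may run its own instance), you can point your repos option at Package Manager:

# Install packages from Posit Public Package Manager instead of a default CRAN mirror
options(repos = c(CRAN = "https://packagemanager.posit.co/cran/latest"))
install.packages("dplyr")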

renv Package:

  • This is a good package; it gives each project its own location for R libraries. It’s typically good to revisit the locked versions every 6 months or so, because after a few years packages can change drastically (see the sketch after this list)
  • renv::renv_lockfile_from_manifest("manifest.json", "renv.lockfile")
  • renv::restore()
  • renv::update()
  • renv::deactivate()
  • renv::dependencies()
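A minimal sketch of a typical renv workflow inside a project:

# One-time setup: give the project its own library and create renv.lock
renv::init()

# After installing or updating packages, record the exact versions in use
renv::snapshot()

# On another machine (or in production), reinstall those exact versions
renv::restore()

# See which packages your scripts actually depend on
renv::dependencies()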

purrr Package:

  • There is a way, using purrr::set_names() and purrr::map(\(this_pkg) ...), to loop over all of the R packages that you use and check each one
  • A warning is like “you really need to fix this”, and it’s good to try to get rid of warnings

conflicted Package:

  • The conflicted package is useful when a function name exists in more than one loaded package: it removes the usual startup conflict messages, and when you try to use an ambiguous function it tells you which packages conflict and asks you to declare which one you want to use (see the sketch below)
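A minimal sketch (the filter() clash between dplyr and stats is a common example; conflicts_prefer() is available in recent versions of conflicted):

library(conflicted)
library(dplyr)

# filter() exists in both dplyr and stats; with conflicted loaded, calling it
# without a declared preference raises an error listing the conflicting packages.
# Declare which one you mean once per project:
conflicts_prefer(dplyr::filter)

mtcars |> filter(mpg > 30)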

pak Package:

“pak installs R packages from CRAN, Bioconductor, GitHub, URLs, git repositories, local files and directories. It is an alternative to install.packages() and devtools::install_github(). pak is fast, safe and convenient.”

Source of Quote (Accessed 08/17/2024): GitHub Webpage for pak Package
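For example:

# Install a package from CRAN
pak::pak("dplyr")

# Install a development version from a GitHub repository
pak::pak("tidyverse/dplyr")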

§7.3: Schema

  • Oftentimes this means you need to keep in communication with the people who make the data
  • This is hard but it can help a lot; also generally being accessible is a good habit in the sciences
  • Having code that checks to make sure that the format is correct is important
  • The pointblank package is great for this (see the sketch after this list)
  • pointblank::export_report() makes a report file that tells you a lot about your data
  • col_is_numeric() checks that a column is numeric
  • dplyr’s anti_join() tells you which rows don’t join; code was also shared with us that helps make sure joins behave as expected
  • Data quality reporting; we need to make reports on all of these “suspicious things”
  • Maybe something like production_reports/quality_reports
  • You should have ways in your code to report if the places you are pulling data from (EX: open-source databases like JASPAR or the UCSC Genome Browser) have changed
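A minimal sketch of a pointblank check (the toy data and column names are illustrative, not a real genetics schema):

library(pointblank)

# Toy data standing in for an incoming dataset
my_data <- data.frame(
  chromosome = c("chr1", "chr2", NA),
  position   = c(12345, 67890, 13579)
)

agent <- create_agent(tbl = my_data) |>
  col_is_numeric(columns = vars(position)) |>
  col_vals_not_null(columns = vars(chromosome)) |>
  interrogate()

agent  # prints the validation report; it flags the missing chromosome value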

§8: Sharing Code

§8.1: Git Repo or R Package?

Ask yourself questions like “Should this R project be a package, or should it be a git repo?” Really, most things should be a git repo, because there is a hierarchy of data science needs: from the base to the top, it is Find, Run, Understand, and Edit. Public R packages are very hard to edit. However, having internal R packages can be a helpful tool.

§8.2: Refactoring and Team-wide Conventions and Standards

  • As a team we should have a tagging schema; it is good to keep track of conventions that will change over time
  • Refactoring is an important thing: spending a lot of time making your code easier to read and run in the future
  • Doing code reviews is another good tool as a team (noting that not all code is necessarily developed to be perfect in production); for code you want refactored or publicized, this is a great option
  • Team style guide: this is for how we share data with each other; building consensus is hard, but decisions can be temporary just to move forward and get everyone aligned; also, writing these things down can help you remember the arguments for why you did it
  • ^This is really good to help establish code conventions and figure out how to make the code better (and see what other people’s questions are)
  • A lot of the code that we write is exploratory, and code reviews are not necessarily as valuable then; but when you’re considering putting code in production, reviews are especially worth doing
  • Sometimes, even just waiting a couple of days can help you be your own reviewer
  • This type of thing also helps with onboarding

Big piece of advice: The amount of time you spend on this should be proportional to the impact. If it’s going in a top journal, it should be a lot of time; but for some things we do, hardly anyone will use or see them (or maybe only one person knows how to do it).

Big piece of advice: If you find that it’s something that you’re going to share three or more times, a style guide is a good idea.

There are also some good file-formatting ideas, like using the parquet file format, or labelling your columns with attributes in all of your projects, or having CSVs with the first row as column names and the second row as column descriptions (with the data beneath), so that column names and descriptions are not kept in separate variable codebooks. A sketch of the CSV idea follows.
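A minimal sketch of the two-header-row CSV idea (the columns and descriptions are made up); for larger data, the arrow package’s parquet format is an alternative that preserves column types:

# Column names and their descriptions kept with the data itself
col_names <- c("sample_id", "age", "genotype")
col_descs <- c("De-identified sample ID", "Age in years", "Genotype at locus of interest")

df <- data.frame(sample_id = c("S1", "S2"), age = c(34, 57), genotype = c("AA", "AG"))

# Row 1: column names; row 2: column descriptions; data beneath
writeLines(c(paste(col_names, collapse = ","),
             paste(col_descs, collapse = ",")), "data.csv")
write.table(df, "data.csv", sep = ",", append = TRUE,
            col.names = FALSE, row.names = FALSE)

# Or: arrow::write_parquet(df, "data.parquet")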

§9: Final R Coding Tips and Factoids

These are some final comments and tips that I have noted but haven’t included in other sections.

  • When you create an R project, an R/ directory will automatically be made. When you have repeated functions, you should put those into scripts that go in the R/ directory, and scripts that actually execute tasks should live in separate files; the “Rule of Three” for making functions applies here: if you copy and paste a chunk of code more than three times within a project, it should probably be its own function (see the sketch at the end of this list)
  • Make sure your deployed code links to your product and vice versa; it needs to be clear how everything connects together (not just to you, but to anyone who will be using your code or seeing your products)
  • An R package can be thought of like a book in a library; when you use the library() function in R to load a package, it can be thought of like checking a book out of the library
  • One library for each project can be a good idea for certain projects if storage allows
  • Batching and time boxing are valuable techniques to keep progress moving: in other words, instead of saying “I’m going to fix this”, you can say “I am going to give this two weeks”
  • Keeping a log of things you are going to or want to fix can be a good way for you to remember ways to improve code; but at the same time, there might be things that a year down the line you realize “oh I didn’t fix that and it clearly seems like I don’t need to”
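A minimal sketch of the layout described in the first bullet above (the file names, paths, and function are made up):

# R/clean_data.R -- reusable functions live under R/
clean_data <- function(df) {
  df[!is.na(df$value), ]
}

# run_analysis.R -- a script that actually executes the task
source("R/clean_data.R")
raw   <- read.csv("data/raw.csv")              # input path is illustrative
clean <- clean_data(raw)
write.csv(clean, "output/clean.csv", row.names = FALSE)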