```r
# Install and load package
install.packages("usethis")
library(usethis)

# This is where I stored this project on my Mac computer
create_project("~/desktop/R-in-Production-Workshop-positconf2024")
```
R in Production
posit::conf(2024)
DISCLAIMER: The content of this document contains ideas, code, and text developed and distributed by Wickham during his R in Production workshop at the Posit 2024 Conference in Seattle, WA. Wickham has shared this content via his public GitHub repository (follow link for source) and during his workshop presentation (which occurred on August 12th, 2024 from 9:00 AM - 5:00 PM PDT). Wickham’s original content is licensed under a Creative Commons Attribution 4.0 International License. Thus, primary authorship of this document belongs to Wickham. Secondary authorship for Khvatkova, an attendee of this workshop, comes through the creation of this Quarto document (which in its essence is notes taken during and after the workshop), the elaboration of tutorial steps, and the synthesis of Wickham’s content relative to her work as a statistical geneticist in an academic research setting. All of the ideas disseminated by Wickham have been subject to Khvatkova’s interpretation, and this document has not been approved or endorsed by Wickham or Posit PBC.
§1: Introduction
For more notes on making a Quarto document see this Quarto Page Layout Example Guide
§1.1: What is R in Production?
- Code is run on another machine (EX: Linux server)
- Code is run repeatedly
- Code (and data) is a shared responsibility
Ideally, if it’s your production job, someone else should be able to run it in your absence without asking you any questions. Also, if you know you’re eventually going to have to scale up production (EX: move from internal development to publishing research code and content on GitHub or to a journal), you should try to learn to develop code that
- Follows the best and most current practices in data science for creating projects within your discipline
- Allows for streamlined scale-up for publication
While statistical genetics often involves multilingual coding, the use of R and its intersection with data science means we should try to implement some of the techniques learned in this workshop to write better code more efficiently.
Note that not all of the code we develop is written with the intention of published scale-up (EX: developing code to learn how an R package works using a small-scale testing dataset is not the same thing as the code we would publish in a research publication). Also, not everything discussed in this workshop would be the right fit for academic research code and data production (some things are more oriented to business/marketing, and medical data might have security restrictions on storage and publication). However, there are still plenty of things we can learn from the data science community about how to be better at production.
§1.2: Batch and Interactive Jobs: Two Main Categories of Production Jobs
Batch Jobs: These are the things that run every day and often sequentially; they are run in the background, so they can take however long they need to run (EX: R scripts, Rmd, qmd)
Interactive Jobs: A human is interacting with it and expecting results in real time; someone is waiting for the result; more than one person can be interacting with it at once (EX: Shiny app, plumber API)
The verbs used to describe these two processes differ a little: A batch job is executed (if it’s a script) or rendered (if it’s a document) in the production environment. It’s then published to somewhere where you can see the results. An interactive job is just deployed, and is automatically executed when someone uses it.
Production is important for everything; however, the details of deploying code vary tremendously based on the project, organization, and industry. This workshop mainly focused on the following two resources:
- GitHub Actions (mostly focusing on this)
- Posit Connect Cloud
§1.3: Publishing Tools vs. Production Tools
Publishing tools are different from production tools (publishing tools are hosts for your content).
Github Actions: “GitHub Actions is a continuous integration and continuous delivery (CI/CD) platform that allows you to automate your build, test, and deployment pipeline.”
Source of Quote (Accessed 08/16/2024): GitHub Docs GitHub Actions Webpage
- Free for public repos
- GitHub Actions mainly supports batch jobs (not interactive jobs)
- It’s a bit more abstracted (you have to tell it to download R packages)
Posit Connect Cloud
- Only supports interactive jobs
- It’s high level: you basically tell it to run an R script and it will handle all of the details for you
Commonalities
- Your git repo has everything you need to run your job
- Really, git is the best thing for collaboration
§1.4: Challenge in Production: Having a “Source of Truth”
One of the biggest challenges in collaborative development is establishing a clear “source of truth” in our code and data. For example, two people working on the same project on a shared network drive might develop code to do the same task, and create similar output files; however, establishing which resulting dataset or script is the “true” version can become challenging and confusing. Someone might also notice and correct a mistake in someone’s code and produce a dataset, and it can be hard for an individual to tell when and where their mistake was corrected.
To minimize duplicated work and keep version control clear, you need to establish one place that you keep as your source of truth for every project, and communicate changes to that truth through comments and version control. A great way to do this is with a git repository (because you can see a working log of how everyone else has contributed to that “truth” before you commit changes). Otherwise, coming up with your own standardized version-control annotation schema for a given project or organization is a worthwhile task that saves time in the long run.
Git is by far the most popular means of doing this; if you’re not already using it, you should consider it.
§2: Git and GitHub
§2.1: Turning an R Project into a GitHub Project
Step 1: Make a GitHub account
- We don’t *need* a GitHub Pro account for this; but if you want to make a private repository, you need GitHub Pro
- As of August 16th, 2024, individuals at academic institutions can get a GitHub Pro account for free
Step 2: Download git software to your computer
- Git is automatically installed on Mac computers
- Download via link for git on Windows OS
Step 3: Create your R Project
Using the RStudio or Positron IDE, you can create a project by selecting File > New Project > New R Project.
Alternatively, you can download the R library `usethis` and run the command `create_project("~/…")` in the R console with the full path to where you want to store your project.
Step 4: Configure git on your computer to be linked to your GitHub account and your R project
In a terminal, run the following commands with your GitHub username and user email:

```shell
git config --global user.name "Mona Lisa"
git config --global user.email "mona.lisa@gmail.com"
```
Then, you will need to log into your GitHub profile in a web browser and request a personal access token. For more instructions, you can use the command `usethis::gh_token_help()` or check out this link. Once you have copied your token, run these commands in your R console:
```r
usethis::use_git()
usethis::create_github_token()  # This opens up the web browser to get the token
gitcreds::gitcreds_set()        # This prompts you to enter the token in the R console
```
Output:
```
> usethis::create_github_token()
☐ Call gitcreds::gitcreds_set() to register this token in the local Git
  credential store.
ℹ It is also a great idea to store this token in any password-management
  software that you use.
✔ Opening URL
  <https://github.com/settings/tokens/new?scopes=repo,user,gist,workflow&description=DESCRIBE THE TOKEN’S USE CASE>.
> gitcreds::gitcreds_set()
? Enter password or token: ghp_RbBhrjov0LUP095ogHJTbMv1z7u5d5h0Ku7ZT
-> Adding new credentials…
-> Removing credentials from cache…
-> Done.
```
Step 5: Upload your project to GitHub
```r
# This command turns the R project on your local computer into a GitHub
# project accessible in your web browser; private = TRUE makes the
# repository private (the default is FALSE)
usethis::use_github(private = TRUE)
```

Output:

```
> usethis::use_github(private=TRUE)
✔ Creating GitHub repository “my_github_user_name/R-in-Production-Workshop-positconf2024”.
✔ Setting remote “origin” to “my_github_user_name/R-in-Production-Workshop-positconf2024”.
✔ Pushing “main” branch to GitHub and setting “origin/main” as upstream branch.
✔ Opening URL <https://github.com/my_github_user_name/R-in-Production-Workshop-positconf2024>.
```
§2.2: Overview of the Basic Workflow
```r
create_project("PATH_TO_FILE")
use_git()
use_github()
```
§3: .Rprofile Files, DESCRIPTION Files, and .json Manifests
§3.1: Setting up your .Rprofile
“.Rprofile files are user-controllable files to set options and environment variables.”
Source of Quote (Accessed 08/17/2024): Posit Support Guide “Managing R with .Rprofile, .Renviron, Rprofile.site, Renviron.site, rsession.conf, and repos.conf”
.Rprofile can be created at the user or project level. You can use the command `usethis::use_usethis()` to get a suggested code chunk to put in your .Rprofile file (I believe it automatically opens your user-level profile, but search online for how a project-level .Rprofile can be accessed using `source()` and `usethis`). If you add the chunk it suggests to your .Rprofile, `usethis` is loaded each time an interactive session starts.

```r
usethis::use_usethis()  # COMMAND TO RUN
```
Output:
```
> usethis::use_usethis()
☐ Include this code in .Rprofile to make usethis available in all interactive
  sessions:
  if (interactive()) {
    suppressMessages(require(usethis))
  }
  [Copied to clipboard]
☐ Modify /Users/username/.Rprofile.
☐ Restart R for changes to take effect.
```
Another good command:

```r
# Tell RStudio not to save/reload sessions
# (Positron does this by default)
usethis::use_blank_slate()
```
§3.2: Setting up your DESCRIPTION
“The DESCRIPTION file provides overall metadata about the package, such as the package name and which other packages it depends on.”
Source of Quote (Accessed 08/17/2024): R Package (2e) Metadata Webpage
This file type is a special feature of R - the DESCRIPTION file is important for reproducibility and record-keeping. Understanding how to read these files is a good skill.
```r
usethis::use_description()
usethis::use_package("rmarkdown")  # Adds rmarkdown to Imports field in DESCRIPTION
```
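For reference, here is a minimal sketch of what a project-level DESCRIPTION might contain (the field values below are illustrative placeholders, not from the workshop):

```
Package: myproject
Title: What the Project Does (One Line)
Version: 0.0.0.9000
Imports:
    rmarkdown
```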
§3.3: Manifest files
“Using manifest.json, you specify basic metadata about your extension such as the name and version, and can also specify aspects of your extension’s functionality (such as background scripts, content scripts, and browser actions)”
Source of Quote (Accessed 08/16/2024): MDN Web Docs Webpage
`rsconnect::writeManifest()` writes the manifest file; a .json file is like a YAML file.
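For a rough sense of shape, here is a heavily trimmed, illustrative sketch of a manifest.json (the real file written by rsconnect records far more detail, and exact field names may differ):

```json
{
  "version": 1,
  "platform": "4.4.0",
  "metadata": { "appmode": "rmd-static" },
  "packages": {
    "rmarkdown": { "Source": "CRAN", "Version": "2.27" }
  }
}
```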
§4: Publication
§4.1: Posit Connect Cloud
“Posit Connect Cloud is a new online platform to simplify the deployment of data applications and documents.”
Source of Quote (Accessed 08/18/2024): Connect Cloud User Guides: Posit Docs
“Posit Cloud lets you access Posit’s powerful set of data science tools right in your browser–no installation or complex configuration required.”
Source of Quote (Accessed 08/18/2024): Friction free data science - Posit Cloud
- Right now, this is free and easy to set up using your GitHub account
- The main use of this is for publishing Shiny apps
- With GitHub Actions, you have to set up a lot of your package downloading (and version control) manually, but Posit Connect Cloud basically does all of this for you, with an easy UI experience
- The downside is that it can be tricky to troubleshoot errors if you get them
§4.2: GitHub Pages
Here are some additional commands that can aid with website publication using GitHub Pages:
```r
use_github_pages()
use_github_action(url = "https://github.com/posit-conf-2024/r-in-production/blob/main/render-rmd.yaml")
```
§5: Locating and Logging Errors
Common occurrence in data science: “The solution to your programming issue is not proportional to the amount of time you spend debugging it.”
In this section we will talk about tools to efficiently track and log errors.
§5.1: The rlang Package
“A toolbox for working with base types, core R features like the condition system, and core ‘Tidyverse’ features like tidy evaluation.”
Source of Quote (Accessed 08/18/2024): CRAN Package: rlang
- My understanding is that a lot of tidyverse functions use `rlang` to report error sequences already, and that base R reports errors differently (you have probably noticed this variation from error to error when using different packages)
- `rlang::trace_back()` prints an error sequence when you have nested `function(){}` calls; this sequence is better in some ways than `traceback()` in base R because it prints something more in line with the way R works (EX: a hierarchical display for nested functions)
- `rlang::global_entrace()` gives you the same display of errors everywhere; Posit packages like `dplyr` automatically use this
- `print(rlang::trace_back())` is a way to help you see exactly where an error occurred (adding a few of these throughout a script can help you find it)
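As a base-R illustration of the condition system that `rlang` builds on (a minimal sketch; the functions `f()` and `g()` are hypothetical), you can capture an error object and inspect the call that signaled it:

```r
# Two nested functions: f() calls g(), and g() signals an error
f <- function() g()
g <- function() stop("boom")

# Capture the condition object instead of letting it abort the session
err <- tryCatch(f(), error = function(e) e)

conditionMessage(err)        # the error message: "boom"
deparse(conditionCall(err))  # the call that signaled it: "g()"
```

This is the raw machinery; `rlang::trace_back()` layers a full hierarchical backtrace on top of it.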
§5.2: Logging to `stderr()`
“In programming, logging refers to the process of recording or storing information or messages that help in understanding the behavior and execution of a program. It is a way to track the flow of the program and identify any issues or bugs.”
Source of Quote (Accessed 08/18/2024): [JavaScript] - What does it mean to log something in - SheCodes
“`stdin()`, `stdout()` and `stderr()` are standard connections corresponding to input, output and error on the console respectively (and not necessarily to file streams). They are text-mode connections of class "terminal" which cannot be opened or closed, and are read-only, write-only and write-only respectively.”
Source of Quote (Accessed 08/18/2024): R Documentation for Display Connections
Here is a great quote from Reddit user u/aioeu on the r/bash subreddit:
“Programs are run with three standard streams.
Standard input (file descriptor 0) is the stream from which the program normally reads input. Most of the time that is connected to your terminal, so yes, most of the time what you type will be received over this program over this stream. But you can feed other things into standard input. For instance, you can redirect it to a file (so the program reads from that file instead of the terminal), or you can connect a pipe to it (so the program reads the output of some other program).
Standard output (file descriptor 1) and standard error (file descriptor 2) are two streams over which the program can produce output. Most of the time they are both connected to your terminal, so most of the time the output will be displayed there, but again you can change this by redirecting them to a file or to a pipe. By convention, programs should produce warning and error messages on standard error, and 'all other output' on standard output. This means you can redirect standard output to a file, say, yet leave standard error connected to the terminal so you can continue to read the error messages produced by the program.
`<` and `>` are redirection operators for your shell. You use these to redirect standard input and standard output respectively.”
In R you can use the following commands to log to your output files:

```r
# If you add this, it will put the text in an “out” file; you can use it to
# add text that tells you how some process is going as it's going
cat("\n\n\nTEXT\n", file = stderr())

# sprintf() is another function you can use with cat() to format log messages
```
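The `cat(file = stderr())` pattern can be wrapped into a tiny hand-rolled logger (a hypothetical helper, not from the workshop) that timestamps each message and writes it to the error stream, so it doesn't mix with the program's regular output:

```r
# Minimal logging helper: timestamp + message, written to stderr
log_msg <- function(...) {
  cat(format(Sys.time(), "[%Y-%m-%d %H:%M:%S] "), ..., "\n",
      sep = "", file = stderr())
}

log_msg("Starting analysis")
# ... long-running work here ...
log_msg("Done!")
```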
Logging on a Linux HPC That Uses SLURM Workload Manager:

- The bash command `R CMD BATCH` in a SLURM script creates an .Rout file which only has the output from the commands run in the script; `R CMD BATCH` saves history into an .RData file unless you specify `quit(save = "no")` at the end of your R script
- The bash command `Rscript` will create standard output (.o) and standard error (.e) files
- A .log file contains more info about usage patterns, activities, and operations within an OS and application as well as the program (Source)
What are some things I should log?

- Adding emojis can help add color to spot things, because .log files can get long; however, this might not be possible on Linux HPCs because they tend to have limited font libraries
- The deployment system you use (GitHub Actions or Posit Connect Cloud) will capture your comments when you use `stderr()`
- One nice thing to add is a “Done!” statement to know when a task has finished executing
- Messages are worth removing, because that keeps them out of the log file
- `options(warn = 1)` makes warnings appear immediately
- `options(warn = 2)` turns warnings into errors
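A quick sketch of what `options(warn = 2)` does in practice (the function `noisy()` is hypothetical):

```r
noisy <- function() {
  warning("something looks off")
  "finished anyway"
}

# Default behavior: the warning is raised but the function still returns
# (muffled here to keep the example quiet)
res1 <- withCallingHandlers(noisy(),
                            warning = function(w) invokeRestart("muffleWarning"))

# With warn = 2, the warning is promoted to an error and execution stops
options(warn = 2)
res2 <- tryCatch(noisy(), error = function(e) "stopped at the warning")
options(warn = 0)  # restore the default
```

Promoting warnings to errors is useful in production scripts where you would rather fail loudly than silently continue on suspect data.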
There are many libraries you can use for logging. These are just a few options:

- `library(logger)`: `log_threshold(WARN)`, `log_info()`, `log_warn()`
- `library(debugme)`: `debugme::debugme()` is better if you care about performance; you don’t want 90% of your run time to be spent printing logging. It’s a trade-off, but if you use it efficiently it can help you troubleshoot code that you were developing at one point
- `library(logrx)`: Here is a tutorial on how to log using it
Additionally,
- GitHub Actions has expanders
- Most Posit products have a long, log-type display
- You can use parseable variables in logs
§6: Authentication
Authentication is important to ensure data and products remain secure. There are two main categories of authentication that often come up in data science and work in academia:
Encrypted Environment Variables: This essentially means storing randomized, personalized strings that work as “keys” to access information
Federated Authentication: Uses trusted identity providers
- IT oftentimes develops these or institutions pay for providers
- For example, Posit Connect has options for federated authentication
We will focus briefly on encrypted environment variables:
- On GitHub, you can go to Settings > Secrets and Variables > Actions
- There you can add secret Repository or Environment variables
- WARNING: `usethis::edit_r_environ()`: you can run it on your own laptop, but be aware that someone could use this to get into your account, so it is NOT recommended
- WARNING: `Sys.setenv()`: it will save to your .Rhistory file, which is easily accessible by other people
- The goal is for these secrets to never appear in `Sys.getenv()`
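A minimal sketch of using an environment variable as a secret (the helper `get_secret()` and the variable name `MY_API_KEY` are hypothetical): the key is read at runtime and never written into the script itself.

```r
# Fetch a secret from the environment; Sys.getenv() returns "" for unset
# variables, so nzchar() catches the missing-configuration case early
get_secret <- function(name) {
  val <- Sys.getenv(name)
  if (!nzchar(val)) {
    stop(sprintf("Environment variable '%s' is not set", name))
  }
  val
}

# In production this would be configured as a repository/environment secret:
# api_key <- get_secret("MY_API_KEY")
```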
Key Takeaway:
Understanding the structure of authentication at your institution can help you troubleshoot authentication errors efficiently. Oftentimes, academic institutions have multiple sectors of IT dedicated to different aspects of authentication - knowing who to contact about what issues is an important skill to develop.
§7: Running Code Repeatedly
The challenge is that you have to think about all the things that can go wrong and all the things that will change over time.
Some General Categories of What Could Change:
- The data itself can change
- The schema of the data can change (EX: changing coordinate system from 0-based to 1-based, reporting positive strands to any strand)
- Variable names can change
- The content of the variables can change
- Units can change (EX: Fahrenheit to Celsius)
- Packages might update or change
- Chip architecture can change, the OS can change, or Python/R might change
- Also, the “universe” changes; every model you fit is intended to describe the universe… and sometimes there are things that the data does not capture (EX: We might start to view things differently in genetics that affects the fundamentals of how we collect and analyze genetic data)
In this workshop, we focused on
- Platform
- Packages
- Schema
§7.1: Platform
- As a data scientist, you should know what a container is
- You don’t need to know how to write them yourself but the basic solution to this is to use containers
§7.2: Packages
There are 4 mitigation strategies here:
- Make sure you have the right versions
- Capture versions as a part of deployment
- Using a project specific library
- The fewer dependencies you have, the stronger your code is (but more dependencies are typically okay in a dev phase)
Posit Package Manager
Currently has a free version and a non-free version; this service allows for ways to document and efficiently load packages.
The `renv` Package:

- This is a good package; it can help you keep a separate location for R libraries for each project. It’s typically good to check these things every 6 months or so, because after a few years packages can change drastically

```r
renv::renv_lockfile_from_manifest("manifest.json", "renv.lockfile")
renv::restore()
renv::update()
renv::deactivate()
renv::dependencies()
```
The `purrr` Package:

- There is a way using `purrr::set_names()` and `purrr::map(\(this_pkg) ...)` to check all of the R packages that you use
- A warning is like “you really need to fix this”, and it’s good to try to get rid of it
The `conflicted` Package:

The `conflicted` package is useful when a function exists in two different packages: it gets rid of the startup messages about conflicting packages, but when you try to use one of the conflicted functions it tells you where the conflict is and asks which one you want to use
The `pak` Package:

“`pak` installs R packages from CRAN, Bioconductor, GitHub, URLs, git repositories, local files and directories. It is an alternative to `install.packages()` and `devtools::install_github()`. pak is fast, safe and convenient.”

Source of Quote (Accessed 08/17/2024): GitHub Webpage for pak Package
§7.3: Schema
- Oftentimes this means you need to keep in communication with the people who make the data
- This is hard but it can help a lot; also generally being accessible is a good habit in the sciences
- Having code that checks to make sure that the format is correct is important
- The `pointblank` package is great for this
- `pointblank::export_report()` makes an R Markdown file that just tells you a bunch about your data; there are checks like `col_is_numeric()`
- There is an `anti_join()` which tells you what doesn’t join; code was shared with us that basically allows us to check this
- Data quality reporting: we need to make reports on all of these “suspicious things”
- Maybe something like production_reports/quality_reports
- You should have ways in your code to report if the places you are pulling data from (EX: open-source databases like JASPAR or the UCSC Genome Browser) have changed
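If `pointblank` is not available, a hand-rolled base-R check in the same spirit might look like this (a minimal sketch; the column names and expected classes are hypothetical genetics-flavored examples):

```r
# Stop early if incoming data does not match the expected schema
check_schema <- function(df, expected_classes) {
  missing_cols <- setdiff(names(expected_classes), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  for (col in names(expected_classes)) {
    if (!inherits(df[[col]], expected_classes[[col]])) {
      stop(sprintf("Column '%s' should be <%s>, got <%s>",
                   col, expected_classes[[col]], class(df[[col]])[1]))
    }
  }
  invisible(TRUE)
}

variants <- data.frame(chrom = "chr1", pos = 12345L, ref = "A", alt = "G")
check_schema(variants, list(chrom = "character", pos = "integer"))
```

Running a check like this at the top of a production script turns a silent schema drift into an immediate, loggable failure.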
§8: Final R Coding Tips and Factoids
These are some final comments and tips that I have noted but haven’t included in other sections.
- When you create an R project, a directory `/R` will automatically be made. When you have repeated functions, you should put those into scripts, and those can go in the R directory; scripts that actually execute tasks should be in a separate file. The “Rule of Three” for making functions can be used here: if you copy and paste a chunk of code more than 3 times within a project, it should probably be its own function
will automatically be made; When you have repeated functions, you should put those into scripts and those can go in the R directory, and scripts that actually execute tasks should be in a separate file; the “Rule of Three” for making functions can be used here: if you copy and paste a chunk of code more than 3 times within a project, it should probably be it’s own function) - Make sure your deployed code links to your product and vice versa; it needs to be clear how everything connects together (not just to you, but to anyone who will be using your code or seeing your products)
- An R package can be thought of like a book in a library; when you use the `library()` function in R to load a package, it can be thought of like checking a book out of the library
- One library for each project can be a good idea for certain projects if storage allows
- Batching and time boxing are valuable techniques to keep progress moving: in other words, instead of saying “I’m going to fix this”, you can say “I am going to give this two weeks”
- Keeping a log of things you are going to or want to fix can be a good way for you to remember ways to improve code; but at the same time, there might be things that a year down the line you realize “oh I didn’t fix that and it clearly seems like I don’t need to”
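The “Rule of Three” mentioned in the tips above can be sketched like this: a small cleaning step that appeared in several scripts gets pulled into its own function (the name and behavior here are hypothetical), which could live in, say, R/standardize_ids.R:

```r
# This line appeared as toupper(trimws(x)) in three or more scripts,
# so it becomes a named, reusable function
standardize_ids <- function(x) {
  toupper(trimws(x))
}

standardize_ids(c("  rs123 ", "rs456"))
#> [1] "RS123" "RS456"
```

Once extracted, the execution scripts can `source()` this file (or load it via the project package), so a fix to the cleaning logic happens in exactly one place.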