class: center, middle, inverse, title-slide

# Reproducibility in R
## Intro to efficient Data Pipelines
### Michael Jones
### 2021-07-20

---
class: center, middle, inverse

# What is<br>**Reproducibility**?

---
class: middle, center

# Someone else can<br>re-run your process<br>and get the same results

---
class: middle, center

# Someone **(else)** can<br>re-run your process<br>and get the same results

---
class: middle, center

# Someone **(else)** can<br>**re-run** your process<br>and get the same results

---
class: middle, center

# Someone **(else)** can<br>**re-run** your process<br>and get the **same results**

---
class: center, middle, inverse

# What does a<br>reproducible process<br>**look like**?

---
class: center, middle

# Well documented

---
class: center, middle

# Non-interactive

---
class: center, middle

# Keyboard-based

---
class: center, middle

# Structured consistently

---
class: center, middle

# Friendly

---
class: center, middle

# Extendible

---
class: center, middle, inverse

# The **scale**<br>**of Reproducibility**

---

# Stage 0

--

- Doing it and not telling anyone

--

- Hand Made Artisanal Analysis

---

# Stage 1

--

- Pen and Paper

--

- Excel

--

- Point and Click (Mouse work)

--

- Doing it in R then not saving your work

---

# Stage 2

- Doing it in R **and** saving your work, but still not having any structure

---

# Stage 3

- Scripts like

```
01_load_data.R
02_fit_linear_model.R
03_fit_gam.R
04_model_summaries.R
05_plots.R
06_paper.R
```

---

# Stage 4

- Process defined **in code**
- Using a system that knows about **dependencies**
- With a defined **build process**
- That handles storage of results **for you**
- End to end: from **data to report**

---

# Stage 5

- Virtual Environments
- Containerisation (e.g. Docker)
- Virtual Machines

---

# ~~Stage 5~~

- ~~Virtual Environments~~
- ~~Containerisation (e.g. Docker)~~
- ~~Virtual Machines~~

# Out of scope today

---
class: inverse, middle, center

# What does reproducibility<br>**not** look like?
---

# In R

```r
df %>%
  filter(col < value) %>%
  ...
```

- No clear declaration of libraries
- No evidence of how we got `df` in the first place

---

# In R

```r
setwd("path/to/directory/that/only/exists/on/my/machine/")
```

- Use shared resources (e.g. databases)
- Use paths relative to the project root
- Use {here}

---

# In R

```r
rm(list = ls())
```

- NO

---
class: center, middle, inverse

# All Analysis<br>is a **DAG**

---

- **Graph**: stages of your analysis are *connected* somehow
- **Acyclic**: there is an order, earlier results feed into later results, *no loops*
- **Directed**: there's a *flow* from start to finish

---
class: center, middle, inverse

# What does Reproducibility<br>**feel** like?

---
class: center, middle

# "I have no idea how we did that..."

--

# **"This thing from 2 years ago makes perfect sense"**

---
class: center, middle

# "... Oh no, that change in data is going to put another three weeks on the deadline"

--

# **"No problem, we'll have updated results with you by lunchtime"**

---
class: center, middle, inverse

# Example

---

# The {targets} package

- Structures the analysis as a **data object**
- On changes, only rebuilds **downstream from the change**
- Stores all intermediate results out of the way
- Plays nicely with R Markdown
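---

# A minimal `_targets.R` sketch

A rough illustration of how a pipeline might be declared with {targets}, not a definitive setup. The file `data/raw.csv`, the columns `x` and `y`, and every target name below are hypothetical placeholders; only the `tar_*()` calls come from the package.

```r
# _targets.R (hypothetical pipeline: file, column and target names are placeholders)
library(targets)

# Declare the packages the targets need up front, instead of loading them ad hoc
tar_option_set(packages = c("readr", "dplyr"))

pipeline <- list(
  # Track the raw file itself, so editing it invalidates everything downstream
  tar_target(raw_file, "data/raw.csv", format = "file"),
  # A target depends on another target simply by mentioning its name
  tar_target(raw_data, read_csv(raw_file)),
  tar_target(clean_data, filter(raw_data, !is.na(y))),
  tar_target(model, lm(y ~ x, data = clean_data))
)

# The script must end with the list of targets
pipeline
```

- `targets::tar_make()` builds whatever is out of date
- `targets::tar_read(model)` fetches a stored result
- `targets::tar_visnetwork()` draws the DAG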