
Keeping Track With Pipes

It’s 3am and you hear banging on your front door. You go downstairs to see who it is. It’s the Data Police, and they don’t look happy.

“What exactly did you do to get that data.frame? You need to tell us right now, and you need to be accurate.”

It’s an occurrence that we’ll all experience sooner or later in our careers as data professionals, and so it’s useful to be able to design systems that keep track of what transformations we’re applying to our data.

There are a few ways to do this, ranging from basic and verbose to a little cleaner.

The Basic Idea

We want a way of looking at the end product of an analysis and seeing how it was made. We want to be able to reconstruct its history from the object itself, not by reading back all the code. I’d say the latter is an almost-solved problem: in R, use clear code, pure functions and the {targets} package and you should be most of the way there.
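For illustration, here’s a minimal sketch of that route — a _targets.R file, where tar_option_set() and tar_target() are real {targets} functions but the target and helper names (penguin_summary, summarise_penguins) are made up:

# _targets.R — a sketch; the target and helper names are made up
library(targets)
tar_option_set(packages = c("dplyr", "palmerpenguins"))

list(
  # declare the analysis as a pipeline step; summarise_penguins() would be
  # a pure function defined elsewhere in the project
  tar_target(penguin_summary, summarise_penguins(penguins))
)

Running tar_make() then builds the pipeline, and the _targets.R file itself becomes the record of how everything was made.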

Today, though, we’re playing around with implementing the former.

We’re going to have as our example this complex piece of data analysis:

library(tidyverse)
library(palmerpenguins)

penguins %>%
  mutate(bill_area = bill_length_mm * bill_depth_mm) %>%
  group_by(species) %>%
  summarise(bill_area = mean(bill_area, na.rm = TRUE))
## # A tibble: 3 × 2
##   species   bill_area
##   <fct>         <dbl>
## 1 Adelie         713.
## 2 Chinstrap      902.
## 3 Gentoo         714.

And we want to end up with some sort of log of the steps we’ve taken at each stage.

First Attempt

We’ll start with keeping our logging information in a list.

Our first attempt might be to re-write some of the {dplyr} functions to take a log and a data.frame and work from there. We can (ab)use the ... arguments of R functions to accept the analysis arguments and pass them through to the actual dplyr function:

log <- list()

logged_mutate <- function(df_and_log, ...) {
  # record this step at the front of the log, then apply the real verb
  new_log <- append("mutate", df_and_log$log)
  new_df <- mutate(df_and_log$df, ...)
  list(df = new_df,
       log = new_log)
}

logged_group_by <- function(df_and_log, ...) {
  new_log <- append("group_by", df_and_log$log)
  new_df <- group_by(df_and_log$df, ...)
  list(df = new_df,
       log = new_log)
}

logged_summarise <- function(df_and_log, ...) {
  new_log <- append("summarise", df_and_log$log)
  new_df <- summarise(df_and_log$df, ...)
  list(df = new_df,
       log = new_log)
}

tmp <- logged_mutate(list(df = penguins, log = log),
                     bill_area = bill_length_mm * bill_depth_mm) %>%
  logged_group_by(species) %>%
  logged_summarise(bill_area = mean(bill_area, na.rm = TRUE))

tmp
## $df
## # A tibble: 3 × 2
##   species   bill_area
##   <fct>         <dbl>
## 1 Adelie         713.
## 2 Chinstrap      902.
## 3 Gentoo         714.
## 
## $log
## $log[[1]]
## [1] "summarise"
## 
## $log[[2]]
## [1] "group_by"
## 
## $log[[3]]
## [1] "mutate"

but this has some drawbacks:

  • We currently have no information about the process applied, other than that some mutate happened. (Though see the sketch after this list for one way to capture more.)
  • We have to wrap every dplyr function in a new logged_* function, which means logging only works for functions we’ve made the effort to wrap.
  • Fundamentally, our analysis functions are becoming analysis and logging functions. They are no longer doing a single job, since they also have to wrap and unwrap the log.
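On that first point, here’s a sketch of one way to record more detail: match.call() (base R) captures the complete call, arguments and all. The function name logged_mutate_verbose is made up for this illustration:

logged_mutate_verbose <- function(df_and_log, ...) {
  # match.call() captures the full call, including the arguments
  new_log <- append(list(match.call()), df_and_log$log)
  list(df = mutate(df_and_log$df, ...),
       log = new_log)
}

The other two drawbacks still stand, though.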

Aside: Abstraction Possibility

While it’s not the solution we’re going to end up with, it’s worth noting that we can go a little further down this path by making a function factory to produce these wrapped dplyr functions. We’re still going to have to define each function we want to use, but it will save a little typing and reduce the potential for copy/paste errors:

wrap_with_logging <- function(fun, fun_name) {
  # returns a logged version of `fun` that records `fun_name` at each call
  function(df_and_log, ...) {
    new_log <- append(fun_name, df_and_log$log)
    new_df <- fun(df_and_log$df, ...)
    list(df = new_df,
         log = new_log)
  }
}


logged_mutate2 <- wrap_with_logging(mutate, "mutate")
logged_group_by2 <- wrap_with_logging(group_by, "group_by")
logged_summarise2 <- wrap_with_logging(summarise, "summarise")

tmp <- logged_mutate2(list(df = penguins, log = log),
                      bill_area = bill_length_mm * bill_depth_mm) %>%
  logged_group_by2(species) %>%
  logged_summarise2(bill_area = mean(bill_area, na.rm = TRUE))

tmp
## $df
## # A tibble: 3 × 2
##   species   bill_area
##   <fct>         <dbl>
## 1 Adelie         713.
## 2 Chinstrap      902.
## 3 Gentoo         714.
## 
## $log
## $log[[1]]
## [1] "summarise"
## 
## $log[[2]]
## [1] "group_by"
## 
## $log[[3]]
## [1] "mutate"

This at least saves us some work, but we’ve still got the issues with functions having multiple purposes.
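We could go one step further and generate the wrappers in a loop — again just a sketch, looking each verb up in dplyr’s namespace:

verbs <- c("mutate", "group_by", "summarise", "filter", "arrange")
logged <- lapply(
  setNames(verbs, verbs),
  function(nm) wrap_with_logging(get(nm, envir = asNamespace("dplyr")), nm)
)
# logged$mutate(), logged$filter(), etc. are now ready to use

But however we generate them, the wrappers are still doing two jobs at once.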

Modifying the Pipe

We really want a solution that abstracts out the process of wrapping and unwrapping the logging infrastructure so we can separate it from the actual analysis. We’re going to do that by modifying the %>% pipe.

Confession time: I don’t really know how the magrittr pipe works, but that’s not going to stop us. Using one of my favourite features of R, let’s look at the internal code:

magrittr::`%>%`
## function (lhs, rhs) 
## {
##     lhs <- substitute(lhs)
##     rhs <- substitute(rhs)
##     kind <- 1L
##     env <- parent.frame()
##     lazy <- TRUE
##     .External2(magrittr_pipe)
## }
## <bytecode: 0x555d6ece7fc8>
## <environment: namespace:magrittr>

Okay, the first two lines quote the first and second arguments to the function, respectively1.
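As a quick reminder of what quoting means here: substitute() grabs the unevaluated expression that was passed to a function, rather than its value:

f <- function(x) substitute(x)
f(1 + 1)
## 1 + 1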

The next line sets the variable kind to 1L. This is one of those lines that’s magic to me: I’ll just leave it there and hope it all works.

The next line defines a variable to point to the parent environment. R lets you pass around code and evaluate it in different environments. Each function has its own environment that’s just for the body of its code, but each call to a function is also made from somewhere: the environment of its caller, its “parent frame”, which is what parent.frame() returns.
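A quick demonstration of that:

f <- function() {
  x <- "inside f"
  g()
}

g <- function() {
  # parent.frame() is the environment g was called from, i.e. f's
  get("x", envir = parent.frame())
}

f()
## [1] "inside f"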

.External2 is apparently “an enhanced version of .External which passes additional information in more arguments”. .External is a function “to pass R objects to compiled C/C++ code that has been loaded into R”. So I suppose that magrittr is doing something behind the scenes with some compiled code. Maybe I’ll look into that one day, but not today. We’ve got all we need for now.

We’re going to take the %>% source code and adapt it to our needs, which are:

  • We want our new pipe to be able to handle a list with the data.frame and the log on the left
  • We want it to handle all the unwrapping, re-wrapping and log processing for us
  • We want it to work without us having to change any of the dplyr functions.

Since we’re going to be wrapping and unwrapping the log, we may as well write a function to handle the wrapping for us:

wrap_with_log <- function(df, log = list()) {
  # if no log is supplied, start one with the quoted expression for `df`
  if (missing(log)) {
    log <- substitute(df)
  }
  list(
    df = df,
    log = log
  )
}

The if just checks for the case that a log isn’t given; in that case, substitute(df) records the quoted expression (here, the symbol penguins) as the start of the log. This is for convenience, so we can wrap_with_log(penguins) at the start without having to supply a log.
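So at the start of a chain, the log’s first entry is the quoted symbol itself, not a list — which is fine, because append() will happily c() it together with the calls we add later:

str(wrap_with_log(penguins)$log)
##  symbol penguins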

Our new pipe just needs a few adaptations from {magrittr}’s, and we’ll call it %<$>%:

`%<$>%` <- function(lhs_logged, rhs) {
  # unwrap the data and the log
  lhs <- lhs_logged$df
  log <- lhs_logged$log
  # lhs is now an ordinary value rather than a promise, so substitute()
  # returns it as-is; rhs, however, is quoted
  lhs <- substitute(lhs)
  rhs <- substitute(rhs)
  # record the quoted right-hand side call in the log
  new_log <- append(log, rhs)
  
  # regular pipe stuff
  kind <- 1L
  env <- parent.frame()
  lazy <- TRUE
  a <- .External2(magrittr:::magrittr_pipe)
  
  # re-wrap the result with the updated log
  wrap_with_log(a, new_log)
}

A few things to note:

  • This does no error checking, i.e. it will just fail ungracefully if the lhs_logged argument isn’t a list of the expected type.
  • We’re taking advantage of the fact that %>% quotes the input to get better logs (see where we just append the rhs to make new_log).
  • We’re running the normal pipe, but saving it to a and then wrapping with the log before we return it at the end of the function.

All in all, this has let us do:

analysis <- wrap_with_log(penguins) %<$>%
  mutate(bill_area = bill_length_mm * bill_depth_mm) %<$>%
  group_by(species) %<$>%
  summarise(bill_area = mean(bill_area, na.rm = TRUE))

So we’ve now got analysis, a list() of two items.

The output data:

analysis$df
## # A tibble: 3 × 2
##   species   bill_area
##   <fct>         <dbl>
## 1 Adelie         713.
## 2 Chinstrap      902.
## 3 Gentoo         714.

And the log:

analysis$log
## [[1]]
## penguins
## 
## [[2]]
## mutate(bill_area = bill_length_mm * bill_depth_mm)
## 
## [[3]]
## group_by(species)
## 
## [[4]]
## summarise(bill_area = mean(bill_area, na.rm = TRUE))

And we can get back code that would reproduce the analysis with this function:

get_code <- function(df_with_log) {
  df_with_log$log %>%
    # deparse() can split long calls over several lines, so collapse them
    map_chr(~ paste(deparse(.x), collapse = " ")) %>%
    paste(collapse = " %>% ")
}

get_code(analysis)
## [1] "penguins %>% mutate(bill_area = bill_length_mm * bill_depth_mm) %>% group_by(species) %>% summarise(bill_area = mean(bill_area, na.rm = TRUE))"

Which is actually valid code and can be run:

eval(str2lang(get_code(analysis)))
## # A tibble: 3 × 2
##   species   bill_area
##   <fct>         <dbl>
## 1 Adelie         713.
## 2 Chinstrap      902.
## 3 Gentoo         714.

Extensions and Future Work

I’m not sure whether this has legs, but if I were to use it properly, I’d want to do the following things:

  • More defensive programming, e.g. something to handle situations with badly formed arguments, errors, etc.
  • Tidier ways of recovering the code, etc.
  • A better format for the log itself and for the data wrapped in a log, e.g.
    • the log could be some S3 class (see the sketch below)
    • the log could be kept in the attributes of the data. Maybe; there are benefits to both approaches.
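For instance, here’s a minimal sketch of the S3 idea, assuming wrap_with_log() were changed to give its result a made-up class "logged_df":

print.logged_df <- function(x, ...) {
  # show the data, then replay the history beneath it
  print(x$df)
  cat("\nHistory:\n")
  for (step in x$log) cat("  ", paste(deparse(step), collapse = " "), "\n")
  invisible(x)
}

With that in place, the object would print its own history every time you looked at it.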

I’m going to have a think and see if this can be fleshed out a little more.


  1. Remember that in R, any two-argument function whose name is wrapped in % signs can be used in “infix” mode, as if it were an operator. It’s customary then to refer to the arguments as lhs and rhs, i.e. the one on the left hand side of the operator and the one on the right hand side↩︎
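     A tiny illustration (the operator %+% is made up on the spot):

     `%+%` <- function(lhs, rhs) paste(lhs, rhs)
     "data" %+% "police"
     ## [1] "data police"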