Introduction
There’s a certain joy that comes with having systems and tools work well. I know: it’s not really about the tools, it’s about what you make with the tools. But still, there’s a reason that a woodworker will make so many ‘shop projects’, a reason that a machinist will construct so many of their own tools. When you tinker with something, you develop an appreciation for incremental gains in tooling.
This post is a little like that.
The Situation
You have a process that can be well defined in general terms. For example:
- Load the data
- Verify the data
- Clean the data
- Manipulate the data
- Write to the correct database table
But you need to do the same process to several different types of data. Perhaps you have files that are of slightly different formats that each need their own specific treatments. Perhaps the ultimate destination of the data is slightly different depending on certain aspects of the data.
You need to build a way to manage all these different considerations when setting up your process.
There are many ways to do this, and I’ll cover a few below, but the one I’ve been using the most recently is by allowing R’s S3 object model to do all the routing for me.
My Examples
I have two examples that I’m using this technique for at the moment.
- An analysis that involves reading around ten different data sets that all represent the same sort of thing (specifically policy and claims information for an insurance company), but each of which comes from a different source1. I want to do the same set of general steps to each data set, but the specifics are different.
- A different analysis, this time dealing with two types of datasets from each of three different sources.
The data represents either snapshot views of an insurance company’s portfolio of lives, or movement data between two points in time (which policies have been opened, lapsed, or closed, and at what date).
Again, the format of the files is different for each provider, but I’m going to be doing something similar to each type of file regardless of which provider it comes from.
In both of these examples I want to be able to control the details of what each processing step involves for each type of file separately from controlling the way each step fits together.
I want to deal with my program at two distinct levels:
- The nitty-gritty details of reading and transforming specific sets (“tree mode”)
- The higher-level overview of the process as a whole (“forest mode”)
The most basic solution
The most basic thing is just to write different functions for different pieces. If there’s any overlap, you can use the same function, but when you’ve got something specific, just make it specific.
As an example, let’s say that we’re dealing with the first scenario above. But let’s limit ourselves to three different providers, “A”, “B”, and “C”.
We need to make:
- Data loading functions for all three types (because they are all slightly different formats)
- Data verification functions for all three types (again, different formats mean different things to verify)
- Data manipulation and storage functions
- Overall process management
We could do the following:
load_type_A <- function(filepath) {
  readr::read_csv(filepath, col_types = readr::cols(...))
}
load_type_B <- function(filepath) {
  readr::read_csv(filepath, col_types = readr::cols(...)) # but different column types
}
load_type_C <- function(filepath) {
readxl::read_xlsx(filepath) # Maybe this one always comes in .xlsx
}
verify_type_A <- function(df) {
# specific verification steps
# perhaps we use {checkmate}, {pointblank}, or {assertr} or something
}
verify_type_B <- function(df) {
# Same idea, but different steps
}
# ... and so on
Then when it comes to our processing we might have a processing function for each type:
process_type_A <- function(filepath) {
load_type_A(filepath) |>
verify_type_A() |>
manipulate_type_A() |>
store_type_A()
}
process_type_B <- function(filepath) {
load_type_B(filepath) |>
verify_type_B() |>
manipulate_type_B() |>
store_type_B()
}
process_type_C <- function(filepath) {
load_type_C(filepath) |>
verify_type_C() |>
manipulate_type_C() |>
store_type_C()
}
Great!
But what if we need to change something about the processing? We’ve already said we want to do the same general thing to each file, so if we want to change the process, we need to change three things2.
Or if we want to add another type, then we’d have to write a whole new set of functions.
What if our new type was basically exactly the same as “C”, but just needed a tweak to the verification step?
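To make that pain concrete, here’s a hypothetical sketch of a new type “D” under this approach, where “D” shares everything with “C” except verification (verify_type_D() is a made-up name following the pattern above):
process_type_D <- function(filepath) {
  load_type_C(filepath) |>     # identical to type C
    verify_type_D() |>         # the one genuine difference
    manipulate_type_C() |>     # identical to type C
    store_type_C()
}
Almost the whole function is duplication, just to swap out one step.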
Option 2: Loads of ifs
Here we write code that handles our options for us. We have some overview process and then handle the specifics with conditionals:
my_process <- function(filepath, type) {
# load
if (type == "A") {
data <- load_type_A(filepath)
} else if (type == "B") {
data <- load_type_B(filepath)
} else {
# ...etc
}
# verify
if (type == "A") {
verified_data <- verify_type_A(data)
} else {
# ...etc
}
# ...etc (rest of process)
}
This would mean that we can just load all our files into a single data structure and map or loop over them, e.g.
my_data <-
  tibble::tibble(filepath = c("file1", "file2", "file3", "file4"),
                 type = c("A", "B", "A", "C"))

# walk2() is used for its side effects (the process stores data away),
# so there's no return value to capture
purrr::walk2(
  my_data$filepath,
  my_data$type,
  my_process
)
This solves our interface at a user level, i.e. the user just has to say where the file is and what type it is, and then always calls the same my_process() function.
Nice and easy for them.
However, dealing with loads of ifs swiftly becomes tedious.
So much of our code will be conditional scaffolding that it will be easy to lose track of where the right place to make changes is.
It makes for code that’s a little more verbose than we want, and since it’s all just if conditionals, it makes things a little more fragile.
My S3 Solution
Essentially, what we’ve been trying to achieve so far is a system that works a little like the dispatcher at a taxi company (bear with me). Calls come in to the central office, and the dispatcher routes jobs out to the correct taxi driver.
Only so far, we’ve been making the drivers handle the dispatching. The functions we’ve written do both the job of running the process and the job of managing what type of object they are working on, whether that’s in the first instance, where we explicitly tagged each function name with the right type, or in the second, where we had to embed the decision logic right in with the process logic.
In the taxi office example, the dispatcher is responsible for sending jobs where they need to go, but doesn’t need to actually drive anywhere. And the drivers just need to deal with their own passengers and not with routing jobs.
The S3 object system allows us to build some infrastructure to do just this.
Crash Course in S3
In short, S3 is what lets you call summary() on a linear model and get suitable output that’s different from what you get if you call summary() on a data.frame.
In a way, summary() is smart enough to know that when you call it on a linear model, you want some specific action that relates to linear models and not data.frames.
R achieves this by having linear models, data.frames, and many other things be assigned one or more classes.
The summary() function is what’s called a generic function, i.e. a function that is written to be applied generically to many different types of things.
When you call summary() on a linear model, R will check what class your linear model is, e.g.
my_model <- lm(mpg ~ hp, data = mtcars)
class(my_model)
## [1] "lm"
R knows that my_model is of the “lm” class.
R then calls a method of the summary() generic that is specific to the class “lm”, namely summary.lm().
There’s loads of other methods, e.g. summary.data.frame().
The naming convention is generic.class().
S3 dispatch is what is going to handle our data flow.
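If you ever want to see that search order explicitly, the {sloop} package (an optional extra, not needed for anything below) can show the dispatch for a given call:
sloop::s3_dispatch(summary(my_model))
## => summary.lm
##  * summary.default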
Making a Generic
Back to our three-type example.
Let’s start by designing at the forest level. We want our process to be something like:
process <- function(binder_data) {
load(binder_data) |>
verify() |>
manipulate() |>
store()
}
Note that this (superficially) looks almost exactly the same as our initial type-specific implementations, but we’ve taken the “_type_A” suffixes off our functions.
We will be able to do this because each of these functions, load(), verify(), manipulate(), and store(), won’t be a normal function, but will instead be a generic.
Before we proceed in fleshing out the specifics, we need to tell R that these are generics. We do this using the following code:
load <- function(x, ...) {
UseMethod("load")
}
# similar definitions for the others
Some key points:
- the generic definitions don’t include any implementation: that comes later when we implement the methods
- the argument to UseMethod should match the name of the generic
- the arguments included in the function() definition are the minimum set of arguments that each method should also accept. It’s common to include ... in the arguments, as then specific methods can accept more than this minimum set
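For completeness, the remaining generics used in process() are defined the same way:
verify <- function(x, ...) {
  UseMethod("verify")
}

manipulate <- function(x, ...) {
  UseMethod("manipulate")
}

store <- function(x, ...) {
  UseMethod("store")
}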
One more piece of scaffolding
I want a way to make a general, abstract class that says an object is the kind of thing I want to process, without nailing down which specific type it is.
Let’s stick with our prior example, for which I want a general “binder_data” class that any binder data belongs to, regardless of its type.
Luckily, an S3 object’s class attribute can actually be a vector, e.g. tibble() makes tibble objects with the following class:
class(tibble::tibble(mtcars))
## [1] "tbl_df" "tbl" "data.frame"
Calling any generic on this object will make R first check whether there’s a tbl_df method; if there is, that method will be used. If there isn’t, R will check whether there’s a tbl method, and failing that, R will use the data.frame method.
So we can use this feature to set up objects with classes like:
"A" "binder_data"
"B" "binder_data"
"C" "binder_data"
That is, three different classes that represent each type, but all of them also belonging to the “binder_data” class.
We do this so we can have methods that work at the specific type level, or methods that work at the general binder_data level for when we want to do the same thing to all types.
We could make a constructor function like so:
binder_data <- function(type, filepath) {
output <- list(
data = NULL,
filepath = filepath
)
class(output) <- c(type, "binder_data", class(output))
output
}
We do two main things here:
- construct a list called output in which to store both our data and our filepath information
- change the output list’s class to be the type, then “binder_data”, and then whatever class it was before (in our case, list). This last bit makes R use whatever list methods exist where we haven’t defined a binder_data method.
Now, if we have a file path for a file that’s of type A, we can set up the input to our process like so:
example_file_1 <- binder_data("A", "path/to/file.csv")
class(example_file_1)
## [1] "A" "binder_data" "list"
And we can do similar for files of type “B” etc.
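Because every one of these objects carries “binder_data” in its class vector, whatever its specific type, inherits() gives us a quick check at the general level:
inherits(example_file_1, "binder_data")
## [1] TRUE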
Currently we’ve not put any limits on what the type argument could be; we’ll discuss that later.
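If you did want the constructor itself to refuse unknown types, a minimal sketch using match.arg() might look like this (assuming “A”, “B”, and “C” are the full set of expected types):
binder_data <- function(type, filepath) {
  # error early if the type isn't one we recognise
  type <- match.arg(type, choices = c("A", "B", "C"))
  output <- list(
    data = NULL,
    filepath = filepath
  )
  class(output) <- c(type, "binder_data", class(output))
  output
}
For the rest of the post, though, we’ll keep the unconstrained version.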
Our first method
Now that we have objects with a binder_data class and a specific type, we want to make some methods for them. For example, a load method.
Recall that we have already defined load as a generic.
We now need to make it work on objects of type “A”, “B”, etc.
load.A <- function(x, ...) {
  x$data <- readr::read_csv(x$filepath)  # type A files are plain csv
  x
}
We know3 that any object of type A has been made with binder_data and so has a data value and a filepath value.
We can then make a load method for objects of type C:
load.C <- function(x, ...) {
  x$data <- readxl::read_xlsx(x$filepath)  # type C files arrive as .xlsx
  x
}
We can do the same for all the other steps in the process and all the other types of file.
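For instance, a hypothetical verify method for type A might look something like this (the column names are invented for illustration; in practice this is where the {checkmate}, {pointblank}, or {assertr} checks would live):
verify.A <- function(x, ...) {
  # hypothetical columns that a type A file must contain
  required <- c("policy_id", "claim_amount")
  missing <- setdiff(required, names(x$data))
  if (length(missing) > 0) {
    stop("Type A data is missing columns: ", paste(missing, collapse = ", "))
  }
  x
}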
By setting ourselves up with S3 classes, R will be handling all the dispatching of methods for us.
Organisation
This still makes for a lot of code, but luckily it’s fairly modular. We can also respect the two different levels of our processing. I’d recommend something like this file layout:
- processing.R holds the definition of process() and anything else that’s ‘forest level’
- generics.R holds the definitions of all the generic functions like load() and manipulate(), and also holds the definition of binder_data()
- methods_A.R holds all the methods specific to files of type A (a ‘tree level’ file)
- methods_B.R holds all the methods specific to files of type B
- etc., with a new file for each new file type
This keeps our work nice and neat and makes it obvious where we need to make changes, e.g.:
- if you want to change the flow of the overall process, you change processing.R
- if you want to add a new file type, make a new methods_X.R file
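A top-level run script can then stitch these together, e.g. (a sketch assuming the layout above):
source("generics.R")    # generics and the binder_data() constructor
source("methods_A.R")   # tree-level methods for type A
source("methods_B.R")
source("methods_C.R")
source("processing.R")  # forest-level process()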
Dealing with the same action
When you have a processing step that’s the same across all file types, you don’t need to copy and paste the same code for each type.
I.e. you don’t need to make step.A(), step.B(), etc., if step() will be implemented the same way for all types.
Rather, just make a step.binder_data(), and that method is the one that will be called regardless of the file type.
That is, when you call step() on your object, R will look at your object of class "A" "binder_data" "list".
It will look for step.A(), but it won’t find any definition, so it will then look for step.binder_data() and apply that.
Now, say you have ten different file types and nine of them need the same processing but the tenth needs something special.
Simple: just implement a step.binder_data() that has the common implementation, and then make a step.J() for the problematic tenth file type.
S3 dispatch will handle the rest.
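As a sketch, with step() standing in for whichever generic needs this treatment; note that if type J wants the common behaviour plus a tweak, NextMethod() will pass the call down the class vector to step.binder_data():
step <- function(x, ...) {
  UseMethod("step")
}

step.binder_data <- function(x, ...) {
  # the common implementation shared by nine of the ten types
  x
}

step.J <- function(x, ...) {
  # type-J-specific adjustments go here, then fall through to the
  # common implementation
  NextMethod()
}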
What if someone tries to make an unexpected type?
This also lets us handle the case where someone accidentally makes a binder_data object with a type you haven’t implemented yet.
Say you have load.A(), load.B(), and load.C() all implemented, because you think that covers the problem domain.
But then someone comes along and runs:
my_awkward_thing <- binder_data("D", "path/to/file.csv")
They won’t be able to proceed, since there is no load.D() and there is no load.binder_data().
If you’re happy that the following error tells them all they need to know, then you don’t need to do anything more. The S3 dispatch infrastructure means that unexpected things are thrown out at an early enough opportunity.
load(my_awkward_thing)
## Error in UseMethod("load"): no applicable method for 'load' applied to an object of class "c('D', 'binder_data', 'list')"
But if you want to be a little more helpful, and there’s no generic process that you want to apply to all binder_data objects, you could implement a fallback method that will be called on any binder_data object with an unrecognised file type.
For example:
load.binder_data <- function(x, ...) {
  stop(paste0("`load` is not implemented for file type '", class(x)[[1]], "'"))
}
load(my_awkward_thing)
## Error in load.binder_data(my_awkward_thing): `load` is not implemented for file type 'D'
Conclusion
S3 may not be the best dispatch method for your job. It could be overkill, or even underpowered. You might be fine with writing loads of different specific functions, or you may need the more constrained power of S4, or something available in a different language altogether.
Nevertheless, I think S3 is such a small step up in developer learning, for such a large gain in programming flexibility, that it’s worth a place in your toolbox.
In my actual work examples, I’ve ended up following a similar framework and it’s worked well. It means I’ve kept the two layers of abstraction separate, my code is more modular, and it’s easier to understand how everything fits together.
The insurance company operates under what’s called a “binder” or “binding agreement”, whereby they let some other entity sell and manage policies on their behalf, but they are the ones actually providing the insurance and so are entitled to detailed data. Each data set is from a different binder, hence a totally different style↩︎
recall, I have nearly ten things in my actual situation↩︎
We don’t technically know this for sure, as S3 is a very loose system: you can change the class of any object at runtime and make R think something is whatever you tell it. Since R doesn’t have private constructors either, you can’t know for sure that someone hasn’t hand-constructed something unexpected. But while your users can do this, most won’t, because they probably want to get work done. Bear that in mind when you’re designing something, and decide whether R/S3 is the right fit for the problem↩︎