Introduction
I love open source software. I love the intentions behind it, I love the range of tools available and I love the community. But for me, the best thing about open source is spotting something impressive and being able to find out how it was done.
Reading someone else’s source code lets you learn directly from them. You get to see how they tackle a problem and can incorporate their techniques into your own code. I’m an okay programmer, I get by, but I want to be better, and reading other people’s code is a good way to do that.
In this post, I’m going to have a close look at some R code from the gt package1. This is one of the many R pacakges for generating publication-quality html tables and is written by Richard Iannone2. Iannone is for force behind some great R packages, and he writes some of the best documentation that I have ever seen. Seriously, it’s better than software I pay for.
gt
gt makes html tables. It stands for ‘grammar of tables’ and borrows ideas familiar to many R programmers of structuring an interface through the principles of a grammar. It’s exactly the concept underlying Hadley Wickham’s graphing package, ggplot2.
This structure to the design leads to packages containing small functions that do one thing, but can be composed into longer chains to achieve much more complext results3. It’s the same way that a ggplot2 graphic is built up from smaller pieces linked together; or more loosely it’s similar to how our natural languages build complex sentences from simple building blocks following a set of rules.
gt lets us go from this:
## # A tibble: 14 x 4
## # Groups: type, size [14]
## type size n_sold price
## <chr> <chr> <int> <dbl>
## 1 chicken L 4932 102339
## 2 chicken M 3894 65224.
## 3 chicken S 2224 28356
## 4 classic L 4057 74518.
## 5 classic M 4112 60582.
## 6 classic S 6139 69870.
## 7 classic XL 552 14076
## 8 classic XXL 28 1007.
## 9 supreme L 4564 94258.
## 10 supreme M 4046 66475
## 11 supreme S 3377 47464.
## 12 veggie L 5403 104203.
## 13 veggie M 3583 57101
## 14 veggie S 2663 32387.
to this:
size | n_sold | price |
---|---|---|
chicken | ||
L | 4,932 | $102,339.00 |
M | 3,894 | $65,224.50 |
S | 2,224 | $28,356.00 |
classic1 | ||
L | 4,057 | $74,518.50 |
M | 4,112 | $60,581.75 |
S | 6,139 | $69,870.25 |
XL | 552 | $14,076.00 |
XXL | 28 | $1,006.60 |
supreme | ||
L | 4,564 | $94,258.50 |
M | 4,046 | $66,475.00 |
S | 3,377 | $47,463.50 |
veggie | ||
L | 5,403 | $104,202.70 |
M | 3,583 | $57,101.00 |
S | 2,663 | $32,386.75 |
Data taken from the pizzaplace dataset of {gt} | ||
1
Includes Hawaiian pizzas
|
It can do a lot more, and I encourrage you to check out the documentation and examples here.
Where it all starts: gt()
All gt tables start by calling the gt()
function on a data frame.
This creates a gt_tbl
object, which is essentially a fancy list.
Classes in R are not quite the same as classes in, say, Python. And while the methods of making classes and structring inheritance are beyond this post4, very simply put, making an object of a specific class in R can be done just by telling R “hey, this thing is a $CLASS now”.
gt_table_1 <- gt(small_pizza, groupname_col = "type")
class(gt_table_1)
## [1] "gt_tbl" "list"
We can tell from this that while our object is a gt_tbl
object primarily, it’s based on a list
.
This is good for us: We can use all the normal tools for dealing with lists to get a better idea about this gt_tbl
object.
Let’s have a look what’s inside:
str(gt_table_1, max.level = 1)
## List of 16
## $ _data : tibble [14 × 4] (S3: tbl_df/tbl/data.frame)
## $ _boxhead : tibble [4 × 6] (S3: tbl_df/tbl/data.frame)
## $ _stub_df : tibble [14 × 3] (S3: tbl_df/tbl/data.frame)
## $ _row_groups : chr [1:4] "chicken" "classic" "supreme" "veggie"
## $ _stub_others : chr NA
## $ _heading :List of 2
## $ _spanners : tibble [0 × 4] (S3: tbl_df/tbl/data.frame)
## $ _stubhead :List of 1
## $ _footnotes : tibble [0 × 7] (S3: tbl_df/tbl/data.frame)
## $ _source_notes: list()
## $ _formats : list()
## $ _styles : tibble [0 × 7] (S3: tbl_df/tbl/data.frame)
## $ _summary : list()
## $ _options : tibble [133 × 5] (S3: tbl_df/tbl/data.frame)
## $ _transforms : list()
## $ _has_built : logi FALSE
## - attr(*, "class")= chr [1:2] "gt_tbl" "list"
So we see that a single call to gt()
has done quite a bit of work behind the scenes.
It’s taken our data.frame and put it into a structure that has places to put all of the formatting settings of a table.
There’s almost a one-to-one correspondence with most of the elements of a table in this diagram.
gt()
effectively sets up a form that we can then fill in to tell R exactly how we wants our tables to look.
Moreover, as each choice we make is stored separately in its own place in the list, we can change the settings in any order that we like and defer rendering to the end.
All the main functions from gt accept a gt_tbl
object as their first argument and return a modified version of that gt_tbl
object:
This means that gt is highly pipeable5, and we can build up complex tables in a series of small steps.
This is similar to the tidyverse/dplyr philosophy: all the functions accept and return the same object meaning that they can be used in any order to build up a complex pipeline from small, simple actions. A design principle that benefits both the author and the user of the software.
We saw earlier that we’ve got 16 items that can be altered using gt functions, including:
- boxhead
- stub
- heading
- spanners
- footnotes
- source_notes
Importantly, we also carry a copy of the underlying data in _data
:
all(gt_table_1$`_data` == small_pizza)
## [1] TRUE
This means that we haven’t lost the original data when we converted to a gt_tbl
and so we can also compute things from the data itself.
Totals and summaries, for example.
The gt()
function itself does a few main things, it:
- validates or corrects any arguments passed into
gt()
. - creates a list called
data
and then pipes this list through several initialisation functions, one for each of the elements of thegt_tbl
list. For example, there’s adt_data_init
function, and adt_source_notes_init
function - Sets the class of the list in the R way:
class(data) <- c("gt_tbl", class(data))
. This is the “hey R, this is now a gt_tbl object, thanks” part. - Returns the
gt_tbl
object.
Initialisation
The design principle here is one of a abstraction.
The gt()
function is concerned with making a gt_tbl
object, it’s not concerned with the actual details of getting it done, which it farms out to other functions.
Designing this way creates more functions to keep track of, but makes each individual functions eaiser to work with (and test).
Each of these *_init()
functions is responsible for setting up a part of the gt_tbl
object.
Most of them follow similar principles, so let’s just have a look at one: dt_footnotes_init
.
In the linked source file, we have four functions:
dt_footnotes_get()
dt_footnotes_set()
dt_footnotes_init()
dt_footnotes_add()
The first two each make ways to get and set the footnotes and are specialised versions of generic getters and setters. This is another example of the principle of abstraction: There are generic getters and setters that take two arguments; what to get and where to get it from. But when we know that we are dealing with footnotes, we can make a specific footnote getter that effectively already fills in the “where to get it from” bit and so only needs one argument.
The third function is the dt_footnotes_init()
function used in gt()
.
This function sets up a data structure to hold footnote information and then puts it where it needs to be.
This data structure is a tibble with seven columns wich places to put all the location and content information relating to footnotes.
By the end of all the initialisation steps, we end up with a nested data structure that has places for all our formatting choices to be recorded. The next step is to examine what happens to this data structure when we specify the formatting we want.
Adding a footnote
So far, the _footnotes
part of our table has been constructed, but is yet to be populated:
glimpse(gt_table_1$`_footnotes`)
## Rows: 0
## Columns: 7
## $ locname <chr>
## $ grpname <chr>
## $ colname <chr>
## $ locnum <dbl>
## $ rownum <int>
## $ colnum <int>
## $ footnotes <chr>
When we add some footnote information this part of the gt_tbl
object is updated:
gt_table_2 <- gt_table_1 %>%
tab_footnote(footnote = "Includes Hawaiian pizzas", location = cells_row_groups("classic"))
glimpse(gt_table_2$`_footnotes`)
## Rows: 1
## Columns: 7
## $ locname <chr> "row_groups"
## $ grpname <chr> "classic"
## $ colname <chr> NA
## $ locnum <dbl> 5
## $ rownum <int> NA
## $ colnum <int> NA
## $ footnotes <chr> "Includes Hawaiian pizzas"
We’ve added a footnote using the tab_footnote()
function.
This function validates our input and uses the set_footnote()
function from earlier.
This is yet another step on the ladder of abstraction:
- We started with generic setters and getters to change thigns about our
gt_tbl
data structure - The we found specific versions of getters and setters just for footnotes which removed the need to know where footnotes are stored.
- The we found the
tab_footnote()
function which removed the need to know how toset_footnote()
and so only requires from us information about what should go into the footnote and where to put it.
At each step the user of the function has to know less about the implementation details and so the interface becomes more and more declarative.
Lessons Learned
I’ve only lightly scratched the surface of {gt} so far. There’s a lot more to the intricacies of building and formatting the table, and we haven’t even touched upon rendering the table as html. Nevertheless, we’ve seen some very useful concepts:
- Setting up a data structure that can encapsulate all that we need to have for the task at hand.
- Making that data structure a class of its own.
- Writing functions that modify that data structure.
- designing pipeable functions to make a clear interface. This is done by having all the user-facing functions accept objects of the right class and return (modified versions of) the same class.
- Structuring the code as a series of abstractions, each step hiding some lower implementation noise that isn’t relevant to the current level. This is a powerful technique because it allows the designer to swap out lower level implementations without having to re-code higher level interfaces.
I’m going to be more mindful of these techniques in my own packages and I’m hoping that this will improve the ease of maintaining my code.
I’m using version 0.2.2↩︎
It’s very similar to the Unix Philosophy↩︎
I’d recommend Hadley Wickham’s Advanced R for more information↩︎
using the magritter pipes common in the tidyverse:
%>%
↩︎