Is data.table or tidyverse’s dplyr faster at summarising data?

When dealing with large datasets, I commonly have to summarise data in order to discover, and then communicate the relevant information.

Tabular summaries in R like these are similar to using a pivot table in Excel, but with two differences:

1) the R methods are, at first, less intuitive and less user friendly and
2) the R methods are, in the long run, much more powerful and can handle large datasets much better.

Here I will compare and contrast two R based approaches to summarising data. I’ve made this code reproducible, so you should be able to follow along, and modify these scripts to suit your needs.

Methods

1) install and load the packages we need
2) generate some sample data
3) test speed of tidyverse / dplyr’s summarise function
4) test speed of data.table’s list function
5) compare results

1. Install and load the packages that we will require:

First, you will need to install some addtional R packages. Here I am using three:

  1. data.table is a fast data wrangling package (that my brother loves)
  2. dplyr is part of the wonderful tidyverse suite of packages (that most data scientists love)
  3. tictoc is used to time the length functions take to run
# Install the additional packages to your computer (you normally just have to do this once)

install.packages("tictoc")
install.packages("data.table")
install.packages("dplyr")

Load up the packages you will need for this session

# Now load up the packages you will need for this session using the library or require function 

library(tictoc)
library(data.table)
library(dplyr)

2. generate a random dataset that we will use to summarise

# generating random dataset

# in this script we generate a random dataset of pipes and bursts that we can use to test scripts



# ----------------------------

# define sample size

# ----------------------------

sample_size <- 10000000   # define the number of samples you want to take



# ----------------------------

# make individual vectors (columns) for the data.frame

# ----------------------------

material <- as.factor(
  sample(c(rep("iron", 5),
           rep("polyethelyne", 5),
           rep("asbestos cement", 3),
           rep("steel", 2),
           rep("PVC", 2)), 
         size = sample_size, 
         replace = TRUE))


length <- sample(x = 1:1000,
                 size = sample_size, 
                 replace = TRUE)

month <- sample(x = 1:12,        
                size = sample_size, 
                replace = TRUE)

year <- sample(x = c(2016, 2017, 2018, 2019),        
               size = sample_size, 
               replace = TRUE)

bursts <- sample(x = c(0, 1), 
                 prob = c(.95, .05),
                 size = sample_size, 
                 replace = TRUE)


# ----------------------------

# combine all the vectors into the data.frame

# ----------------------------


example_dataframe <- data.frame( material, year, month, length, bursts)

3) test speed of tidyverse / dplyr’s summarise function

# ---------------------------------------------

# testing dplyr summarising function

# ---------------------------------------------



tic("dplyr summarise function")         # starting the timer

dplyr_summary_table <- example_dataframe %>%
  group_by(material, month, year)%>%
  summarise(length = sum(length),
            bursts = sum(bursts),
            bursts_kkm = 1000 * (sum(bursts)/sum(length)))



toc()                                # stopping the timer
## dplyr summarise function: 1.594 sec elapsed

4) test speed of data.table’s list function

# ---------------------------------------------

# testing data.table summarising function

# ---------------------------------------------

example_data.table <- setDT(example_dataframe)  # turn the data.frame into a data.table, which is a special format of table

tic("data.table list function")       # starting the timer



data.table_summary_table <- example_data.table[,list(
    length=sum(length),
    bursts = sum(bursts), 
    bursts_kkm = 1000 * (sum(bursts)/sum(length))),
  by= c("material", "month", "year")]



toc()                                # stopping the timer
## data.table list function: 0.723 sec elapsed

5) Which was faster?

So, on my relatively fast laptop, it seems that data.table’s list approach is just over twice as fast as the tidyverse appproach.

With these small datasets, in practice there is only a second or so difference. However, when we run this same type of summary on a very much bigger table (with many million records) then the difference becomes much more noticiable.

Over to you…

So both these approaches produce the same tables (though the data.table summary table is also about a third smaller in file size than the dplyr approach). Which you use is fundamentally up to you.

The syntax for data.table is slightly different, and it moving between data.frames and data.tables can be annoying if you forget which format your data is in. (Some functions only work on data.frames, some on data.tables, and many functions work fine on both.)

Are you an avid data.table user, or do you prefer the tidyverse? (Or do you you both?)


0 Comments

Leave a Reply