Predicting pipe failures with GLMs in R

A lot of my time is spent helping infrastructure companies predict where and when their pipes will fail. One of the approaches I commonly use is to build generalised linear models (GLMs) to predict the location and timing of failures on the basis of three families of variables: 1) infrastructure (e.g. pipe age, diameter, material), 2) soil and geological conditions (is it shrinkable, corrosive, etc.) and 3) weather (e.g. temperature thresholds and change metrics, soil moisture levels and so on).

Particularly at an early stage in the investigation, I want to keep as many predictor variables as possible ready to use as I build and improve these models. I also want a high degree of spatial detail in the output, so we can be more precise in the guidance that we give to the infrastructure planners and wider team. What this means is that it is quite normal for me to have datasets (of pipes, bursts and all the soil, geology, weather and infrastructure variables covering many years) which are millions of rows long.

Memory and time issues with training GLMs on large datasets in R

With massive datasets like this, I used to take my laptop home with me at the end of the day and run the models overnight. Yes, we do have a supercomputer at work, but I find the time savings it offers for R are not huge. Nevertheless, I was increasingly running into memory limits, and even when I wasn't, the models took many hours to run.

Summarise your data to speed up GLM model training

A new approach I am adopting is to summarise the training data at an earlier stage, and feed into the model the most aggregated version of the data. This allows me to run the same model in a fraction of the time.

And yes, you do get the same results / model output!

Let me show you what I mean:


1) install and load the packages we need for this R session

2) load up the large dataset (for iron pipes)

3) build the model script

4) time the training of a Poisson regression GLM on the full data

5) summarise the full data

6) compare the size and dimensions of the full and summarised data

7) time the training of the GLM on the summarised data

8) compare the predictions from both models.

So… let’s get started. In this example I have not generated example data; instead I am using a real, large dataset in which every pipe segment is attributed with a range of soil, weather and infrastructure variables for each week that it has been in service.

1. Install and load the packages that we will require:

First, you will need to install some additional R packages. I am using two here:

  1. data.table is a fast data wrangling package
  2. tictoc is used for timing how long bits of code take to run

# Install the additional packages to your computer (you normally just have to do this once)
install.packages("tictoc")
install.packages("data.table")

After you install the packages, you need to load them up for this R session.

# Now load up the packages you will need for this session using the library or require function
library(data.table)
library(tictoc)


2. Load up the large dataset you want to use

full_data <- readRDS(file = "pipes_bursts_variables_2015_2016.rds")  

3. Build the model script

Here we build the model script (as a character string) from the variables with which we will be training our model.

# select the variables we are using in this model

variables <- c(                  
  "DIAM_BAND",
  "RAIN",
  "SS_WC5",
  "SMD_RLU",
  "SMDr_chgW4",
  "TEMP",
  "TchgW1",
  "Tleq4W4",
  "DaysAirFrost",
  "CORR_FE")
## build the model script string
predictors_string <- paste(variables, sep = " ", collapse = " + ")  # sticking all the variables together
model_script <- paste0("bursts ~ 1 + offset(log(length)) + ", predictors_string)
print(model_script) # this is the string that will be fed into the script below:
## [1] "bursts ~ 1 + offset(log(length)) + DIAM_BAND + RAIN + SS_WC5 + SMD_RLU + SMDr_chgW4 + TEMP + TchgW1 + Tleq4W4 + DaysAirFrost + CORR_FE"

4. Train the model on the full dataset

# running and timing the model on the full data

tic("full data model")
iron_model_1_full_data <- glm(model_script, family = poisson(), data = full_data)
toc()

## 65.473 sec elapsed

So that took a fair amount of time: well over a minute.

Normally, I would not worry about a model taking a minute to run. But this “full data” is actually not the full data, but about 5% of it. And when I do run the full data, my computer can’t cope with it… and it runs for a few hours, and then R crashes… and I have to go drink another cup of coffee. But I digress.

Now we will try training the same model, but on summarised data.

5. Summarise the data

Here we just keep the variables that we need (selected above), and also aggregate the records by those variables.

To do this, we need to sum up 1) the length of pipe (think of this like the population for a count dataset) and 2) the number of bursts which have been observed on those pipes.
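Why does this work? For a Poisson GLM with a log link and a log(length) offset, all rows sharing identical predictor values can be pooled, because sums of independent Poisson counts are themselves Poisson. Sketching it:

```
bursts_i   ~ Poisson( length_i * exp(x'β) )             for each row i in a group with identical predictors x
Σ bursts_i ~ Poisson( (Σ length_i) * exp(x'β) )         since sums of independent Poissons are Poisson
```

So replacing each group with a single row (summed bursts, summed length as the exposure in the offset) leaves the likelihood, and therefore the fitted coefficients, unchanged.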

## summarising the data into just the variables we will need

tic("summarising the data")

summarised_data <- full_data[,list(
  length=sum(length),                   # summing the length of pipes
  bursts = sum(bursts)),                # summing the number of observed bursts
  by= c(  variables, "Week_end")]       # choosing the columns we need to summarise by using the variables above

## summarising the data: 1.457 sec elapsed

6. Compare the size and dimensions of the full and summarised data

It is always a good idea to check your summarised data to see how it differs from the raw or full data.

In my case, the full data has a large number of variables that we are not actually using in this model. These are dropped. In addition, there were a number of pipe segments (polylines) that we could combine into “pipe cohorts” for training purposes. So we can see that the summarised data has many fewer columns (variables) and many fewer rows (observations).

It is also important to check that you have the same number of bursts (failures) on the same length of pipe.

# comparing the rows and columns ( *dim()* ) and file size ( *object.size()* ) of the full and summarised data

dim(full_data)                   # the rows and columns of the full data
## [1] 3358901      87
object.size(full_data)           # the file size of the full data
## 1859997048 bytes
dim(summarised_data)             # the rows and columns of the summarised data
## [1] 34840    13
object.size(summarised_data)     # the file size of the summarised data
## 3071296 bytes

# check that there are the same number of bursts in each file
sum(full_data$bursts)
## [1] 1318
sum(summarised_data$bursts)
## [1] 1318

# check that there is the same length of pipe in each file
sum(full_data$length)
## [1] 597114498
sum(summarised_data$length)
## [1] 597114498

Now that we are content that the data is summarised properly, we can run the same model we built above, but this time on the summarised, and much smaller, data.

7. Train the model on the summarised dataset

# running and timing the model on the summarised data

tic("summarised data model")
iron_model_2_summarised_data <- glm(model_script, family = poisson(), data = summarised_data)
toc()

## 0.378 sec elapsed

That was radically faster! In my tests the summarised-data model ran in less than a second, while the same model trained on the full data took over a minute. Even including the time taken to summarise the data (about 2 seconds), this is still more than 20 times faster. It also deals nicely with the memory issues I sometimes encounter.

The good news from my side is that, using this approach, my real data also runs just fine. It no longer takes hours to run (and then crash); instead, I get a useful model out the other side. Happy days!

When faced with a speed-up like this, it is always good to check that the predictions are doing what you would expect. So we will make sure that both models are giving us the same output.

8. Test to make sure that the models give the same outputs

Here we are going to load up another year of data, and predict the number of bursts based on this new data. We will predict using both models we trained above.

If all has gone well, we will get the same predictions from each model. We will test this by checking that both predict the same number of burst pipes.

(You can also load up the models and look at those too, but we are not doing that here).

# testing the predictions for the models

# load up some test data

test_data <- readRDS("pipes_bursts_variables_2017_2018.rds")   

# predict using the first model trained on full data

test_data$predicted_bursts_1_full <- predict(iron_model_1_full_data, newdata = droplevels(test_data), type = "response")  # predict

# predict using the second model trained on summarised data

test_data$predicted_bursts_2_summarised <- predict(iron_model_2_summarised_data, newdata = droplevels(test_data), type = "response")  # predict

# testing to make sure both models predict the same number of failures

sum(test_data$predicted_bursts_1_full)
## [1] 2452.322
sum(test_data$predicted_bursts_2_summarised)
## [1] 2452.322

So, by summarising data we can:

  1. radically speed up the model training process
  2. overcome out of memory error messages or R crashing
  3. free up time for more interesting work!
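If you want to convince yourself of the equivalence without access to pipe data, here is a minimal, self-contained sketch on simulated data (all the column names, sizes and rates below are made up for illustration, not taken from my real dataset):

```r
library(data.table)

# simulate a row-level dataset: material and diameter band per segment,
# a pipe length, and burst counts drawn from a Poisson with rate per unit length
set.seed(42)
n <- 100000
toy <- data.table(
  material  = sample(c("iron", "pvc"), n, replace = TRUE),
  DIAM_BAND = sample(c("small", "large"), n, replace = TRUE),
  length    = runif(n, 10, 100))
toy[, bursts := rpois(n, lambda = length * ifelse(material == "iron", 2e-3, 5e-4))]

# model trained on the full row-level data
m_full <- glm(bursts ~ 1 + offset(log(length)) + material + DIAM_BAND,
              family = poisson(), data = toy)

# aggregate: sum length and bursts within each combination of predictors
agg <- toy[, list(length = sum(length), bursts = sum(bursts)),
           by = c("material", "DIAM_BAND")]

# the same model on the (much smaller) aggregated data
m_agg <- glm(bursts ~ 1 + offset(log(length)) + material + DIAM_BAND,
             family = poisson(), data = agg)

# the coefficients should agree to numerical precision
all.equal(coef(m_full), coef(m_agg), tolerance = 1e-6)
```

Here the aggregated table has only four rows (one per material-by-diameter combination), yet the fitted coefficients match the row-level model.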

I hope this saves you time and reduces your stress!

If you are building your own infrastructure failure models and would like some assistance, drop me an email at hello (at)

