Is your organisation’s data as clean as its toilets?

This weekend, I spent over two hours deep cleaning our downstairs bathroom.

Let’s be clear. I’m not one of those people who loves cleaning.  However, we do have three vivacious kids who enjoy playing outside (barefoot) and so a certain amount of grime does tend to be tracked inside.

And (more excuses time) with the Covid-19 induced lockdown and the associated disruption to normal patterns of housework, our bathroom had got to the state where both Lara and I knew “something had to be done”.

And it was my turn.

cleaning … required for both bathrooms, and data

So, I donned my fashionable, yellow rubber gloves, raided Lara’s cooking supplies for baking soda and vinegar, and got to work. And as I scrubbed, it got me thinking…

Many of the companies I work with make multi-million pound decisions, often supported by extensive data analysis. But the reality is that they spend more time and money cleaning their toilets than cleaning their data.

Nobel Prize-winning psychologist Daniel Kahneman, in his book “Thinking, Fast and Slow”, highlighted our tendency as humans to put far too much faith in data or information that is right in front of us, while ignoring vital pieces of information that are not immediately to hand. Kahneman calls this “what you see is all there is”.

We also tend to assign higher trust to data or information that we see first, and many organisations do not yet have processes in place to sufficiently cross-examine it. As the ancient saying goes: “The one who states his case first seems right, until the other comes and examines him.” (Proverbs 18:17)

chemists reject impure chemicals

For many years I worked in a university with many labs dedicated to improving the processes involved in cleaning water. Our lab technicians (the often-unsung heroes of academia) and research teams would NEVER accept poor-quality or contaminated chemicals for use in their experiments.

With impure chemicals, you would not know what was controlling the results or outcomes you observed.

It would be nonsensical, unthinkable even, for chemists to accept impure chemicals.

And yet, as decision makers, too often we accept impure information.

As data scientists, too often we accept impure data.

Just as Kahneman highlighted, we are often too quick to believe the first information or data we see. But as we all know: “rubbish in, rubbish out”. This applies to chemistry, data science (especially predictive modelling) and decision making.

how much inaccuracy is ok?

Of course, perfect data is impossible. But what is your tolerance for messy data? If our data is missing 10% of the observations, are we concerned? What about 3%?

What if those 3% have similarities (e.g. they refer predominantly to one group, such as vulnerable customers, or a particular area, such as a discrete part of an infrastructure network)? Very quickly, we can start to unconsciously remove information from our decision-making process.
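To make that concrete, here is a minimal sketch in Python, using entirely hypothetical customer records, showing how a 3% missing rate overall can hide a much larger gap within one group:

```python
# Hypothetical customer records: (group, value). None marks a missing value.
records = (
    [("standard", 100.0)] * 96       # 96 standard customers, fully recorded
    + [("vulnerable", None)] * 3     # 3 vulnerable customers, all missing
    + [("vulnerable", 80.0)] * 1     # 1 vulnerable customer recorded
)

def missing_rate(rows):
    """Fraction of rows whose value is missing."""
    return sum(1 for _, v in rows if v is None) / len(rows)

overall = missing_rate(records)
vulnerable = missing_rate([r for r in records if r[0] == "vulnerable"])

print(f"overall missing:    {overall:.0%}")     # 3% looks tolerable...
print(f"vulnerable missing: {vulnerable:.0%}")  # ...but 75% of one group is gone
```

A headline figure of 3% missing sounds harmless; breaking the same figure down by group shows that three-quarters of the vulnerable customers have silently dropped out of the analysis.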

From this unclean data, models can produce beautifully mapped outputs, and clearly summarised tables. The only problem with these beautiful maps and tables is that they are imperfect (and sometimes deeply flawed) reflections of reality.

But how imperfect are they?

Are they wrong enough to lead to a wrong decision?

Often, I would say “yes” for many businesses.

The Heath brothers, in their book “Decisive”, cite a study of corporate mergers and acquisitions which showed that over 80% of these failed to create any new value for shareholders. Surely such decisions are based on rigorous data analysis?

One of the things I have observed with large corporate datasets is that the problems with the data are almost always unknown, and unseen until you get up close. As you begin to build models, you might get results you don’t logically expect from the real-world processes you know are at play.

As you start to investigate the data in more detail, you typically find missing, incorrect, or inconsistently recorded observations. The more you dig, the more problems you find.
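The kinds of problems you typically dig up can be surfaced with a handful of simple checks. Here is a hedged sketch in Python, over hypothetical meter-reading records, of what that first pass might look like:

```python
# Hypothetical meter-reading records, seeded with the usual suspects.
records = [
    {"id": 1, "region": "North", "reading": 41.2},
    {"id": 2, "region": "north", "reading": 39.8},    # inconsistent casing
    {"id": 3, "region": "North", "reading": None},    # missing value
    {"id": 3, "region": "North", "reading": 41.2},    # duplicated id
    {"id": 4, "region": "South", "reading": -999.0},  # sentinel posing as data
]

# Missing observations.
missing = [r["id"] for r in records if r["reading"] is None]

# Labels recorded inconsistently (here: casing that differs from Title Case).
inconsistent = {r["region"] for r in records} - {r["region"].title() for r in records}

# Duplicate identifiers.
seen, duplicates = set(), []
for r in records:
    if r["id"] in seen:
        duplicates.append(r["id"])
    seen.add(r["id"])

# Physically impossible values (a meter reading cannot be negative).
suspicious = [r["id"] for r in records if r["reading"] is not None and r["reading"] < 0]

print("missing readings:   ", missing)       # [3]
print("inconsistent labels:", inconsistent)  # {'north'}
print("duplicate ids:      ", duplicates)    # [3]
print("suspicious values:  ", suspicious)    # [4]
```

Each check is trivial on its own; the point is that every one of them fires on a dataset of just five rows, and in real corporate datasets they tend to fire together.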

While you may have had a suspicion that things needed cleaning, once up close, you become certain that the unclean data problem is worse than you first thought. Just like our neglected bathroom floor.

But just like a clean bathroom has many benefits (including a very happy wife!), there are many benefits of clean data. A very small subset includes:

  • improved clarity surrounding issues
  • more accurate predictive models
  • and decisions that lead to more profitable outcomes.

Once you recognise such benefits, then cleaning data doesn’t seem such a chore.

You might even get a certain amount of pleasure from fixing errors, highlighting mistakes and seeing the transformation and many benefits you can deliver with a little effort.

You might even want to keep it clean…  (more on that in the future!).

Question for you:

What is your number one issue surrounding data cleanliness?

Categories: Blog, data, R
