We think a lot about data quality: why it matters, how we can achieve it, and why it is sometimes very difficult to evaluate.
When I first joined HESA in 1996 my job was Data Quality Manager and I was tasked with improving the HE sector’s understanding of, and approach to, data quality in the HESA returns.
Data quality can be a very slippery concept, especially when you attempt to apply some kind of fitness-for-purpose test. We look for the science of data quality when maybe it’s actually more like an art.
I have always found it quite helpful to think about the ways in which data can go wrong. Classic models of quality management (e.g. ISO 9001) tell us that you need to understand how things fail in order to effectively identify and correct failures. I like to classify things, so let’s think of this as an ontology of data failures.
1. Failure by specification
I have written previously about how the specification of hard data hits up against the soft and dynamic world that we live in. Creating a data model and definitions that accurately and appropriately represent this reality is the first challenge in any data journey and is, therefore, the first place it can go wrong.
1.1 If the entities and their relationships in your data model are not correctly defined then the data can never be fit for purpose. The word ‘correct’ in this case covers the observation and understanding of the reality that is being modelled and the use to which the data is to be put.
1.2 In a similar vein to the previous failure type, fields/attributes need to sit on the correct entity. If you find yourself struggling to decide which entity an attribute belongs to, you probably have a problem with the specification of the entities, not the attribute itself.
1.3 Fields/attributes that misclassify concepts – or maybe classify them in a way that is inappropriate for your use case – undermine the utility that you will be able to drive from your data.
1.4 Specifying data at a level of granularity that is inappropriate for your use case. While the temptation might be to err towards more detail, this can make the implementation and process stages more difficult.
2. Failure by implementation
OK – so let’s assume that the data specification is a good and useful representation of the reality; the approach to implementation can still let us down. This is usually the point where the data specification meets the specification of systems and processes.
2.1 Does the trinity of data, system and process have the correct operational scope?
2.2 Is each of those elements specified with a consistent understanding of that scope?
2.3 Do those three elements have a consistent (harmonious?) understanding of time and timeliness?
2.4 Are the systems and processes consistent with the data specification?
3. Failure by process
Having assembled the implementation, you fire it up and watch the various elements come together like a well-oiled machine. Impressive, isn’t it?
3.1 Data can be missing; this is a non-conformance against specification and, in some cases, can be systematically trapped if there are reliable indicators of its absence. However, this is not always the case and it is often said that data you haven’t got is the most difficult data to quality assure.
3.2 Non-conformance to the data specification can happen in many other ways and is normally fairly easy to trap; tests on data types, range checks and the relationship between different fields are the most common.
3.3 Data can just be wrong. Errors of this kind can be really difficult to trap and, depending on the use of the data, can have anything from zero to terminal consequences. For example, a birth date could be incorrect by one week; if you want to calculate somebody’s age to the nearest five years then it’s probably not a biggie but if you want to send them a birthday card it’s less good. The potential for systematic tests varies depending on the use case; audit is the ultimate solution.
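The birth date example in 3.3 can be made concrete with a few lines of Python. This is only a sketch: the dates are invented for illustration, and the helper names (`age_on`, `age_band`, `is_birthday`) are hypothetical rather than taken from any real system.

```python
from datetime import date, timedelta


def age_on(birth: date, today: date) -> int:
    """Completed years of age on `today`."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def age_band(birth: date, today: date, width: int = 5) -> int:
    """Age rounded to the nearest multiple of `width` years."""
    return width * round(age_on(birth, today) / width)


def is_birthday(birth: date, today: date) -> bool:
    """True if `today` is the anniversary of `birth`."""
    return (birth.month, birth.day) == (today.month, today.day)


# Invented example: the recorded date is one week later than the true one.
true_birth = date(1990, 6, 15)
recorded_birth = true_birth + timedelta(days=7)
today = date(2024, 6, 15)

# Use case 1: age to the nearest five years. The error makes no difference.
print(age_band(true_birth, today), age_band(recorded_birth, today))

# Use case 2: sending a birthday card. The same error misses the day.
print(is_birthday(true_birth, today), is_birthday(recorded_birth, today))
```

The same one-week error is invisible to the first use and fatal to the second. Note that even the five-year banding is only probably safe: when someone sits right on the edge of a band, a week's error can tip them into the neighbouring one.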
4. Failure in onward analysis
So, let’s imagine we have, by some act of miracle or genius, successfully got to this stage of the data journey. We can still go badly wrong…
4.1 The most obvious way to fail in your analysis of data is simply getting your sums wrong. If you’re using a spreadsheet to analyse your data then you are far more likely to get it wrong than get it right. Don’t get me started on spreadsheets…
4.2 In your analysis and interpretation of data you can fail to understand the reality that the data describes and the context in which that reality exists. This isn’t a problem with the data; it’s about your knowledge of the domain. NESTA identifies domain expertise as one of the four key attributes of data scientists.
4.3 A subtle twist on the previous failure is a failure to understand how the data specification and the data implementation represent that reality. Both areas involve assumptions and interpretations and if these are not fully understood, then the journey from data to information is compromised. This often results in drawing conclusions from the data that the data does not really support.
4.4 Using data in ways that it was never designed to support; the temptation of driving even more value out of data by short-circuiting all the above points and hoping for the best.
So how can we get on top of data quality?
It strikes me that there are relatively few areas in this analysis where quality assurance can, in any meaningful sense, be automated. You can write validation rules for 3.2 and maybe for some elements of 3.1; statistical techniques can make some progress in 3.3. But achieving meaningful data quality requires high quality at every stage of the data lifecycle and there is a lot here that looks more like an art than a science.
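To illustrate the part that can be automated, here is a minimal sketch of the kind of validation rules that cover 3.2 and some of 3.1: a presence check, a type check, a range check and a cross-field check. The field names (`student_id`, `birth_date`, `start_date`) and the range limits are assumptions made up for this example, not part of any actual return specification.

```python
from datetime import date

# Hypothetical specification: these field names are illustrative only.
REQUIRED_FIELDS = ("student_id", "birth_date", "start_date")
DATE_FIELDS = ("birth_date", "start_date")
EARLIEST_BIRTH = date(1900, 1, 1)  # assumed lower bound for the range check


def validate(record: dict) -> list:
    """Return a list of rule failures for one record (empty if none)."""
    errors = []

    # 3.1: missing data, trapped where absence is detectable.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            errors.append(f"{field}: missing")

    # 3.2: data type test.
    for field in DATE_FIELDS:
        value = record.get(field)
        if value not in (None, "") and not isinstance(value, date):
            errors.append(f"{field}: expected a date")

    birth = record.get("birth_date")
    start = record.get("start_date")

    # 3.2: range check.
    if isinstance(birth, date) and not (EARLIEST_BIRTH <= birth <= date.today()):
        errors.append("birth_date: out of range")

    # 3.2: relationship between different fields.
    if isinstance(birth, date) and isinstance(start, date) and start <= birth:
        errors.append("start_date: not after birth_date")

    return errors


good = {"student_id": "S1", "birth_date": date(2000, 1, 1), "start_date": date(2018, 9, 1)}
bad = {"student_id": "", "birth_date": "2000-01-01", "start_date": date(1999, 1, 1)}
print(validate(good))
print(validate(bad))
```

A record can pass every one of these rules and still be wrong in the 3.3 sense: a plausible birth date that is a week out sails straight through, which is rather the point.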