Everybody wants data quality, but experience tells us that data quality can be difficult to define and even more difficult to achieve.
In the mid-1990s I joined the newly formed HESA (the Higher Education Statistics Agency) in a role focussed on improving data quality. HESA had just completed its first annual data collections and, despite the triumph of standing up a new organisation, data specification and collection system in less than two years, the Statutory Customers – on whose patronage HESA depended – were not happy with the data quality.
The journey to data quality is neither quick nor simple. The first stage is all about listening and understanding: establishing exactly what stakeholders need from the data and therefore what quality means to them. This stage can really surface the extent to which data quality is a slippery concept – difficult to define and often varying significantly between different types of stakeholder.
The second stage is to think about how quality can be assessed – or perhaps how poor quality can be identified and trapped. At HESA we built a multi-layered approach because data can fail in many different ways. The tests that we apply to data need to address these different failure modes, and building the machinery and processes to undertake these tests can be an epic architectural endeavour.
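To make the idea of layered tests a little more concrete, here is a minimal sketch in Python. It is an illustration only, not a description of HESA's actual system: the field names (`student_id`, `birth_year`, `start_year`), the three layers and the year range are all assumptions chosen for the example. Each layer targets a different failure mode, and a record that fails one layer is not passed on to the next.

```python
from dataclasses import dataclass

Record = dict[str, object]


@dataclass
class Failure:
    layer: str         # which layer trapped the problem
    record_index: int  # position of the record in the submission
    message: str


def check_structure(record: Record) -> list[str]:
    """Layer 1: are the expected fields present at all? (hypothetical fields)"""
    required = ("student_id", "birth_year", "start_year")
    return [f"missing field: {name}" for name in required if name not in record]


def check_values(record: Record) -> list[str]:
    """Layer 2: is each value individually plausible? (assumed year range)"""
    messages = []
    for name in ("birth_year", "start_year"):
        value = record.get(name)
        if not isinstance(value, int) or not 1900 <= value <= 2100:
            messages.append(f"{name} is not a plausible year: {value!r}")
    return messages


def check_consistency(record: Record) -> list[str]:
    """Layer 3: do the fields make sense together?"""
    if record["start_year"] <= record["birth_year"]:
        return ["start_year is not after birth_year"]
    return []


LAYERS = [
    ("structure", check_structure),
    ("values", check_values),
    ("consistency", check_consistency),
]


def assess(records: list[Record]) -> list[Failure]:
    """Run every record through the layers in order, collecting failures."""
    failures = []
    for index, record in enumerate(records):
        for layer, check in LAYERS:
            messages = check(record)
            failures.extend(Failure(layer, index, m) for m in messages)
            if messages:
                break  # later layers assume the earlier ones passed
    return failures
```

Keeping each layer as a separate function means a new failure mode can be addressed by adding a check to the list rather than by reworking the ones already there.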
The final stage is about bringing this whole thing to life. The world evolves; our data needs evolve, and so the definition and assessment of quality need to evolve too. This needs to take place within the machine – learning from previous failures – and by looking outwards to the world that the data describes and the uses to which the data are put.
Data quality is not an absolute, nor is it a destination; it is an ongoing process that continually reinvents itself in order to achieve and maintain relevance in this ever-changing world.
Data quality is a way of life.