The saying “you can’t make a silk purse out of a sow’s ear” has an IT equivalent, “garbage in, garbage out” or “GIGO” for short. Put simply, GIGO states that the output of any system or analysis is a product of its inputs. This logic applies to a number of fields. In mathematics, for example, if one adds x apples to y apples, the only possible result is x + y apples. The expectation that the above equation could produce any fruit other than apples is illogical, yet this flawed logic is often visible in the business domain, where we wonder why systems and processes cannot provide high-quality outputs when the quality of the input data is poor, weak or just plain wrong.
The meaning of data
Data, according to information scientists, are symbols that represent properties of objects, events and their environments. In other words, data represent units of measurement without context or reference and are the product of observation. Almost all business organisations collect and store data to support business processes, make decisions and satisfy stakeholder requirements. Raw data increases in value after it has been processed and transformed into meaningful information. In the case of an asset manager, a set of data contains a collection of numbers and letters received from its various source systems. This base data has little to no intrinsic value to the organisation until it has context and meaning. For example, a list of share prices has little value until an investment manager knows for which shares, on what date and time, and in what currency the prices are applicable. In this way, information, created from processed data, is able to provide answers to specific business questions. Poor quality data, when processed, will provide incorrect or partial information, and this is not only of little value but can represent a significant risk with shocking outcomes. Recent history, such as Brexit and the US presidential election, has highlighted the dangers of poor quality data, with almost every pre-election poll forecasting the wrong result.
From GIGO to GIQO
The majority of computer systems used in today’s investment management organisations are reliant upon data to perform key functions. One example is the performance measurement and attribution function. In order to calculate the performance of a security or portfolio, performance measurement systems use a variety of data, and the effect of incorrect data in a performance calculation can be a poor investment decision. In the case of a financial instrument such as a security, for example, if an instrument’s price increases from 100 to 150 over a certain period, the simple price performance of that instrument over that period is 50% (price performance for the period = [end price – start price]/start price). If either (or both) of the instrument prices were incorrect, the performance result would also be incorrect. Regardless of the level of sophistication of the performance measurement system, GIGO will apply. The counterpart of GIGO is “quality in, quality out” or “QIQO”. Although the relationship between quality data inputs and quality outputs is not as strong as in GIGO (a poor process or system can still produce garbage outputs from quality inputs), there is a relationship nonetheless. However, as anyone who has worked in IT for any period will confirm, data is often less than perfect. This raises the question: can computer systems, such as the performance calculation above, be designed in a way that elegantly deals with the inevitable occurrence of poor data? One possible answer involves the use of support tools, controls and processes to deal with imperfect inputs. Using the performance measurement example above, a practical implementation of this approach could be a preconfigured rule that, whenever the performance measurement system observes a price move of greater than 20%, highlights that there is potentially incorrect data that must be reviewed before further processing can take place.
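As a concrete sketch, the period performance formula and the 20% review rule described above might look like the following (the function names and the threshold default are illustrative, not taken from any particular system):

```python
def price_performance(start_price, end_price):
    """Simple price performance for a period: (end - start) / start."""
    return (end_price - start_price) / start_price

def flag_suspect_move(start_price, end_price, threshold=0.20):
    """Return True when the absolute price move exceeds the tolerance,
    signalling that the inputs should be reviewed before further processing."""
    return abs(price_performance(start_price, end_price)) > threshold

# The worked example from the text: a move from 100 to 150 is 50%,
# which breaches the 20% review threshold.
perf = price_performance(100.0, 150.0)        # 0.5, i.e. 50%
needs_review = flag_suspect_move(100.0, 150.0)  # True
```

Note that the rule does not decide whether the data is wrong; it only flags the move for human review, which is the point of the approach described above.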
A computer system that can take garbage in, perform or support validation of the data, highlight questionable data, and then allow users to verify, correct and authorise that data is a system that takes garbage in and generates quality out or, to coin yet another phrase, “GIQO”.
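A minimal sketch of such a GIQO workflow, with hypothetical record statuses and function names, could look like this:

```python
from dataclasses import dataclass, field

@dataclass
class PriceRecord:
    security_id: str
    price: float
    status: str = "PENDING"          # PENDING -> FLAGGED -> APPROVED
    notes: list = field(default_factory=list)

def validate(record, previous_price, tolerance=0.20):
    """Flag the record when the move from the previous price breaches the
    tolerance; otherwise approve it for downstream processing."""
    move = abs(record.price - previous_price) / previous_price
    if move > tolerance:
        record.status = "FLAGGED"
        record.notes.append(f"move of {move:.0%} exceeds {tolerance:.0%}")
    else:
        record.status = "APPROVED"
    return record

def authorise(record, corrected_price=None):
    """A user verifies a flagged record, optionally correcting the price,
    and authorises it for further processing."""
    if corrected_price is not None:
        record.price = corrected_price
        record.notes.append("price corrected during review")
    record.status = "APPROVED"
    return record
```

The key property is that a flagged record can only move downstream via an explicit `authorise` step, so questionable data never flows through silently.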
The technology “Silver Bullet”
Large data sets and data storage issues are typical examples of data problems faced by organisations. To solve these, technology solutions such as relational databases, data warehouses, cloud computing and data lakes are promoted as “the answer”. In reality, technology will only ever be part of the solution. Take the case of Big Data. The underlying promise of Big Data is that if an organisation can collect, store and effectively analyse this data, it will translate into better decisions. However, the sheer scale of these large datasets has challenged traditional database technology, with organisations now required to invest in cloud data storage and data lake technology. With the data storage issue resolved, a new problem emerges: finding computer systems that can analyse such large datasets. Then, assuming both of these problems have been addressed, there is the subsequent challenge of finding adequately skilled people to mine through the data to identify patterns and glean insights from the information. Much like the mythical hydra, as one problem is seemingly “solved”, new problems manifest. Fortunately, and as is typical in technology, a new solution has been offered: artificial intelligence, and machine learning in particular.
Machines need data too
Machine learning, in simple terms, is the science of designing a computer system in such a way that the program is able to improve performance on a specific task, based on data presented, without being explicitly programmed to do so. In other words, these systems can “learn” how to analyse large sets of data based on patterns identified in the data. Then, when these systems process new data, they are able to adapt to it based on the patterns previously “learnt”.
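As a toy illustration of the idea (not any particular production system), a nearest-centroid classifier “learns” by averaging the training examples for each label and then labels new observations by the closest centroid; all names here are illustrative. Note how garbage training data would shift the centroids and thus corrupt every later prediction, so GIGO applies to learnt models too:

```python
def fit(samples):
    """samples: list of (features, label) pairs.
    'Learning' here is simply computing a per-label centroid (mean)."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Label a new observation by its nearest centroid (squared distance)."""
    def dist(label):
        return sum((a - b) ** 2
                   for a, b in zip(centroids[label], features))
    return min(centroids, key=dist)

# Hypothetical training data with two clusters.
training = [([1.0, 1.0], "low"), ([1.2, 0.8], "low"),
            ([9.0, 9.5], "high"), ([8.8, 9.1], "high")]
model = fit(training)
label = predict(model, [1.1, 0.9])   # "low"
```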
A good example of machine learning is self-driving or autonomous vehicles. These vehicles make navigational decisions, without human input, by observing objects within the vehicle’s environment. Objects are categorised, and decisions are made about how these objects are likely to behave, based on large volumes of historical data previously processed. In this way, an autonomous car is dependent upon both the data it has previously processed and the data it processes in real time, and the consequences of bad or garbage data leading to an incorrect decision could be disastrous.
Data quality: a moving target?
Data quality comprises a number of key aspects, such as accuracy, validity, reliability, completeness, relevance and availability. Under different circumstances or scenarios, these characteristics may have different tolerance levels. For example, with reference to the performance example above, a treasury bond with a 50% price performance over a year is questionable and the data should be investigated, whereas a 50% price change on an unlisted or exotic security over the same period could be seen as acceptable. This highlights the need for a flexible data quality engine or framework that enables the selection and application of the most appropriate rules under different scenarios. Only with this capability in place is GIQO a realistic option.
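At its simplest, a rules engine of this kind might map scenarios to tolerances; the instrument types and limits below are purely illustrative, not recommended values:

```python
# Hypothetical tolerance table for a one-year price move, keyed by
# instrument type, as in the treasury-bond vs exotic-security example.
TOLERANCES = {
    "treasury_bond": 0.10,   # a 50% annual move on a treasury bond is suspect
    "listed_equity": 0.40,
    "exotic": 0.60,          # a 50% move here is within tolerance
}
DEFAULT_TOLERANCE = 0.25     # fallback for unconfigured instrument types

def within_tolerance(instrument_type, price_move):
    """Apply the most appropriate rule for the scenario."""
    limit = TOLERANCES.get(instrument_type, DEFAULT_TOLERANCE)
    return abs(price_move) <= limit
```

The same 50% move is flagged for a treasury bond but passes for an exotic security, which is exactly the scenario-dependent behaviour described above.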
Houston we have a problem!
Once identified, data quality issues can be resolved in a number of ways. Ideally, corrections should occur at the source of the data, although this is often not as easy as it sounds. External data providers are reluctant to make changes that affect multiple parties and, in some cases, contest that there is a data issue at all. Another option is to cater for the poor quality data in code. However, the number of permutations required can be exponentially large, and invariably new instances occur.
An effective solution involves a hybrid of these approaches. In this approach, data from source systems is reviewed and corrected before it can negatively impact downstream processes. To achieve this, data is loaded into a staging area or data warehouse prior to further processing, and a flexible data quality rules engine applies the relevant validation rules. This two-stage model allows data cleansing and approval to take place, mitigating the effects of incorrect data. The model depends on data stewards or data owners reviewing the data, and on a clearly documented process for correcting it. The level of control and oversight required is a function of the value of the data to the organisation and the risk that bad data poses to downstream systems. By adopting this approach, technology can enable and support business processes to achieve quality data outputs.
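The two-stage model can be sketched as a staging list, a set of validation rules, and a promotion step gated on either a clean pass or a data steward’s explicit approval; all identifiers below are illustrative:

```python
def promote(staging, rules, approved_by_steward):
    """Move records that pass every rule, or that a steward has approved,
    to the downstream (production) store; hold everything else for review."""
    production, held = [], []
    for record in staging:
        passed = all(rule(record) for rule in rules)
        if passed or record["id"] in approved_by_steward:
            production.append(record)
        else:
            held.append(record)
    return production, held

# Hypothetical rules: prices must be positive, moves within 20%.
rules = [lambda r: r["price"] > 0,
         lambda r: abs(r["move"]) <= 0.20]

staging = [{"id": "A", "price": 101.0, "move": 0.01},
           {"id": "B", "price": 150.0, "move": 0.50},   # breaches move rule
           {"id": "C", "price": -5.0,  "move": 0.00}]   # bad price

# A steward has reviewed record B and confirmed the 50% move is genuine.
production, held = promote(staging, rules, approved_by_steward={"B"})
```

Record B illustrates the point of the hybrid approach: a rule breach is not automatically an error, but it always forces a human decision before the data flows downstream.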
Our reliance on data cannot be overestimated. In 2012, participants at the World Economic Forum in Davos, Switzerland, went as far as to declare data a new class of economic asset, similar to gold or currency. The Economist referred to data as the “new oil”. However, just like gold and oil, data requires a process of refinement and review to achieve a level of quality that can be relied upon. In our opinion, pure technology solutions are only part of the answer to the data quality challenge. A viable solution requires an approach that allows for contextual data validation, staged data review, defined data quality procedures and accountable data owners. The days of GIGO should be numbered; garbage out should never be an acceptable option for any serious system or business organisation.