Overall Approach to Data Quality ROI
By: William McKnight, founder and president,
McKnight Associates, Inc.
Sponsored by: Firstlogic, Inc.
THE IMPORTANCE OF DATA QUALITY
Data quality is an elusive subject that can defy measurement and yet be critical enough to
derail any single IT project, strategic initiative or even a company as a whole. The data layer
of an organization is a critical component because it is so easy to ignore the quality of that data or to make overly optimistic assumptions about its efficacy. Having data quality as a focus is a business philosophy that aligns strategy, business culture, company information, and technology in order to manage data to the benefit of the enterprise. Put simply, it is a competitive strategy.

One day, like operational excellence, rich product features, everyday low prices, high product quality and short time-to-market, data quality will be expected by our markets. In the meantime, each company has the opportunity to differentiate itself through the quality of its data. Leading companies are now defining what the marketplace data quality expectation will be.
A parallel trend to data quality improvement in the marketplace is the comeback of return-on-investment (ROI) measurement systems for technology-based initiatives. No longer is it acceptable to “throw money at problems,” target soft measures or lack accountability for results in technology spending. Targeting ROI is viable even for efforts to improve the quality of a company’s data, and many executives are demanding payback for quality initiatives.
It is important to note that there are many benefits that accrue from improving the data quality of an organization. Many of these benefits are “intangible” or unreasonable to measure.
Benefits such as improved speed to solutions, a “single version of the truth,” improved customer satisfaction, improved morale, an enhanced corporate image and consistency between systems all accumulate, but an organization must selectively choose which benefits to analyze further and convert to hard dollars. ROI must be measured on hard dollars.
A program approach to data quality is required to measure data quality ROI. Data quality improvement is not just another technology to implement. We must change our way of doing business to fully exploit data. Investments in the technologies as well as in organizational changes are necessary to reap the full rewards. Data quality is right in the “sweet spot” of modern business objectives that recognize that whatever business a company is in, it is also in the business of data. Those companies with more data, cleaner data, accessible data and the means to use that data will come out ahead.
But how do organizations begin and justify data quality improvements? The cleansing process and maintenance will cost money and may require dedicated staff. Management may not release funds dedicated to data quality improvement until such time as the cleansing will improve the reliability and accuracy of key business processes like trending, analysis, or billing for product sales. Data quality can — and must — measure its success based on its contribution to the improvement of such objectives.
However, until now there has not been a methodology to articulate and improve data quality ROI. You can’t improve what you can’t measure, so we need a means for measuring the quality of our data warehouse. Abstracting quality into a set of agreed data rules and measuring the occurrences of quality violations provides the measurement in the methodology, which was developed in conjunction with successful IT best-practice efforts in data quality.
OVERALL APPROACH TO DATA QUALITY ROI

Step 1: System Inventory

In an environment that has not focused on its data quality, there are likely to be many systems that require clean data to operate well and to make or influence correct decisions.
In the System Inventory step, company systems are surveyed and prioritized according to their use of and need for quality data. These high-priority systems rely on their data being accurate in order to perform their intended function, and they can be found across the artificial boundary between operational systems and decision support or data warehouse systems. Systems needing clean data come from all categories, including Enterprise Resource Management (ERM), Customer Relationship Management (CRM), Data Warehouse, E-Commerce and Call Center or customer service.
For the purposes of this paper, we will use a targeted marketing example to better explain the data quality ROI process. The concepts applied to this targeted marketing example can be applied to other IT systems, whether business intelligence, CRM or other critical applications.
In the next step, we’ll drill down on the detail in those systems.
Step 2: Data Quality Rule Determination

Data quality can be defined as a lack of intolerable defects. There is a finite set of possibilities that can constitute data quality defects and that categorize all data quality rules, as found in the list below. This step defines the rules that you wish the data to conform to, and these rules can apply wherever important data resides.
□ Referential Integrity (RI) refers to the integrity of reference between data in related tables.
For example, product identifiers in sales records need to also be found in the product tables. RI is essential to prevent the users from navigating to a “dead end.” Many transactional systems utilize database-enforced RI, but many data warehouse systems do not due to the performance and order restrictions it places on data loading. But RI is still important and must be enforced somehow. In the targeted marketing system, for example, bad RI may mean sales with an undefined product and/or to an undefined customer, which effectively takes that sale out of the analysis.
□ Field(s) assigned to uniquely identify a business entity, like a product identifier, should also be unique. It would constitute a violation when the same identifier is used to identify multiple products. This also has implications for RI. Not only do we not want “dead end” navigations in the data, but we also want a singular reference. In the targeted marketing system, for example, a single product for product identifier number 123 would be expected.
□ Cardinality restricts the number of references allowed for any given value. For example, in the targeted marketing system, we may expect to find between one and three addresses for a customer – no more and no fewer. Other relationships may be 1-to-1, 1-to-many (unbounded) or 0-to-many. The 0-to-many case is interesting because it allows for the entry of a product without a sale. Reasonable conditions can be placed on the cardinality in this case whereby such products would be “new” products – perhaps only products with an entry date in the last 14 days.
These first three components of cleanliness can be derived, for databases, from review of the data model.
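Assuming simple in-memory record sets, the first three rule types can be sketched as checks like the following. The table and column names are illustrative, not taken from the paper:

```python
from collections import Counter

# Illustrative data standing in for the targeted marketing tables.
products = [{"product_id": 123}, {"product_id": 456}]
sales = [{"sale_id": 1, "product_id": 123},
         {"sale_id": 2, "product_id": 789}]   # 789 is an undefined product
addresses = [{"customer_id": "C1"}, {"customer_id": "C1"},
             {"customer_id": "C2"}]
customers = ["C1", "C2", "C3"]                # C3 has no address at all

# Referential integrity: every sale must reference a defined product.
known = {p["product_id"] for p in products}
ri_violations = [s for s in sales if s["product_id"] not in known]

# Uniqueness: a product identifier must identify exactly one product.
id_counts = Counter(p["product_id"] for p in products)
dup_ids = [pid for pid, n in id_counts.items() if n > 1]

# Cardinality: each customer should have between one and three addresses.
addr_counts = Counter(a["customer_id"] for a in addresses)
card_violations = [c for c in customers if not 1 <= addr_counts.get(c, 0) <= 3]

print(ri_violations, dup_ids, card_violations)
```

In a production setting the same logic would run as queries against the source tables; the point is that each rule reduces to a mechanical test whose violations can be counted.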
□ Subtype/Supertype constructs may also be found in the model. For example, the targeted marketing system may represent a customer’s marital status as married, divorced, widowed or single. Different constructs may apply to each state, such as former-spouse data for divorced persons. Subtype/supertype bounds the states of a given business entity. The logical progression through the various states for a business entity (i.e. prospect, customer, good customer, former customer) is another quality check.
Having a robust logical data model from which one can glean rules is an asset in determining rules for your data quality process. If the data model is missing or inadequate, this may be required before completing this step.
There are also measures associated with data value appropriateness such as the next two rules:
□ Are there reasonable domains for values? This is especially useful for analyzing numeric data, whose data types can allow an enormous range of values, only a subset of which are reasonable for the field. Examples include incomes above $2,000,000 or below $0; last-year purchases above $500,000 (at a department store, for example) or below $0; and offers mailed above 1,000,000 or below 1. All of these values would be considered unusual at the least and quite possibly incorrect, which is why we want to flag them for investigation.
□ Are data columns being used for multiple meanings? Data types constrain the values in a column … to a degree. Any mix of 0-20 characters can go into a character(20) data type column. However, if this is a Name field, there are some characters, such as % and $, that you would not expect to find; fields containing them would be flagged as holding inappropriate data. There are also numerous misspellings and incorrect alternative spellings of last names.
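These two value-appropriateness rules might be encoded as follows. The domain bounds, column names and forbidden-character set are illustrative assumptions, not rules from the paper:

```python
import re

# Hypothetical rule definitions: numeric domains and characters
# that should never appear in a name column.
DOMAIN_RULES = {
    "income": (0, 2_000_000),
    "offers_mailed": (1, 1_000_000),
}
NAME_FORBIDDEN = re.compile(r"[%$]")  # characters unexpected in a name

def value_violations(record):
    """Return human-readable flags for out-of-domain or inappropriate values."""
    flags = []
    for col, (low, high) in DOMAIN_RULES.items():
        if col in record and not low <= record[col] <= high:
            flags.append(f"{col} out of domain: {record[col]}")
    name = record.get("name", "")
    if NAME_FORBIDDEN.search(name):
        flags.append(f"inappropriate characters in name: {name!r}")
    return flags

print(value_violations({"income": 3_500_000, "name": "J$hn Smith"}))
```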
There are various formatting errors that can be found in a field. Typical formatting errors found in name columns include, for example:

□ A space in front of a name
□ Two spaces between first and last name and/or middle initial
□ No period after a middle initial
□ Inconsistent use of a middle initial (sometimes used, sometimes not)
□ Use of ALL CAPS
□ Use of “&” instead of “and” when indicating plurality
□ Use of a slash (/) instead of a hyphen (-)

In environments where original data entry is “free form,” unconstrained and without the use of master data as a reference, many of the violations listed above will be found.
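Several of the name-format rules above reduce to simple string predicates. A minimal sketch, with an illustrative (and deliberately incomplete) rule set:

```python
# Map each format rule to a predicate that returns True on a violation.
# Rules like "inconsistent middle initial" need cross-record context
# and are omitted from this single-value sketch.
FORMAT_CHECKS = {
    "leading space":        lambda s: s != s.lstrip(),
    "double space":         lambda s: "  " in s,
    "all caps":             lambda s: s.isupper(),
    "'&' instead of 'and'": lambda s: "&" in s,
    "slash for hyphen":     lambda s: "/" in s,
}

def format_violations(name):
    """Return the names of every format rule the value violates."""
    return [rule for rule, check in FORMAT_CHECKS.items() if check(name)]

print(format_violations(" JOHN  SMITH & JANE"))
```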
Another type of typical data quality error involves columns that should only contain data if certain data values (or null) are present elsewhere. For example, in the customer table where some customers are organizations and others are individuals, only organizations will have SIC code, organization size, etc. (unless the presence of that data actually represents data for the individual’s organization).
Some systems derive data from data in other systems. For example, while one system may contain discount amount and unit price, another may carry the discount percentage, a simple division of the two. Checking the veracity of calculations like customer lifetime value for our targeted marketing system is a common quality check.
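A derived-value check of this kind amounts to recomputing the value from its inputs and flagging rows where the stored figure disagrees. The column names and tolerance below are illustrative:

```python
# Hypothetical cross-check of a derived column: the stored discount
# percentage should equal discount amount divided by unit price.
TOLERANCE = 0.001  # allow for rounding in the stored value

rows = [
    {"unit_price": 100.0, "discount_amount": 10.0, "discount_pct": 0.10},
    {"unit_price": 80.0,  "discount_amount": 8.0,  "discount_pct": 0.25},  # wrong
]

bad_rows = [
    r for r in rows
    if abs(r["discount_amount"] / r["unit_price"] - r["discount_pct"]) > TOLERANCE
]
print(len(bad_rows))
```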
Another common quality check is that data exists consistently through all of its history, without gaps, so that users are not left to wonder what happened during missing periods.
There is also incorrect data: for example, the inconsistent use of initials, and misspellings or inappropriate spellings (e.g., William McNight instead of William McKnight).
Finally, does the data conform to the expected set of “clean” values for the column? For example, in the gender column we would expect to find M, F, U (unknown) or the like.
Data quality rule determination establishes the rules that will govern our data. Data profiling, the next step, determines where we currently stand with the quality of that data.
Step 3: Data Profiling

It’s one thing to sit back and pontificate about what rules the data ought to conform to.
The next step is to determine the data quality with a data profiling and prioritization exercise. Usually no one can articulate how clean or dirty corporate data is. Without this measurement of cleanliness, the effectiveness of activities that are aimed at improving data quality cannot be measured.
Measuring data quality begins with an inventory. By taking account of important data across several tangible factors that can be used to measure data quality, you can begin to translate the vague feelings of dirtiness into something tangible. In so doing, focus can be brought to those actions that can improve important quality elements. Ultimately data quality improvement will be performed against a small subset of data elements, as you will find most elements already conform to standard. The subset must be selected carefully, however. Another way to put it is that data quality initiatives will not be comprehensive across all corporate data elements and all possibilities.
Data profiling can be performed with software programs like IQ Insight™ from Firstlogic, or with SQL or similar queries against the data that show the spread of data in the systems and check for rule adherence. For our targeted marketing example, we may find that the gender column is distributed as shown in Figure 1 below.
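Without a dedicated tool, a profiling pass over the gender column might look like the following sketch. The data and the agreed clean-value set are illustrative, not the figures behind Figure 1:

```python
from collections import Counter

# Distribution of a gender column plus the share of values that fall
# outside the agreed clean set {M, F, U}.
gender_values = ["M", "F", "F", "m", "Male", "U", "F", "", "M"]
CLEAN_SET = {"M", "F", "U"}

distribution = Counter(gender_values)
dirty = sum(n for v, n in distribution.items() if v not in CLEAN_SET)

print(distribution)
print(f"{dirty / len(gender_values):.0%} of values violate the rule")
```

The distribution itself is the profiling output; the violation percentage is the measurement the rest of the methodology builds on.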