
Data Quality Research that Builds on Data Conﬁdentiality

Alan F. Karr

National Institute of Statistical Sciences

PO Box 14006, Research Triangle Park, NC 27709, USA

karr@niss.org

Abstract

In broad terms, data quality is the capability of data to inform sound decisions. Therefore, data quality is itself a decision problem: ensuring or improving data quality consumes resources that could be devoted to other purposes. Ultimately, data quality should be measured in terms of its effect on decisions. This is not possible now, so the research takes the crucial ﬁrst step in this direction: data quality is quantiﬁed as the effect on inferences drawn from the data.

We describe research on data quality that has as its foundation data conﬁdentiality, a setting in which data quality is lowered deliberately in order to preserve conﬁdentiality. This is one, and possibly the only, context in which inference-based measures of data quality have been studied scientiﬁcally.

1 Introduction: Data Quality as a Decision Problem

Data quality is a massive and growing problem in statistics. As the quantity of data grows explosively, the ability to improve or even characterize quality decreases correspondingly. No government agency, corporation or academic researcher is immune. The needs, which range from abstractions and quantified measurements of data quality to software tools, are daunting. This research moves toward a science of data quality, with the underlying premise that data quality is a decision problem.

In the broadest sense, data quality (DQ) is the capability of data to inform sound decisions, whether those decisions be scientific, policy-oriented or in some other context. From Karr et al. (2006b): “Data quality is the capability of data to be used effectively, economically and rapidly to inform and evaluate decisions.”

Specifically, the research is framed by the view that data quality should measure the capability of data to support sound decisions based on statistical inferences drawn from the data. Ultimately, therefore, DQ should be measured in terms of its effect on decisions. However, doing so is not feasible currently or in the near-term future. The research takes a crucial first step in this direction: DQ is quantified as the effect on inferences drawn from the data.

What makes DQ a decision problem is that DQ comes only at a cost, which may be economic or not.

An explicit formulation appears in §4.3. The special case of surveys, in which the associated tradeoffs are rather explicit, is treated in Karr and Last (2006).

Our approach to understanding uncontrollable DQ effects is to build on extensive knowledge about controllable DQ effects. In the context of data confidentiality, DQ (often referred to as data utility) conflicts directly with disclosure risk. Multiple methods of statistical disclosure limitation (SDL) have been developed that reduce quality deliberately in order to limit disclosure risk. These can inform DQ issues!

Moreover, data conﬁdentiality is one setting in which we have a scientiﬁc understanding of DQ, in large part because there is access to both the original data and the masked (disclosure-protected and quality-reduced) data. We use data conﬁdentiality research as the basis for theory, methods and software tools to deal with DQ more generally.

Necessarily, DQ is multi–dimensional, going beyond record-level accuracy to include such factors as accessibility, relevance, timeliness, metadata, documentation, user capabilities and expectations, cost and context-speciﬁc domain knowledge. The OMB Guidelines for Information Quality (Ofﬁce of Management and Budget, 2002) address conclusions drawn from data as well as the data themselves, employing dimensions of Objectivity, Utility and Integrity. In unpublished work for the Bureau of Transportation Statistics (BTS), the National Institute of Statistical Sciences (NISS) deﬁned a different set of conceptual dimensions of DQ, grouped into three hyperdimensions: Process, Data, and User.

DQ exists in a changing world. More and more data are about individuals, raising important questions of conﬁdentiality. Humans play essential roles in many data collection processes, and therefore in determining data quality. Multiple sources of data, of differing qualities, must often be integrated, and a “lowest common denominator” approach does not seem promising. Data are of large and increasing scale, and increasingly unstructured, such as free-form text and image data (Fendt, 2004). Finally, unanticipated uses of the data are becoming ubiquitous.

2 Problem Formulation

For the purposes of this paper, a database D is a flat file in which rows correspond to data subjects and columns to numerical or categorical attributes of those subjects.

It is helpful to have an abstraction of the “truth,” a database we denote by Dtrue. The central problem of DQ is that Dtrue exists only conceptually. In reality there is instead an actual database Dactual that may fall short of Dtrue along any of the dimensions described in §1. Characterization of the nature and extent of the differences between Dactual and Dtrue is the focus of §3. From a statistical perspective, such characterizations are inevitably Bayesian. Let K represent the knowledge available to the user of the data. Minimally, this knowledge comprises Dactual, but there is almost always some information, even if only anecdotal, about how Dactual and Dtrue differ.1 The posterior distribution

P{Dtrue = (·) | K}    (1)

then represents what is known about Dtrue given K.

Sound decisions, whether policy- or science-driven, are made on the basis of statistical analyses of the data, which conceptually are vector-valued functions f(D) of a database D. For instance, for categorical data, f(D) may consist of the entire set of fitted values of the associated contingency table under a well-chosen log-linear model, together with associated uncertainties.
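For categorical data, such an f can be made concrete. The sketch below (Python with NumPy) computes the full set of fitted cell counts for a two-way contingency table; the independence log-linear model is chosen purely for simplicity and is an assumption of this illustration, not a model prescribed by the paper.

```python
import numpy as np

def fitted_values_independence(table):
    """Analysis f(D): fitted cell counts for a two-way contingency
    table under the independence log-linear model, m_ij = r_i * c_j / N."""
    table = np.asarray(table, dtype=float)
    row = table.sum(axis=1)   # row margins r_i
    col = table.sum(axis=0)   # column margins c_j
    n = table.sum()           # grand total N
    return np.outer(row, col) / n

# f(D) returns the entire set of fitted values as a vector-valued summary
D = np.array([[10, 20],
              [30, 40]])
print(fitted_values_independence(D))  # margins are preserved by construction
```

Richer log-linear models would replace the closed-form fit with iterative proportional fitting, but the output, a full table of fitted values, plays the same role as f(D) in the discussion above.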

1 Example: for data obtained by means of a survey, the response rate is known, and there may have been a nonresponse bias analysis.

As a step in the direction of comparing decisions made on the basis of Dactual to those that would have been made on the basis of Dtrue, one can compare the associated statistical analyses, using a measure

d(f(Dactual), f(Dtrue)),    (2)

where d is a numerical measure of the fidelity of inferences, which will always depend on f and may be vector-valued. To illustrate, in Karr et al. (2006a), for regression, d measures the overlap of 90% confidence intervals for individual coefficients or 90% confidence regions for the entire set of coefficients, or can be Kullback–Leibler divergence. Additional measures are discussed in Reiter et al. (2009).
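The interval-overlap idea can be sketched in a few lines. The symmetric averaging convention below is an assumption of this sketch, chosen to follow the spirit of the interval-overlap utility measure rather than its exact published form.

```python
def ci_overlap(ci_a, ci_b):
    """Average fractional overlap of two confidence intervals,
    each given as a (lower, upper) pair.

    Returns 0 when the intervals are disjoint and 1 when identical.
    """
    lo = max(ci_a[0], ci_b[0])
    hi = min(ci_a[1], ci_b[1])
    inter = max(0.0, hi - lo)          # length of the intersection
    len_a = ci_a[1] - ci_a[0]
    len_b = ci_b[1] - ci_b[0]
    # average the intersection's share of each interval
    return 0.5 * (inter / len_a + inter / len_b)

print(ci_overlap((0.0, 10.0), (5.0, 15.0)))  # prints 0.5
```

Applied coefficient by coefficient, this yields a vector-valued d; averaging over coefficients gives a single summary of inferential fidelity.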

Only in experimental settings is the computation in (2) possible. Instead, one would calculate the posterior expectation

E[ d(f(Dactual), f(Dtrue)) | K ],    (3)

where the expectation is taken under the posterior distribution (1); written out,

∫ d(f(Dactual), f(D)) P{Dtrue = dD | K}.    (4)

In practice, computational issues may preclude implementation of (4), and some of the research addresses identifying and understanding feasible alternatives.

In the setting of §4, resources are expended to improve data quality, leading to a cleaned-up database Dcleaned. The user knowledge K then includes what is learned about Dtrue in the course of producing Dcleaned, and the comparison for a statistical analysis f is

d(f(Dcleaned), f(Dtrue)).    (5)

3 Measuring Data Quality

The DQ literature contains many data quality metrics, meant for such purposes as comparing DQ across databases and quantifying changes over time in DQ, as well as supporting prediction of the effects of DQ improvement strategies. Metrics apply at multiple scales—individual records, databases and integrated databases. But there is a striking gap in the data quality literature—the absence of metrics related to inferences based on the data. One thrust of our research is to begin to fill it.

3.1 Inference-Based Measures of Data Quality

The context comes from §2. There are a true database Dtrue, an actual database Dactual and user knowledge K. Given a statistical analysis f on which some decision will be based, we seek to understand how and to what extent a decision based on Dactual differs from one that would have been based on Dtrue. The problem is that we do not know Dtrue.

A central theme of the research is to exploit a large corpus of research on data confidentiality. In that setting, DQ is lowered deliberately in order to accommodate a competing objective—reducing disclosure risks, whether of the identities of data subjects or the values of sensitive attributes (Doyle et al., 2001; Willenborg and de Waal, 2001). Specifically, an official statistics agency holds an original database Doriginal, from which it constructs, using methods of statistical disclosure limitation (SDL), a masked database Dmasked that is released to the public or to researchers.

Typically, there are multiple candidate versions of Dmasked, among which the agency chooses. Risk– utility formulations (Gomatam et al., 2005; Karr et al., 2006a) allow the choice to be a principled one: each candidate Dmasked has associated with it quantiﬁed measures of disclosure risk and data utility.

Data confidentiality research at NISS and elsewhere has produced a number of inference-based measures of data utility. These measures compare f(Dmasked) to f(Doriginal) for statistical analyses f of interest.

Examples include confidence interval and region overlap for estimated regression coefficients (Karr et al., 2006a), likelihood functions for log-linear models (Dobra et al., 2003), Hellinger distance between categorical databases (Gomatam et al., 2005), propensity scores applied to the union of the original and masked databases (Woo et al., 2009) and verification servers (Reiter et al., 2009). Symbolically, these measures are of the form

d(f(Dmasked), f(Doriginal)),    (6)

where, as in (2), d is a measure of the fidelity of analyses, possibly multi-dimensional.
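Of these, the Hellinger distance is the easiest to state: with cell proportions p and q over a common set of cells, H(p, q) = (1/√2)·‖√p − √q‖₂. A minimal sketch (Python with NumPy; the shared cell ordering is an assumption of the representation):

```python
import numpy as np

def hellinger(counts_a, counts_b):
    """Hellinger distance between two categorical databases,
    each summarized as cell counts over a common set of cells."""
    p = np.asarray(counts_a, dtype=float)
    q = np.asarray(counts_b, dtype=float)
    p = p / p.sum()   # convert counts to cell proportions
    q = q / q.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# identical tables are at distance 0; disjoint support gives distance 1
print(hellinger([10, 20, 30], [10, 20, 30]))  # prints 0.0
print(hellinger([1, 0], [0, 1]))              # prints 1.0
```

Because H lies in [0, 1] regardless of table size, it is convenient for comparing the utility of candidate masked databases on a common scale.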

The first challenge is to apply the technology embedded in (6) to construct inference-based measures for other data quality settings, keeping in mind that d(f(Dactual), f(Dtrue)) cannot be calculated because Dtrue is not known, and that its estimation may entail approximations.

The initial step is to apply several SDL procedures M to Dactual, yielding altered databases Dactual(M), and to use the differences

d(f(Dactual(M)), f(Dactual))    (7)

to understand and reason about d(f(Dactual), f(Dtrue)).

This process may be highly insightful. There is compelling evidence that DQ problems attenuate structure in data.2 Consider an analysis f that is robust against quality problems in Dactual in the sense that d(f(Dactual), f(Dtrue)) is small. Then it seems plausible that d(f(Dactual(M)), f(Dactual)) will also be small, especially when M is applied “at low intensity,” a point discussed further in §3.2. More compellingly, if d(f(Dactual(M)), f(Dactual)) is large, again especially when M has low intensity,3 then there is good reason to suspect that d(f(Dactual), f(Dtrue)) is also large, which implies that f(Dactual) cannot be relied on as a surrogate for f(Dtrue).
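The probing idea can be sketched end-to-end. In the sketch below, M is additive Gaussian noise at low intensity, f is the vector of OLS regression coefficients, and d is Euclidean distance; all three choices are illustrative assumptions, not the specific instruments of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(0)

def f_ols(D):
    """Analysis f: OLS coefficients, regressing the last column on the rest."""
    X = np.column_stack([np.ones(len(D)), D[:, :-1]])
    y = D[:, -1]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def apply_noise(D, sigma):
    """SDL method M: additive Gaussian noise; sigma is the intensity."""
    return D + rng.normal(0.0, sigma, size=D.shape)

# a synthetic "actual" database with clear linear structure
n = 500
x = rng.normal(size=(n, 2))
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + rng.normal(0.0, 0.1, n)
D_actual = np.column_stack([x, y])

# d(f(Dactual(M)), f(Dactual)) at low intensity
D_masked = apply_noise(D_actual, sigma=0.05)
d = np.linalg.norm(f_ols(D_masked) - f_ols(D_actual))
print(d)  # small here, consistent with f being robust to low-intensity M
```

A large value of d at such a low intensity would, per the reasoning above, cast doubt on f(Dactual) as a surrogate for f(Dtrue).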

Questions then abound; here are three illustrations. What methods M are most informative about which analyses f? What are appropriate analyses f? How can we attach uncertainties to estimates of d(f(Dactual ), f(Dtrue ))?

3.2 Statistical Models for SDL

As intimated in §3.1, one basis for believing that the approach there will work is that many SDL methods have parameters representing the intensity with which they are applied.4 Consider now Figure 1, which depicts how increasing the intensity of SDL moves the database further from the truth Dtrue.

2 Which is just what SDL does, ideally attenuating confidentiality-threatening high-dimensional structure without distorting statistically informative low-dimensional structure.

3 Put differently, if even small, controlled alterations to Dactual change the results of the analysis.

4 For additive noise, the noise variance is such a parameter. For data swapping, the swap rate is an intensity (Gomatam et al., 2005). For microaggregation, the larger the cluster size, the more severe the quality effects (Oganian and Karr, 2006).
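The swap rate as an intensity parameter can be made concrete. The sketch below (Python with NumPy; the random-pairing scheme is a simplification assumed here) exchanges one attribute's values between randomly chosen record pairs: the rate controls how many records are perturbed, while the attribute's univariate margin is preserved exactly.

```python
import numpy as np

def swap_column(D, col, rate, rng):
    """Data swapping: exchange values of attribute `col` between
    randomly chosen record pairs; `rate` is the swap intensity."""
    D = D.copy()
    n = len(D)
    n_pairs = int(rate * n / 2)
    # draw 2 * n_pairs distinct records and split them into pairs
    idx = rng.choice(n, size=2 * n_pairs, replace=False)
    a, b = idx[:n_pairs], idx[n_pairs:]
    D[a, col], D[b, col] = D[b, col].copy(), D[a, col].copy()
    return D

rng = np.random.default_rng(1)
D = rng.integers(0, 5, size=(100, 3))
D_swapped = swap_column(D, col=0, rate=0.2, rng=rng)

# the swapped column's values form the same multiset as before
print(np.array_equal(np.sort(D[:, 0]), np.sort(D_swapped[:, 0])))  # prints True
```

Only associations between the swapped column and the others degrade as the rate grows, which is exactly the kind of controlled quality reduction that §3.1 proposes to exploit.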

Figure 1: How increasing the intensity of SDL moves the database further from the truth Dtrue.