
National Institute of Statistical Sciences
PO Box 14006
Research Triangle Park, NC 27709–4006
Tel: 919.685.9300  FAX: 919.685.9310

The Statistics Community Serving the Nation

Data Quality Research that Builds on Data Confidentiality

Alan F. Karr

National Institute of Statistical Sciences

PO Box 14006, Research Triangle Park, NC 27709, USA



In broad terms, data quality is the capability of data to inform sound decisions. Therefore, data quality is itself a decision problem: ensuring or improving data quality consumes resources that could be devoted to other purposes. Ultimately, data quality should be measured in terms of its effect on decisions. This is not possible now, so the research takes the crucial first step in this direction: data quality is quantified as the effect on inferences drawn from the data.

We describe research on data quality that has as its foundation data confidentiality, a setting in which data quality is lowered deliberately in order to preserve confidentiality. This is one, and possibly the only, context in which inference-based measures of data quality have been studied scientifically.

1 Introduction: Data Quality as a Decision Problem

Data quality is a massive and growing problem in statistics. As the quantity of data grows explosively, the ability to improve or even characterize quality decreases correspondingly. No government agency, corporation or academic researcher is immune. The needs, which range from abstractions and quantified measurements of data quality to software tools, are daunting. This research moves toward a science of data quality, with the underlying premise that data quality is a decision problem.

In the broadest sense, data quality (DQ) is the capability of data to inform sound decisions, whether those decisions be scientific, policy-oriented or in some other context. From Karr et al. (2006b): "Data quality is the capability of data to be used effectively, economically and rapidly to inform and evaluate decisions."

Specifically, the research is framed by the view that data quality should measure the capability of data to support sound decisions based on statistical inferences drawn from the data. Ultimately, therefore, DQ should be measured in terms of its effect on decisions. However, doing so is not feasible currently or in the near-term future. The research takes a crucial first step in this direction: DQ is quantified as the effect on inferences drawn from the data.

What makes DQ a decision problem is that DQ comes only at a cost, which may or may not be economic.

An explicit formulation appears in §4.3. The special case of surveys, in which the associated tradeoffs are rather explicit, is treated in Karr and Last (2006).

Our approach to understanding uncontrollable DQ effects is to build on extensive knowledge about controllable DQ effects. In the context of data confidentiality, DQ (often referred to as data utility) conflicts directly with disclosure risk. Multiple methods of statistical disclosure limitation (SDL) have been developed that reduce quality deliberately in order to limit disclosure risk, and these methods can inform DQ issues more generally.

Moreover, data confidentiality is one setting in which we have a scientific understanding of DQ, in large part because there is access to both the original data and the masked (disclosure-protected and quality-reduced) data. We use data confidentiality research as the basis for theory, methods and software tools to deal with DQ more generally.

Necessarily, DQ is multi-dimensional, going beyond record-level accuracy to include such factors as accessibility, relevance, timeliness, metadata, documentation, user capabilities and expectations, cost and context-specific domain knowledge. The OMB Guidelines for Information Quality (Office of Management and Budget, 2002) address conclusions drawn from data as well as the data themselves, employing dimensions of Objectivity, Utility and Integrity. In unpublished work for the Bureau of Transportation Statistics (BTS), the National Institute of Statistical Sciences (NISS) defined a different set of conceptual dimensions of DQ, grouped into three hyperdimensions: Process, Data and User.

DQ exists in a changing world. More and more data are about individuals, raising important questions of confidentiality. Humans play essential roles in many data collection processes, and therefore in determining data quality. Multiple sources of data, of differing qualities, must often be integrated, and a “lowest common denominator” approach does not seem promising. Data are of large and increasing scale, and increasingly unstructured, such as free-form text and image data (Fendt, 2004). Finally, unanticipated uses of the data are becoming ubiquitous.

2 Problem Formulation

For the purposes of this paper, a database D is a flat file in which rows correspond to data subjects and columns to numerical or categorical attributes of those subjects.

It is helpful to have an abstraction of the “truth,” a database we denote by Dtrue. The central problem of DQ is that Dtrue exists only conceptually. In reality there is instead an actual database Dactual that may fall short of Dtrue along any of the dimensions described in §1. Characterization of the nature and extent of the differences between Dactual and Dtrue is the focus of §3. From a statistical perspective, such characterizations are inevitably Bayesian. Let K represent the knowledge available to the user of the data. Minimally, this knowledge comprises Dactual, but there is almost always some information, even if only anecdotal, about how Dactual and Dtrue differ.1 The posterior distribution

P{Dtrue = (·)|K} (1)

then represents what is known about Dtrue given K.

Sound decisions, whether policy- or science-driven, are made on the basis of statistical analyses of the data, which conceptually are vector-valued functions f(D) of a database D. For instance, for categorical data, f(D) may consist of the entire set of fitted values of the associated contingency table under a well-chosen log-linear model, together with associated uncertainties.
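To make such an f concrete, here is a minimal numpy sketch of the simplest case: the fitted cell counts of a two-way contingency table under the independence log-linear model. The paper does not fix a particular model, so both the model and the toy table are illustrative.

```python
import numpy as np

def fitted_independence(table):
    """Fitted cell counts for a two-way contingency table under the
    independence log-linear model: m_ij = (row total * column total) / n."""
    table = np.asarray(table, dtype=float)
    rows = table.sum(axis=1, keepdims=True)
    cols = table.sum(axis=0, keepdims=True)
    return rows * cols / table.sum()

# f(D): the analysis output is the entire vector of fitted values
D = np.array([[30, 10],
              [20, 40]])
print(fitted_independence(D))
# [[20. 20.]
#  [30. 30.]]
```

For richer log-linear models, the fitted values would come from an iterative-proportional-fitting or GLM routine; the point here is only that f(D) is a deterministic vector-valued summary of D.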

1 Example: for data obtained by means of a survey, the response rate is known, and there may have been a nonresponse bias analysis.

As a step in the direction of comparing decisions made on the basis of Dactual to those that would have been made on the basis of Dtrue, one can compare the associated statistical analyses, using a measure

d( f(Dactual), f(Dtrue) ),  (2)

where d is a numerical measure of the fidelity of inferences, which will always depend on f and may be vector-valued. To illustrate, in Karr et al. (2006a), for regression, d measures the overlap of 90% confidence intervals for individual coefficients or 90% confidence regions for the entire set of coefficients, or can be Kullback–Leibler divergence. Additional measures are discussed in Reiter et al. (2009).
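The confidence-interval overlap idea can be sketched as follows. The averaging convention used below (length of the intersection as a fraction of each interval, averaged over the two intervals) is one common form of the measure and is assumed here rather than taken from the paper.

```python
def ci_overlap(l1, u1, l2, u2):
    """Fidelity measure d for one coefficient: the length of the
    intersection of two confidence intervals, expressed as a fraction
    of each interval's length and averaged over the two intervals.
    Returns 1 for identical intervals, <= 0 for disjoint ones."""
    inter = min(u1, u2) - max(l1, l2)
    return 0.5 * (inter / (u1 - l1) + inter / (u2 - l2))

# 90% CIs for the same regression coefficient from two databases
print(ci_overlap(0.0, 1.0, 0.5, 1.5))   # half of each interval overlaps: 0.5
print(ci_overlap(0.0, 1.0, 0.0, 1.0))   # identical intervals: 1.0
```

For a whole regression, d would be the vector of these overlaps across coefficients, or a single region-overlap number.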

Only in experimental settings is the computation in (2) possible. Instead, one would calculate the posterior expectation

E[ d( f(Dactual), f(Dtrue) ) | K ]  (3)

= Σ_D d( f(Dactual), f(D) ) P{Dtrue = D | K},  (4)

where the sum runs over candidate databases D, weighted by the posterior (1).

In practice, computational issues may preclude implementation of (4), and some of the research addresses identifying and understanding feasible alternatives.
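One feasible alternative is Monte Carlo approximation of (4): draw plausible versions of Dtrue from the posterior and average the resulting values of d. The sketch below assumes a toy posterior in which each recorded value is the true value plus Gaussian measurement error; the data, the analysis f (a sample mean) and the error model are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy knowledge K: each recorded value equals the true value plus
# N(0, sigma^2) measurement error (an assumed, illustrative posterior).
sigma = 1.0
D_actual = rng.normal(10.0, 2.0, size=200)

f = lambda D: D.mean()        # the statistical analysis
d = lambda a, b: abs(a - b)   # fidelity measure, as in (2)

# Monte Carlo approximation of E[ d(f(D_actual), f(D_true)) | K ]:
# each draw is a plausible D_true given K
draws = [f(D_actual + rng.normal(0.0, sigma, size=D_actual.size))
         for _ in range(2000)]
estimate = np.mean([d(f(D_actual), fd) for fd in draws])
print(round(float(estimate), 3))
```

With a richer posterior (missingness, nonresponse bias), only the sampling step changes; the averaging scheme is the same.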

In the setting of §4, resources are expended to improve data quality, leading to a cleaned-up database Dcleaned. The user knowledge K then includes what is learned about Dtrue in the course of producing Dcleaned, and the comparison for a statistical analysis f is

d( f(Dcleaned), f(Dtrue) ).  (5)

3 Measuring Data Quality

The DQ literature contains many data quality metrics, meant for such purposes as comparing DQ across databases and quantifying changes over time in DQ, as well as supporting prediction of the effects of DQ improvement strategies. Metrics apply at multiple scales: individual records, databases and integrated databases. But there is a striking gap in the data quality literature: the absence of metrics related to inferences based on the data. One thrust of our research is to begin to fill it.

3.1 Inference-Based Measures of Data Quality

The context comes from §2. There are a true database Dtrue, an actual database Dactual and user knowledge K. Given a statistical analysis f on which some decision will be based, we seek to understand how, and to what extent, a decision based on Dactual differs from the one that would have been based on Dtrue. The problem is that we do not know Dtrue.

A central theme of the research is to exploit a large corpus of research on data confidentiality. In that setting, DQ is lowered deliberately in order to accommodate a competing objective: reducing disclosure risks, whether of the identities of data subjects or the values of sensitive attributes (Doyle et al., 2001; Willenborg and de Waal, 2001). Specifically, an official statistics agency holds an original database Doriginal, from which it constructs, using methods of statistical disclosure limitation (SDL), a masked database Dmasked that is released to the public or to researchers.

Typically, there are multiple candidate versions of Dmasked, among which the agency chooses. Risk–utility formulations (Gomatam et al., 2005; Karr et al., 2006a) allow the choice to be a principled one: each candidate Dmasked has associated with it quantified measures of disclosure risk and data utility.

Data confidentiality research at NISS and elsewhere has produced a number of inference-based measures of data utility. These measures compare f(Dmasked ) to f(Doriginal ) for statistical analyses f of interest.

Examples include confidence interval and region overlap for estimated regression coefficients (Karr et al., 2006a), likelihood functions for log-linear models (Dobra et al., 2003), Hellinger distance between categorical databases (Gomatam et al., 2005), propensity scores applied to the union of the original and masked databases (Woo et al., 2009) and verification servers (Reiter et al., 2009). Symbolically, these measures are of the form

d( f(Dmasked), f(Doriginal) ),  (6)

where, as in (2), d is a measure of the fidelity of analyses, possibly multi-dimensional.
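Of the measures just listed, Hellinger distance between categorical databases is the simplest to sketch; the cell counts below are invented for illustration.

```python
import numpy as np

def hellinger(counts_p, counts_q):
    """Hellinger distance between the cell-probability vectors of two
    categorical databases (e.g. original vs. masked contingency tables).
    0 means identical distributions; 1 is the maximum distance."""
    p = np.asarray(counts_p, dtype=float)
    q = np.asarray(counts_q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

original = [40, 30, 20, 10]   # cell counts of D_original
masked   = [38, 33, 19, 10]   # cell counts of a candidate D_masked
print(round(hellinger(original, masked), 4))
```

Here f is the empirical cell distribution itself and d is the Hellinger distance between the two distributions, so (6) is computed directly.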

The first challenge is to apply the technology embedded in (6) to construct inference-based measures for other data quality settings, keeping in mind that d( f(Dactual), f(Dtrue) ) cannot be calculated exactly because Dtrue is not known, so that any computation of it must entail approximations.

The initial step is to apply several SDL procedures M to Dactual, yielding altered databases Dactual(M), and to use the differences

d( f(Dactual(M)), f(Dactual) )  (7)

to understand and reason about d( f(Dactual), f(Dtrue) ).

This process may be highly insightful. There is compelling evidence that DQ problems attenuate structure in data.2 Consider an analysis f that is robust against quality problems in Dactual in the sense that d( f(Dactual), f(Dtrue) ) is small. Then it seems plausible that d( f(Dactual(M)), f(Dactual) ) will also be small, especially when M is applied "at low intensity," a point discussed further in §3.2. More compellingly, if d( f(Dactual(M)), f(Dactual) ) is large, again especially when M has low intensity,3 then there is good reason to suspect that d( f(Dactual), f(Dtrue) ) is also large, which implies that f(Dactual) cannot be relied on as a surrogate for f(Dtrue).
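The probing idea in (7) can be sketched end-to-end with one SDL method, data swapping: perturb Dactual at several intensities (swap rates) and observe how much a simple analysis f (here a single correlation) moves. The data and the choice of f are illustrative; an f for which d stays small at low intensity is exhibiting the robust behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# D_actual: two correlated numeric attributes (illustrative data)
n = 500
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

def f(xv, yv):
    """The statistical analysis: a single correlation coefficient."""
    return np.corrcoef(xv, yv)[0, 1]

def swap(values, rate, rng):
    """Data swapping M: exchange a fraction `rate` of one attribute's
    values between randomly chosen record pairs, breaking their links
    to the other attributes.  The swap rate is the intensity of M."""
    v = values.copy()
    k = (int(rate * len(v)) // 2) * 2            # an even number of records
    idx = rng.choice(len(v), size=k, replace=False)
    v[idx] = v[idx[::-1]]                        # pair up and exchange
    return v

# d( f(D_actual(M)), f(D_actual) ) at increasing intensity,
# averaged over repeated applications of M
d_by_rate = {}
for rate in (0.05, 0.20, 0.50):
    diffs = [abs(f(x, y) - f(x, swap(y, rate, rng))) for _ in range(50)]
    d_by_rate[rate] = float(np.mean(diffs))
    print(rate, round(d_by_rate[rate], 3))
```

As expected, the difference grows with the swap rate, which is exactly the intensity behavior exploited in §3.2.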

Questions then abound; here are three illustrations. What methods M are most informative about which analyses f? What are appropriate analyses f? How can we attach uncertainties to estimates of d(f(Dactual ), f(Dtrue ))?

3.2 Statistical Models for SDL

As intimated in §3.1, one basis for believing that the approach there will work is that many SDL methods have parameters representing the intensity with which they are applied.4 Consider now Figure 1.

2 Which is just what SDL does: ideally, attenuating confidentiality-threatening high-dimensional structure without distorting statistically informative low-dimensional structure.

3 Put differently, if even small, controlled alterations to Dactual change the results of the analysis.

4 For additive noise, the noise variance is such a parameter. For data swapping, the swap rate is an intensity (Gomatam et al., 2005). For microaggregation, the larger the cluster size, the more severe the quality effects (Oganian and Karr, 2006).

Figure 1: How increasing the intensity of SDL moves the database further from the truth Dtrue.
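The microaggregation example in footnote 4, cluster size as intensity, can be sketched in a univariate form (a real implementation would cluster multivariately; the data here are illustrative). Larger clusters average away more within-cluster detail, so the masked data drift further from the original, exactly the movement Figure 1 depicts.

```python
import numpy as np

def microaggregate(values, k):
    """Univariate microaggregation: sort the values, group them into
    clusters of size k, and replace each value by its cluster mean.
    The cluster size k is the intensity parameter."""
    order = np.argsort(values)
    out = np.empty(len(values))
    for start in range(0, len(values), k):
        idx = order[start:start + k]
        out[idx] = values[idx].mean()
    return out

rng = np.random.default_rng(2)
v = rng.normal(50.0, 10.0, size=300)

# Variance lost to masking (within-cluster variance) grows with k
var_loss = {}
for k in (3, 10, 50):
    var_loss[k] = float(np.var(v) - np.var(microaggregate(v, k)))
    print(k, round(var_loss[k], 2))
```

The variance lost is precisely the within-cluster variance removed by the averaging, a direct, monotone quality effect of the intensity parameter.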
