Proceedings of Statistics Canada Symposium 2014

Beyond traditional survey taking: adapting to a changing world

Big Data as a Data Source for Official Statistics: experiences at Statistics Netherlands

Piet J.H. Daas, Marco Puts, Martijn Tennekes, and Alex Priem¹


Abstract

More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives, and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. However, first experiences obtained with analyses of large amounts of Dutch traffic loop detection records, call detail records of mobile phones and Dutch social media messages reveal that a number of challenges need to be addressed before these data sources can be applied in official statistics. These challenges and the lessons learned during these initial studies are discussed and illustrated with examples. More specifically, the following topics are discussed: the three general types of Big Data discerned, the need to access and analyse large amounts of data, how we deal with noisy data and look at selectivity (and our own bias towards this topic), how to go beyond correlation, how we found people with the right skills and mindset to perform the work, and how we have dealt with privacy and security issues.

Key Words: Big Data, Official statistics, Challenges, Lessons learned.

1. Introduction

In our modern digital era, data touch nearly every aspect of our lives, from the way we shop on the web and travel by car or public transport to how we search for product information and communicate with friends and family. In addition, our whereabouts are captured by cameras, mobile phones and wireless local area networks. All these data are stored and can potentially be harvested. However, in their raw form these Big Data sources are not immediately valuable. To extract their meaning, one must be able to separate the signal from the noise, i.e. have statistical expertise in deriving information from large amounts of data. Here, knowledge of statistical inference from Big Data is needed (London Workshop, 2014). This is a relatively new area of expertise: the area of valid statistical analysis for Big Data is only just emerging (Fan et al., 2014). The challenge of these kinds of analyses is to extract the signal (if present) relevant for the topic of interest from a large and (very) noisy data set (Silver, 2010).

Big Data is a very interesting source for official statistics (Glasson et al., 2013) as it potentially enables the production of rapid and highly relevant official figures at relatively low cost. How this can be achieved in practice is a topic of interest for many National Statistical Institutes (NSIs). A number of challenges have been identified (see section 2). For instance, many Big Data sources are composed of observational data and, as a consequence, have no well-defined target population, often lack structure and are of varying quality. This makes it difficult to apply traditional statistical methods based on sampling theory. However, not every Big Data source faces the same issues.

By studying a number of Big Data sources, i.e. road sensor data, call detail records of mobile phones and social media messages, the group of Big Data researchers at Statistics Netherlands is gaining insight into how to study these sources, learning what works and what does not, and obtaining a valuable view of the potential application of Big Data for official statistics. This paper provides an overview of these findings.

¹ All authors are employees of Statistics Netherlands. Contact person: Piet Daas, CBS-weg 11, 6412 EX Heerlen, the Netherlands (pjh.daas@cbs.nl). The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.

2. Challenges

A number of challenges have been identified that need to be addressed when starting to use Big Data for official statistics (Daas and Van der Loo, 2013; Glasson et al., 2013; Struijs et al., 2014). An overview of the main ones is given below.

2.1 Access

Statistical institutes typically do not own Big Data sources. A first challenge thus is to obtain access to relevant sources. This implies agreements with data owners and data processors, who have their own concerns regarding costs, confidentiality and other issues. However, they might also benefit from cooperating with statistical agencies, for instance through the quality feedback NSIs provide. Terms and conditions have to be negotiated that are acceptable to both official statisticians and data providers.

2.2 Privacy

Privacy protection of individuals is imperative, but familiar approaches do not always work when dealing with Big Data. Moreover, when the legal situation is not clear, statisticians may have to fall back on ethical principles. Of critical importance is the public perception of any use of Big Data, as this has a direct impact on trust in official statistics. Concerns have been heightened by the revelations that intelligence agencies are among the most active Big Data users.

2.3 Methodology

Many Big Data sources are composed of event-driven observational data which are not designed for traditional statistical analysis. They lack well-defined target populations, data structures and quality guarantees. This makes it hard to apply statistical methods based on sampling theory (Daas and Puts, 2014a). For example, assessing selectivity issues is challenging (Buelens et al., 2014). Since an increasing number of Big Data sources are text-based or composed of images, the need to extract information from these kinds of ‘data’ sources increases. This calls for information extraction methods, such as text mining and machine learning techniques, that are not yet very familiar to official statisticians, although their relevance was identified several years ago (Fyhrlund et al., 2005; Saporta, 2000).
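To make the selectivity issue concrete, one elementary check is to compare how groups are represented in a Big Data source with how they are represented in a reference population such as a register. The sketch below is an illustration only and is not the approach of Buelens et al. (2014); the function name, the grouping variable and the toy counts are all assumptions made for the example.

```python
def representativity_ratios(source_counts, population_counts):
    """Return, per group, the share in the source divided by the share
    in the reference population.

    A ratio of 1.0 means the group is proportionally represented,
    above 1.0 over-represented, below 1.0 under-represented.
    Both arguments map a group label to a count of units.
    """
    source_total = sum(source_counts.values())
    population_total = sum(population_counts.values())
    if source_total == 0 or population_total == 0:
        raise ValueError("both sources must contain at least one unit")

    ratios = {}
    for group, population_count in population_counts.items():
        if population_count == 0:
            # A ratio is undefined for groups absent from the population.
            continue
        source_share = source_counts.get(group, 0) / source_total
        population_share = population_count / population_total
        ratios[group] = source_share / population_share
    return ratios


# Toy, made-up counts purely for illustration (not real data).
example = representativity_ratios(
    {"under_35": 60, "35_and_over": 40},
    {"under_35": 50, "35_and_over": 50},
)
```

Such ratios only describe selectivity with respect to variables that are observed in both sources; for many Big Data sources the background characteristics of the units are unknown, which is precisely what makes the assessment hard.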

2.4 Interpretation

Extracting statistical meaning from Big Data sources is not easy. A tweet, a phone call or a car passing a detection loop all relate to persons, but how to interpret these signals is far from obvious. For example, the interpretation of mobile phone data is hampered by several issues: people may carry multiple phones or none, children use phones registered to their parents, phones may be switched off, and so on. For social media messages, similar issues may arise when trying to identify characteristics of their authors. Remedies like deriving the gender and age of Twitter users from their choice of words appear feasible (Nguyen et al., 2013), but a lot still needs to be done (Daas and Burger, 2014).

2.5 Technology

An obvious challenge is the processing, storage and transfer of large data sets. Technological advances in the area of High Performance Computing may partly solve these issues. Having data processed at the source, which avoids both the transfer of large data sets and the duplication of storage, may also be considered (Hager and Wellein, 2010). The technological challenges include security mechanisms, which means that, for example, cheap cloud-based solutions are not an option for NSIs.

2.6 Continuity

Typically, official statistics take the form of time series. For many users, the continuity of these series is of the utmost importance. Many Big Data sources, however, have only recently emerged, are ever evolving and may disappear as quickly as they rise. This poses a risk for continuity and creates a need for a more flexible way of working.

3. Big Data studies

3.1 Sources

In this section, we discuss three typical examples of Big Data research conducted at Statistics Netherlands. Other Big Data related studies performed at our office include internet robots, scanner data and satellite images. Further opportunities, like analysing financial transactions, are currently being studied, the first challenge often being to get access to the data. Most of the above examples are still in the research phase, apart from scanner data, which have been in production for ten years now. Internet robots for the housing market are on the verge of being implemented in production. Note that administrative data are usually not considered Big Data, but the larger administrative sources like the population register, VAT data and wages and salaries records could be interpreted as such. Looking at these more traditional sources from a Big Data point of view may provide new insights.

3.2 Road sensor data

In the Netherlands, there are more than 60,000 road sensors, of which 20,000 are positioned on the Dutch highways. These sensors detect the number of passing vehicles in various length classes each minute. This results in a total of 230 million records a day for the highway sensors alone. The data are collected and stored by the National Data Warehouse for Traffic Information (NDW, www.ndw.nu/en/), a government body which provides the data to Statistics Netherlands. Since the data cannot be related back to individual vehicles, privacy concerns do not apply, which makes this data set attractive for experimentation.

The most important issue we ran into while studying road sensor data was that the quality of the data fluctuates tremendously. For some sensors, data for many minutes are not available and, because of the stochastic nature of the arrival times of vehicles at a road sensor, it is hard to directly derive the number of vehicles missed during these minutes. For this purpose, an adaptive filter was developed that is tuned to the stochastic behaviour of the arrival times of the vehicles at the sensor (Puts et al., 2014). The quality of the data varies not only per minute but also per day (Fig. 3.2-1). By correcting for missing data and combining the daily profiles provided by sensors on the same road sections, the coverage and quality of the data are improved. In this way we are able to produce traffic indices that describe the regional situation on the Dutch roads at the NUTS-3 level. Combining these regional findings gives a very good impression of the state of road traffic in the country (Daas et al., 2014).
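The idea of correcting minute-level gaps before building daily profiles can be sketched in a few lines. The example below does not reproduce the adaptive filter of Puts et al. (2014); it swaps in a much cruder stand-in (filling a missing minute with the median count observed for that same minute on other days) purely to show where such a correction sits in the processing. The function names and the data layout (one list of per-minute counts per day, with None for a missing minute) are assumptions.

```python
from statistics import median
from typing import List, Optional


def fill_missing_minutes(days: List[List[Optional[int]]]) -> List[List[float]]:
    """Fill missing per-minute vehicle counts for a single sensor.

    `days` holds one list per day; every list has the same number of
    minute slots, and a missing minute is represented by None.
    A missing value is replaced by the median of the counts observed
    for the same minute of the day on the other days.
    """
    n_minutes = len(days[0])
    if any(len(day) != n_minutes for day in days):
        raise ValueError("all days must have the same number of minutes")

    # Typical count per minute of the day, based on observed values only.
    minute_medians = []
    for m in range(n_minutes):
        observed = [day[m] for day in days if day[m] is not None]
        if not observed:
            raise ValueError(f"minute {m} was never observed")
        minute_medians.append(float(median(observed)))

    return [
        [minute_medians[m] if value is None else float(value)
         for m, value in enumerate(day)]
        for day in days
    ]


def daily_totals(filled_days: List[List[float]]) -> List[float]:
    """Sum the (gap-filled) minute counts into one total per day."""
    return [sum(day) for day in filled_days]


# Three toy 'days' of three minutes each, with two gaps (made-up numbers).
toy = [[10, None, 12], [8, 9, None], [12, 11, 14]]
filled = fill_missing_minutes(toy)
totals = daily_totals(filled)
```

A median-based fill ignores the stochastic arrival process that the adaptive filter is tuned to, so it should be read as a placeholder for that step rather than as an equivalent method.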

Figure 3.2-1 Daily profiles of a road sensor on the IJsselmeer dam (“Afsluitdijk”) during 196 subsequent days.


3.3 Mobile phone data

Nowadays, people carry mobile phones with them everywhere and use them frequently throughout the day. To manage the phone traffic, mobile phone companies need to process a lot of data. These data are closely associated with the behaviour of people; behaviour that is of interest for official statistics. For example, the traffic is relayed through geographically distributed phone masts, which makes it possible to determine the location of phone users. The relaying mast, however, may change several times during a call.

Through a three-party contract, Statistics Netherlands obtained access to call detail records (CDR) from a Dutch mobile phone company that holds approximately one third of the Dutch mobile phone market. The CDR data amount to 115 million records a day and contain information on both Dutch and roaming users of the network. The anonymized CDR microdata were processed by a specialized intermediary company, according to queries specified by Statistics Netherlands. As agreed, only aggregated results were forwarded to Statistics Netherlands in order to protect privacy.

Several uses for official statistics were studied, including inbound tourism (Heerschap et al., 2014) and daytime population (Tennekes and Offermans, 2014). Figure 3.3-1 shows an example of CDR data applied to inbound ‘tourism’: the activity of mobile phones assigned to one of the European countries involved in the Europa League Final, held on 15 May 2013 in the Amsterdam Arena. The most striking finding is that the activity of mobile phones from that particular country around that date is much higher than in the remainder of the period studied. It nicely illustrates tourists visiting our country for a particular event during a very short period (Heerschap et al., 2014). It is highly likely that the majority of these visitors are not included in the official, accommodation-based tourism statistics.
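The aggregate-only arrangement described above can be illustrated with a small sketch: the intermediary counts distinct anonymized devices per day and per country of the home network, and forwards only cells that are large enough. This is a minimal illustration under stated assumptions, not the actual queries used; the record layout, the function name and the suppression threshold of 15 are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical disclosure-control threshold; the paper does not state one.
MIN_CELL_SIZE = 15


def daily_active_devices(
    records: Iterable[Tuple[str, str, str]],
    min_cell_size: int = MIN_CELL_SIZE,
) -> Dict[Tuple[str, str], int]:
    """Aggregate CDR-like records to counts of distinct devices.

    Each record is (date, home_country, anonymized_device_id).
    Returns a mapping (date, home_country) -> number of distinct devices,
    omitting every cell with fewer than `min_cell_size` devices so that
    only aggregated, non-disclosive results leave the processor.
    """
    devices = defaultdict(set)
    for day, country, device_id in records:
        # A set keeps each device once, however many calls it made.
        devices[(day, country)].add(device_id)

    return {
        key: len(ids)
        for key, ids in devices.items()
        if len(ids) >= min_cell_size
    }


# Made-up records; "XX" stands in for an unspecified country of origin.
toy_records = [
    ("2013-05-14", "XX", "dev-a"),
    ("2013-05-15", "XX", "dev-a"),
    ("2013-05-15", "XX", "dev-b"),
    ("2013-05-15", "XX", "dev-a"),  # second call by the same device
]
toy_counts = daily_active_devices(toy_records, min_cell_size=2)
```

Plotting such daily counts for one country of origin over a longer period would make a short-lived peak around an event visible, in the spirit of Figure 3.3-1.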
