Proceedings of Statistics Canada Symposium 2014
Beyond traditional survey taking: adapting to a changing world
Big Data as a Data Source for Official Statistics: experiences
at Statistics Netherlands
Piet J.H. Daas, Marco Puts, Martijn Tennekes, and Alex Priem1
More and more data are being produced by an increasing number of electronic devices physically surrounding us and on the internet. The large amount of data and the high frequency at which they are produced have resulted in the introduction of the term ‘Big Data’. Because these data reflect many different aspects of our daily lives, and because of their abundance and availability, Big Data sources are very interesting from an official statistics point of view. However, first experiences obtained with analyses of large amounts of Dutch traffic loop detection records, call detail records of mobile phones and Dutch social media messages reveal that a number of challenges need to be addressed to enable the application of these data sources for official statistics. These challenges, and the lessons learned during these initial studies, are addressed and illustrated by examples. More specifically, the following topics are discussed: the three general types of Big Data discerned, the need to access and analyse large amounts of data, how we deal with noisy data and look at selectivity (and our own bias towards this topic), how to go beyond correlation, how we found people with the right skills and mindset to perform the work, and how we have dealt with privacy and security issues.
Key Words: Big Data, Official statistics, Challenges, Lessons learned.
1. Introduction
In our modern digital era, data touch nearly every aspect of our lives, from the way we shop on the web and travel by car or public transport to the way we search for product information and communicate with friends and family. In addition, our whereabouts are captured by cameras, mobile phones and wireless local area networks. All these data are stored and can potentially be harvested. However, in their raw form these Big Data sources are not immediately valuable. One must be able to separate the signal from the noise, i.e. have statistical expertise in deriving information from large amounts of data, to extract their meaning. Here, knowledge of statistical inference from Big Data is needed (London Workshop, 2014). This is a relatively new area of expertise: the area of valid statistical analysis for Big Data is only just emerging (Fan et al., 2014). The challenge of these kinds of analyses is to extract the signal (if present) relevant for the topic of interest from a large and (very) noisy data set (Silver, 2010).
Big Data is a very interesting source for official statistics (Glasson et al., 2013) as it potentially enables the production of timely and highly relevant official figures at relatively low cost. How this can be achieved in practice is a topic of interest for many National Statistical Institutes. A number of challenges have been identified (more in section 2). For instance, many Big Data sources are composed of observational data and, as a consequence, have no well-defined target population, often lack structure and are of varying quality. This makes it difficult to apply traditional statistical methods based on sampling theory. However, not every Big Data source faces the same issues.
By studying a number of Big Data sources, i.e. road sensor data, call detail records of mobile phones and social media messages, the group of Big Data researchers at Statistics Netherlands is learning how to study these sources, finding out what works and what does not, and gaining valuable insight into the potential application of Big Data for official statistics. This paper provides an overview of these findings.
All authors are employees of Statistics Netherlands. Contact person: Piet Daas, CBS-weg 11 Heerlen, the Netherlands, 6412 EX (email@example.com). The views expressed in this paper are those of the authors and do not necessarily reflect the policies of Statistics Netherlands.
2. Challenges
A number of challenges have been identified that need to be addressed when starting to use Big Data for official statistics (Daas and Van der Loo, 2013; Glasson et al., 2013; Struijs et al., 2014). Below an overview is given of the main ones.
Statistical institutes typically do not own Big Data sources. A first challenge thus is to obtain access to relevant sources. This implies agreements with data owners and data processors, who have their own concerns regarding costs, confidentiality and other issues. However, they might also benefit from cooperating with statistical agencies, for instance by way of the quality feedback that National Statistical Institutes (NSIs) provide. Terms and conditions have to be negotiated that are acceptable to both official statisticians and data providers.
Privacy protection of individuals is imperative, but familiar approaches do not always work when dealing with Big Data. Moreover, when the legal situation is not clear, statisticians may have to fall back on ethical principles. Of critical importance is the public perception of any use of Big Data, as this has a direct impact on trust in official statistics. Concerns have been heightened by the revelations that intelligence agencies are among the most active Big Data users.
Many Big Data sources are composed of event-driven observational data which are not designed for traditional statistical analysis. They lack well-defined target populations, data structures and quality guarantees. This makes it hard to apply statistical methods based on sampling theory (Daas and Puts, 2014a). For example, assessing selectivity issues is challenging (Buelens et al., 2014). Since an increasing number of Big Data sources are text-based or composed of images, the need to extract information from these kinds of ‘data’ sources increases. This calls for information extraction methods, such as text mining and machine learning techniques, that are not yet very familiar to official statisticians, although their relevance was already identified several years ago (Fyhrlund et al., 2005; Saporta, 2000).
Extracting statistical meaning from Big Data sources is not easy. A tweet, a phone call or a car passing a detection loop all relate to persons, but how to interpret these signals is far from obvious. For example, the interpretation of mobile phone data is hampered by several issues: people may carry multiple phones or none, children use phones registered to their parents, phones may be switched off, etcetera. For social media messages, similar issues may arise when trying to identify characteristics of their authors. Remedies like deriving the gender and age of Twitter users from their choice of words appear feasible (Nguyen et al., 2013); but a lot still needs to be done (Daas and Burger, 2014).
An obvious challenge is the processing, storage and transfer of large data sets. Technological advances in the area of High Performance Computing may partly solve these issues. Having data processed at the source, which prevents the transfer of large data sets and the duplication of storage, may also be considered (Hager and Wellein, 2010). The technological challenges include security mechanisms, which rule out, for example, cheap cloud-based solutions for NSIs.
2.6 Continuity
Typically, official statistics take the form of time series. For many users, the continuity of these series is of the utmost importance. Many Big Data sources, however, have only recently emerged, are ever evolving and may disappear as quickly as they rise. This poses a risk for continuity and a need for a more flexible way of working.
3. Big Data studies
3.1 Sources
In this chapter, we discuss three typical examples of Big Data research conducted at Statistics Netherlands. Other Big Data related studies performed at our office include internet robots, scanner data and satellite images. Further opportunities like analysing financial transactions are currently being studied, the first challenge often being to get access to the data. Most of the above examples are still in the research phase, apart from scanner data which has been in production for ten years now. Internet robots for the housing market are on the verge of being implemented in production. Note that administrative data is usually not considered as Big Data, but the larger administrative sources like the population register, VAT data and wages and salaries records could be interpreted as such. Looking at these more traditional sources from a Big Data point of view may provide new insights.
3.2 Road sensor data
In the Netherlands, there are more than 60,000 road sensors, of which 20,000 are positioned on the Dutch highways. These sensors detect the number of passing vehicles in various length classes each minute. This results in a total of 230 million records a day for the highway sensors alone. The data are collected and stored by the National Data Warehouse for Traffic Information (NDW, www.ndw.nu/en/), a government body which provides the data to Statistics Netherlands. Since the data cannot be related back to individual vehicles, privacy concerns do not apply.
The latter makes this data set attractive for experimentation. The most important issue we ran into while studying road sensor data was that the quality of the data fluctuates tremendously. For some sensors, data for many minutes are not available and, because of the stochastic nature of the arrival times of vehicles at a road sensor, it is hard to directly derive the number of vehicles missing during these minutes. For this purpose, an adaptive filter was developed that is tuned to the stochastic behaviour of the arrival times of the vehicles at the sensor (Puts et al., 2014). The quality of the data varies not only per minute but also per day (Fig. 3.2-1). By correcting for missing data and combining the daily profiles provided by sensors on the same road sections, the coverage and quality of the data are improved. In this way we are able to make traffic indices that describe the regional situation, at the NUTS-3 level, on the Dutch roads. Combining these regional findings gives a very good impression of the state of road traffic in the country (Daas et al., 2014).
Figure 3.2-1 Daily profiles of a road sensor on the IJsselmeer dam (“Afsluitdijk”) during 196 subsequent days.
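To make the two correction steps described above more concrete, the following minimal Python sketch combines the minute profiles of several sensors on one road section and then fills the remaining gaps. It is an illustration under stated assumptions only: the function names, the per-minute median and the linear interpolation are hypothetical choices made here, and the interpolation is a simple stand-in that ignores the stochastic arrival behaviour; it is not the adaptive filter of Puts et al. (2014). The example counts are made up.

```python
from statistics import median
from typing import List, Optional, Sequence


def section_profile(sensor_profiles: Sequence[Sequence[Optional[float]]]) -> List[Optional[float]]:
    """Per-minute median across sensors on one road section, skipping missing values.

    A minute stays None only if every sensor misses it (assumed combination rule).
    """
    combined: List[Optional[float]] = []
    for minute_values in zip(*sensor_profiles):
        present = [v for v in minute_values if v is not None]
        combined.append(median(present) if present else None)
    return combined


def fill_gaps(counts: Sequence[Optional[float]]) -> List[float]:
    """Fill missing minutes by linear interpolation between observed neighbours.

    Gaps at the start or end copy the nearest observed value.
    """
    observed = [i for i, c in enumerate(counts) if c is not None]
    if not observed:
        raise ValueError("no observed minutes to interpolate from")
    filled = list(counts)
    # Leading and trailing gaps: repeat the first or last observed count.
    for i in range(observed[0]):
        filled[i] = counts[observed[0]]
    for i in range(observed[-1] + 1, len(counts)):
        filled[i] = counts[observed[-1]]
    # Interior gaps: straight line between the two surrounding observations.
    for left, right in zip(observed, observed[1:]):
        step = (counts[right] - counts[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = counts[left] + step * (i - left)
    return filled


# Made-up four-minute excerpts from two sensors on the same section.
sensor_a = [10, None, None, 12]
sensor_b = [11, None, None, None]
combined = section_profile([sensor_a, sensor_b])
print(combined)             # [10.5, None, None, 12]
print(fill_gaps(combined))  # [10.5, 11.0, 11.5, 12]
```

Combining sensors before interpolating means a gap in one sensor is first covered by its neighbours, so interpolation is only needed for minutes that no sensor on the section observed.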
3.3 Mobile phone data
Nowadays, people carry mobile phones with them everywhere and use their phones often throughout the day. To manage the phone traffic, a lot of data needs to be processed by mobile phone companies. These data are very closely associated with the behaviour of people; behaviour that is of interest for official statistics. For example, the traffic is relayed through geographically distributed phone masts, which enables determination of the location of phone users. The relaying mast, however, may change several times during a call. Through a three-party contract, Statistics Netherlands obtained access to call detail records (CDR) data from a Dutch mobile phone company with a market share of approximately one third of the Dutch mobile phone market. The CDR data amount to 115 million records a day and contain information on both Dutch and roaming users of the company's network. The anonymized CDR micro data were processed by a specialized intermediate company, according to queries specified by Statistics Netherlands. As agreed, only aggregated results were forwarded to Statistics Netherlands to protect privacy. Several uses for official statistics were studied, including inbound tourism (Heerschap et al., 2014) and daytime population (Tennekes and Offermans, 2014). Figure 3.3-1 shows an example of CDR data applied to inbound ‘tourism’. The figure shows the activity of mobile phones assigned to one of the European countries involved in the Europa League final, held on 15 May 2013 in the Amsterdam Arena. The most striking finding is that the activity of mobile phones from that particular country around that date is much higher than the activity of such phones in the remainder of the period studied. It nicely illustrates tourists visiting our country for a particular event during a very short period (Heerschap et al., 2014). It is highly likely that the majority of these visitors are not included in the official, accommodation-based tourism statistics.
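The kind of aggregated query described above can be illustrated with a small Python sketch. It counts the distinct roaming devices per day for one home country and compares the busiest day with the other days. Everything in it is an assumption made for illustration: the record layout (day, device identifier, home country), the function names, the use of distinct devices as the measure of ‘activity’, and the made-up records with the placeholder country code "XX". It does not reproduce the actual queries run by the intermediate company.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (day, anonymized device id, home country code).
Record = Tuple[str, str, str]


def daily_active_devices(records: Iterable[Record], country: str) -> Dict[str, int]:
    """Count distinct devices per day for one home country (aggregates only)."""
    devices_by_day = defaultdict(set)
    for day, device_id, home_country in records:
        if home_country == country:
            devices_by_day[day].add(device_id)
    return {day: len(devices) for day, devices in sorted(devices_by_day.items())}


def peak_vs_baseline(daily_counts: Dict[str, int]) -> Tuple[str, float]:
    """Return the busiest day and its count relative to the median of the other days."""
    peak_day = max(daily_counts, key=daily_counts.get)
    others = [n for day, n in daily_counts.items() if day != peak_day]
    if not others:
        raise ValueError("need at least two days to form a baseline")
    return peak_day, daily_counts[peak_day] / median(others)


# Made-up records; repeated events of one device on one day count once.
records = [
    ("2013-05-14", "a", "XX"), ("2013-05-14", "a", "XX"),
    ("2013-05-15", "a", "XX"), ("2013-05-15", "b", "XX"),
    ("2013-05-15", "c", "XX"), ("2013-05-15", "d", "XX"),
    ("2013-05-16", "b", "XX"), ("2013-05-16", "c", "XX"),
    ("2013-05-15", "z", "NL"),
]
counts = daily_active_devices(records, "XX")
print(counts)                    # {'2013-05-14': 1, '2013-05-15': 4, '2013-05-16': 2}
print(peak_vs_baseline(counts))  # peak day '2013-05-15', ratio 4 / 1.5
```

Only the per-day totals leave the function, which mirrors the arrangement in which micro data stay with the processor and aggregated results alone are forwarded.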