As we’ve made clear in our How To Evaluate series, decent data is a bedrock of good evaluation. We need to know about the people, firms or places affected by the policy we’re testing, as well as those in control groups. And crucially, we need to be able to follow treatment and control groups over time.
So how do we collect this? It’s tempting to gather it yourself. But this is expensive, especially in large-scale programmes, and may be hard to do in practice (as I’ll discuss below). Often, much of the information Local Authorities and LEPs need may already be available through secondary datasets. Rather than expensively re-inventing the wheel, local policymakers are usually better off working with what’s already out there, and gathering bespoke information only when essential.
Getting the most out of secondary data is easier than you think, though it requires going beyond the usual sources for local government, such as Nomis, Neighbourhood Statistics, or the new ONS website. In particular, different kinds of ‘big data’ can help, from the private sector, government and even from the local environment.
In this blog I’ll take a look at the concept of ‘big data’ and the kinds of datasets and analytical tools that local authorities and LEPs can draw on. In a later blog I’ll discuss how these might be applied to real-world evaluations, and some of the challenges involved.
So what is ‘big data’? A common way to talk about it is with reference to the Four V’s. The first three are: volume (massive, often millions or billions of observations); velocity (available in real time or close to it); and variety (internet search records, company customer data or large government datasets). The fourth V is veracity – raw data is often ‘unstructured’ in the jargon. At the least, it needs cleaning and validating before it’s good to go. More worryingly, some datasets may be incomplete, systematically missing out some places and people. More on all this in the next post.
For our purposes, there are three flavours of big data. The first is commercial: business-built datasets available for free or low-ish cost to researchers and government, typically through APIs. Some of this is ‘raw’ output from sites like Flickr, Twitter or Yelp; some of it is packaged up, as with Google Trends. Academics are already using these sources, most commonly to explore urban amenities or quality of life [PDF] that regular data can’t see.
Paid-for offerings often use machine learning to turn raw data into modelled variables. The 2016 Tech Nation report, which looks at UK technology clusters, used machine-learnt data from Growth Intelligence alongside other commercial datasets on tech labour markets (via online job ads) and professional communities (via websites for hosting code, and sites for organising meetups). [Full disclosure: I’m working with Growth Intelligence data at the moment, and Google has funded research I’ve done in the past.]
The second kind of big data comes from towns and cities themselves, usually from sensor networks embedded in buses, trains or light rail, as well as static objects in the urban environment. Some of this is already being collected by local government and made public; for example, travel apps like Citymapper are powered by public datasets (you can see some examples in the London Datastore).
The third flavour is administrative data. This may surprise you, but governments already collect huge amounts of information on people and firms through official surveys like the National Pupil Database, Annual Population Survey and Business Structure Database, which covers 99% of UK enterprises. These are available through the UK Data Service. These massive datasets are increasingly available in raw form, and can often be linked together. What’s more, they are generally free to policymakers and academic researchers. ‘Administrative big data’ on firms is already widely used by academics. For example, I’m currently working with a panel of 10 million observations built from the Business Structure Database, Companies House and Growth Intelligence data. And data on individuals is getting better. As Henry points out here, we can now connect DWP, HMRC and DfE datasets, which will help us test the impact of apprenticeships and employment training programmes far better than previously.
What can these datasets get you that survey data can’t? Let’s take a couple of examples. First, a common problem in many evaluations is that we need to follow people or firms some time after the policy has finished, and response rates to surveys often drop off markedly (even if people are paid to respond, as in this study). The worse this ‘attrition’ problem gets, the harder it is to assess the impact of the policy. To get around this, researchers in one US study used big administrative datasets to follow participants in a labour market RCT many months after the intervention. Similarly, in a new RCT with a UK tech incubator, we plan to use industry big data resources such as Crunchbase to follow participants in ways that would be impossible otherwise.
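To make the attrition point concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the participant IDs, the treatment and control arms, the outcomes and the function name `coverage_by_arm` are all hypothetical, and it assumes survey and administrative records share a common identifier, which in practice usually takes careful matching. It simply compares how much of each group we can still observe at follow-up when relying on a survey versus linked administrative records.

```python
# Hypothetical illustration: all IDs, arms and outcomes below are made up.
participants = {
    "p1": "treatment", "p2": "treatment", "p3": "treatment",
    "p4": "control", "p5": "control", "p6": "control",
}

# Follow-up survey: only some participants responded (attrition).
survey_outcomes = {"p1": 1, "p4": 0, "p5": 1}

# Administrative records (say, employment status), assumed to be keyed
# by the same ID as the participant list.
admin_outcomes = {"p1": 1, "p2": 1, "p3": 0, "p4": 0, "p5": 1, "p6": 0}


def coverage_by_arm(participants, outcomes):
    """Share of each arm for which a follow-up outcome is observed."""
    totals, observed = {}, {}
    for pid, arm in participants.items():
        totals[arm] = totals.get(arm, 0) + 1
        if pid in outcomes:
            observed[arm] = observed.get(arm, 0) + 1
    return {arm: observed.get(arm, 0) / totals[arm] for arm in totals}


survey_coverage = coverage_by_arm(participants, survey_outcomes)
admin_coverage = coverage_by_arm(participants, admin_outcomes)
print("survey:", survey_coverage)
print("admin: ", admin_coverage)
```

In this toy example the survey loses more of the treatment group than the control group, which is exactly the situation that makes impact estimates hard to trust, while the linked administrative records cover everyone. Real linked data will rarely be this complete.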
Second, big data can also shed light on things that we can’t easily see. Alongside the Twitter / Flickr / Yelp examples above, in a new paper Ed Glaeser and colleagues champion the use of Google Street View data to model neighbourhood ‘quality’, and predict local property prices. Such clever analytics can reveal further insights – for example, machine learning can model emerging sectors that don’t show up in SIC codes, such as ‘digital economy’ firms (see here), or provide alternative measures of innovation such as company product launches. (However, Glaeser and co. don’t dwell on some of the challenges in bringing these kinds of data to the table: I’ll discuss more of these in the next post.)
So crucially, we are talking about more than datasets here – as Hal Varian explains [PDF here], the data science field also covers data management tools (platforms such as Hadoop, which handle massive volumes of information), access tools (such as APIs and web crawlers) and analytical techniques (notably, machine learning routines). Some of this is way outside what we need for evaluation. But overall, it’s this combination of datasets, access platforms and analytics that offers real potential for local policymakers.