Having good data is crucial to evaluate a policy or programme, and new data sources and data science analytical tools have much to offer. In my previous post, I laid out some big data basics, and introduced some of the resources LEPs and local authorities can use. This post looks at how these might be applied to real-world evaluations, and the challenges involved.
Local policymakers, especially in big cities, are already starting to use big data sources and analytics for planning and management (see this FT story for an example). Evaluations exploiting big data are still thin on the ground, however, and that’s a shame. So here are some pointers.
First, big data will not evaluate the policy for you. Fancy data is no substitute for good research design. Evaluators still need to follow our advice and refer to a theory of change, work out a framework to test that theory, then apply appropriate data. In the tech incubator example above, leveraging big data is a crucial part of the research design; but it’s the RCT element that allows us to identify impacts. Data is not the answer, though it can help us find it.
Second, take time to investigate sources, access and cost. Commercial big datasets aren’t always easily affordable to the public sector, and aren’t always designed for evaluation purposes. Think about access, and how this might change over time. Central and local government already produce large amounts of information for monitoring purposes. Audit what data you already hold; look at what commercial providers will give developers for free, for instance through APIs; collaborate with colleges and universities who are keen to provide data science students for work placements, and who can help you unlock these resources. Groups like the Open Data Institute and NESTA can also help. Academics also have privileged access to many administrative big data sets through the UK Data Service and its online secure lab.
Third, be realistic about data quality. Commercial big data is still a complement to administrative datasets, rather than a substitute. In particular, many commercial resources aren’t as complete as they claim; for example, social media data is only as good as its user base, which is often skewed towards younger, whiter, better-off social groups. Similarly, Crunchbase is a very promising resource for understanding local clusters, but as some colleagues and I are finding, it still has a lot of gaps. So as a minimum, substantial cleaning and validation is often required on the raw data: this paper walks you through some of the work required with one UK example. In the jargon, you also need to think about the ‘implicit sampling frame’ – for example, if Twitter data comes from Twitter users (who represent a particular set of socio-economic groups), what can you learn from your results?
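To make the cleaning point concrete, here is a minimal sketch of the kind of validation pass a raw commercial dataset often needs before analysis. The column names and cleaning rules are illustrative assumptions, not the schema of any particular provider:

```python
import pandas as pd

def clean_firm_records(raw: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass for a scraped/commercial firm dataset."""
    df = raw.copy()
    # Standardise firm names so near-duplicates collapse together
    df["name"] = df["name"].str.strip().str.upper()
    # Drop exact duplicates, a common artefact of scraped sources
    df = df.drop_duplicates()
    # Flag, rather than silently drop, records with missing location data
    df["missing_postcode"] = df["postcode"].isna()
    return df

# Hypothetical raw extract with a near-duplicate and a gap
raw = pd.DataFrame({
    "name": [" Acme Ltd", "ACME LTD", "Beta plc"],
    "postcode": ["E1 6AN", "E1 6AN", None],
})
clean = clean_firm_records(raw)
```

Flagging rather than deleting incomplete records matters for the sampling-frame question above: the pattern of missingness is itself evidence about who the dataset does and does not cover.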
Think about whether you have the in-house capacity to do all of this. If not, find partners who can help you (see above), and consider whether you can trade data access for help with making sense of it.
Fourth, when you do have to collect your own information, make sure you can link it. As we have seen, it’s rare that one single dataset will give you everything you need, so patching several together is often necessary. And you will get far more value from your survey if you can also leverage other resources that already exist, especially if you can access these for free. Make sure you understand what linking variables exist. For firms, for example, Company Registration Numbers (CRNs) allow you to match your survey respondents to information in Companies House, to administrative data resources like the Business Structure Database, and to many commercial big data sets. And make sure your survey asks respondents for permission to match their (anonymised) details to other datasets.
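In practice this kind of linkage is often a join on the company number. The sketch below assumes illustrative field names; real sources such as Companies House or the Business Structure Database have their own schemas and access conditions:

```python
import pandas as pd

# Hypothetical survey responses, keyed on CRN, with linkage consent recorded
survey = pd.DataFrame({
    "crn": ["01234567", "07654321"],
    "consented_to_linkage": [True, True],
    "reported_turnover_m": [1.2, 0.4],
})

# Hypothetical administrative extract, also keyed on CRN
admin = pd.DataFrame({
    "crn": ["01234567", "09999999"],
    "employees": [25, 310],
})

# Left join keeps every survey respondent; the indicator column records
# whether each row matched, which is useful for auditing linkage rates
linked = survey.merge(admin, on="crn", how="left", indicator=True)
matched = linked[linked["_merge"] == "both"]
```

Checking the `_merge` indicator before analysis tells you what share of respondents actually linked, which is itself a data-quality diagnostic.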
Fifth, to get the most out of your data, you’ll also need some data science. You may already have people with the right skills in-house. If not, reach out to collaborators: the same people who can help you access datasets can usually help you crunch them. And if you’re already confident with code, this paper by Hal Varian (PDF here) is an excellent overview of how techniques like classification trees can provide intuitive, visual tools for exploring logic models and presenting results, in ways that will have impact.
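As a taster of the tree-based approach Varian describes, here is a minimal sketch using scikit-learn on synthetic data. The variables (a support grant and firm size predicting growth) are invented for illustration; the point is that the fitted tree prints as a readable set of if-then rules:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 200
grant = rng.integers(0, 2, n)   # did the firm receive support? (synthetic)
size = rng.integers(1, 50, n)   # employee count (synthetic)
# Synthetic outcome: growth occurs for supported firms under 25 employees
growth = ((grant == 1) & (size < 25)).astype(int)

X = np.column_stack([grant, size])
tree = DecisionTreeClassifier(max_depth=2).fit(X, growth)

# export_text renders the tree as plain-English decision rules,
# which is what makes it a useful presentation tool for non-specialists
print(export_text(tree, feature_names=["grant", "size"]))
```

A shallow tree like this can be shown directly to policymakers: each branch reads as a plain statement about which kinds of firm the programme appears to reach.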