Just before Christmas, the National Audit Office published its long-awaited report on Evaluation in Government. I was part of the LSE team that contributed to that report, specifically by assessing the quality of evaluations. As this forms part of the wider remit of the What Works Centre, I thought I’d highlight both the report (for those not aware of it) and some of our central findings.
The team – Steve Gibbons, Sandra McNally and I – looked at 35 UK government evaluations covering active labour markets, business support, education and local economic growth (including regeneration). We picked these four because policies in these areas are targeted differently (e.g. some at firms, some at individuals) – which then helps illustrate how evaluation deals with some crucial methodological issues. These are also areas where external perception of quality varies markedly – and where our team had considerable expertise.
We were asked to highlight the strengths and weaknesses of the evaluations, to assess their robustness and their usefulness to policy makers, and to suggest improvements. We used the Maryland Scientific Methods Scale to rank studies – the same tool we’re using for systematic reviews in the What Works Centre.
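For readers less familiar with it, the sketch below gives my rough paraphrase of the scale’s five levels, along with the common rule of thumb that level 3 is the minimum for a credible impact estimate. The wording and the little helper function are illustrative only, not the formal definitions used in the reviews.

```python
# Rough paraphrase of the five Maryland Scientific Methods Scale levels.
# Illustrative wording only; the formal definitions are more precise.
SMS_LEVELS = {
    1: "Simple cross-sectional or before-and-after comparison, no comparison group",
    2: "Treated vs untreated comparison with some controls, but no convincing counterfactual",
    3: "Before-and-after change in treated units compared with an untreated comparison group "
       "(e.g. difference-in-differences)",
    4: "As level 3, with a more credible counterfactual (e.g. exploiting quasi-random "
       "variation in who gets treated)",
    5: "Explicit randomisation of treatment (a randomised controlled trial)",
}

def credible_for_impact(level: int) -> bool:
    """Level 3 is commonly treated as the minimum for credible impact estimates."""
    return level >= 3
```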
What did we find? First, the quality of evaluation varies widely both within and across these four policy areas. For example, we found evidence of high-quality evaluations in active labour markets and education. In contrast, evaluations of business support and local economic growth were considerably weaker. Second, that quality range really matters for policymakers. On the basis of the reports we saw, we judged that none of the business support or local economic growth evaluations provided convincing evidence of policy impacts. In contrast, 6 out of 9 education reports and 7 out of 10 labour market reports were good enough to give some confidence in policy impacts.
How can policy evaluation get better? We think that using a control group (or a counterfactual) should be considered a necessary (although not sufficient) requirement for robust impact assessment and value for money calculations. Business support and spatial policy evaluations, in particular, could make better use of administrative data and improved evaluation techniques to construct these counterfactuals.
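To make the counterfactual point concrete, here is a minimal difference-in-differences sketch of the sort of comparison we have in mind, using a hypothetical administrative panel of firms. The file name, column names and programme start year are all invented for illustration; this is not the specification used in any of the evaluations we reviewed.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical administrative extract with columns: firm_id, year, employment, supported
df = pd.read_csv("admin_firm_panel.csv")

# Firms flagged as 'supported' are the treated group; purely for illustration,
# the programme is assumed to start in 2010.
df["treated"] = df["supported"].astype(int)
df["post"] = (df["year"] >= 2010).astype(int)

# The treated:post interaction is the difference-in-differences estimate:
# the change in employment for supported firms relative to the change for
# comparable unsupported firms over the same period.
model = smf.ols("employment ~ treated + post + treated:post", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm_id"]}
)
print(model.summary())
```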
We also make some more technical recommendations: about how to handle policies where people can opt in; about improving inference (i.e. how certain we are about the effects of policy); and about the interpretation of impact estimates (do they apply to everyone ‘treated’, or just a subset?). More care should be taken to distinguish between the analysis of programme delivery (processes) and the assessment of impact and value for money (outcomes). Finally, every impact evaluation needs a technical appendix written for a specialist audience.
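On that interpretation point, a toy simulation helps show why opt-in matters: when the people who stand to gain most are the ones who choose to take part, the average effect on participants can look very different from the average effect across everyone. The numbers below are made up purely to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000

# Made-up individual-level treatment effects: the policy helps some people a lot,
# others very little.
effect = rng.normal(loc=1.0, scale=2.0, size=n)

# Self-selection: people who stand to gain more are more likely to opt in.
prob_opt_in = 1 / (1 + np.exp(-effect))
opted_in = rng.random(n) < prob_opt_in

ate = effect.mean()            # average effect if everyone were treated
att = effect[opted_in].mean()  # average effect on those who actually took part

print(f"Average treatment effect (everyone):         {ate:.2f}")
print(f"Average effect on the treated (opt-in only): {att:.2f}")
```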
Overall, our verdict is mixed. In some areas we found a lot of very good evaluations, and many others that could be easily improved. In others, notably evaluations of business support and spatial policies, the picture is more worrying. Of course, these are also some of the hardest kinds of policies to evaluate robustly, but they are also central elements in many councils’ and LEPs’ growth strategies. Through our reviews, user engagement and demonstrators we will be tackling these issues head-on in the months to come.