06 Sep 2021

Steven J. Klees

The Limits of Social Science Empiricism and Evidence-Based Policy: GEEAP, SABER, Dashboards and More

Abstract: In this NORRAG Highlights, Steven J. Klees, Distinguished Scholar-Teacher and Professor of International Education Policy at the University of Maryland, looks at the limitations of “Evidence Based Policy” and “Best Practices” through the examination of the recent NORRAG blog of Joel Samoff, as well as his own study on the World Bank’s Systems Approach for Better Education Results (SABER) initiative. He argues that attempts to take the politics out of public policy and replace it with scientific evidence, such as SABER and GEEAP, are flawed, as public policy is centered around debate and participation rather than evidence and universal agreement.

Over the last decade or two, the term “evidence-based policy” has become ubiquitous. In some ways, this rather common-sense idea is a legacy of the post-World War II efforts to develop what were called the “policy sciences.” These held the promise and premise of taking some of the politics out of public policy choices by using science and expertise to discover facts. Most particularly, these were causal facts that measured the impact of policies. Politics could still argue about the values of and values in policies, but science could and would determine best practice, at least insofar as the impact of policies was concerned. The policy sciences have been critiqued since their inception, especially the rational model which dominated, but their simplistic logic has persisted. The questions I ask here are: Whose evidence and what kind of evidence should count?

Joel Samoff raises these questions in his recent NORRAG blog about GEEAP, the World Bank’s and UK Aid’s new Global Education Evidence Advisory Panel. Its sponsors purport to have created an independent body composed of “leading experts from around the world” to “provide much needed guidance to help busy policymakers make sense of the evidence.” Their first report is titled “Cost-Effective Approaches to Improve Global Learning: What Does Recent Evidence Tell us are ‘Smart Buys’ for Improving Learning in Low- and Middle-Income Countries?”

Samoff offers three critiques: the general problems with the search for “best practices;” the focus on RCTs as the best evidence; and the treatment of education as a commodity. I will elaborate on these shortly, especially the first two, but to begin to see the problem, it is instructive to look at some of GEEAP’s recommendations. To me, a number of their recommendations contradict what I know from my reading of relevant research and common sense. For example, listed as “good buys” are scripted lesson plans for teachers and ability grouping or tracking for students. Both are and should be controversial. Listed as a “promising” buy are teacher accountability and incentive reforms, also controversial. And listed as “bad buys” are investing in textbooks, reductions in class size, increases in teacher salaries, and cash transfers to students and their families, all of which are seen by many as essential to improving education. All these recommendations are qualified briefly, but the gist to me is they closely follow the Bank’s neoliberal script under the guise of an “independent body” using supposedly objective evidence.

Let me elaborate by switching gears and reporting on a study I did with colleagues on the World Bank’s Systems Approach for Better Education Results (SABER) initiative. SABER began in 2011 to amass “the best available evidence on what works” in order to “assess and benchmark systems against global best practices.” SABER has been a massive effort, producing over 16,000 indicators of what the Bank considers best practice and, as of 2017, over 400 applications yielding over 190 country reports covering more than 130 countries. We organized our overall critique of SABER around Gita Steiner-Khamsi’s notion of three fundamental problems or “façades” in the pursuit of best practices in education. These are the façades of universality, rationality, and precision (and they closely parallel some of Samoff’s critiques of GEEAP).

A fundamental problem with SABER (and GEEAP) is the universality of its recommendations. The Bank has long been criticized for its “one-size-fits-all” approach, and SABER implies that all countries should try to attain “advanced” practices. SABER’s de-contextualized universal ranking of literally thousands of practices is so problematic as to invalidate the approach for this reason alone.

The façade of rationality was previewed by my point about GEEAP’s neoliberal ideology. It is not that SABER or GEEAP recommendations are irrational, but that the “guise of scientific rationality” is used for “political manipulation” (Steiner-Khamsi, p. 21), i.e., to mask the extent to which recommendations of “best practices” are based on ideology, not evidence. The Bank is an ideological institution, really a right-wing think tank along the neoliberal lines of the U.S.’s Heritage Foundation or the American Enterprise Institute but with much more power and operating under the pretense of objectivity. Instead of evidence-based policy, too often we have policy-based evidence, in which policy biases are used to cherry-pick supporting evidence.

The façade of precision, what Steiner-Khamsi (p. 21) sees as reifying “the uncontested authority attached to numbers,” is perhaps the most important in debunking SABER, GEEAP, and other attempts to make evidence-based policy. To assess quantitatively the impact of an intervention, there are two ways to rule out confounding variables – statistical controls and experimental controls. Both are fundamentally problematic in theory and in practice. To trust in statistical controls via some form of regression analysis, you cannot just include a few ad hoc control variables; three conditions must hold: include all variables that affect the dependent variable, measure them correctly, and specify the proper functional form. These conditions never hold, and the result is that different studies come to different conclusions, rather arbitrarily, as a result of the idiosyncrasies of the variables, measures, and models used. In education, for example, hundreds of input-output studies (with student test scores as the dependent variable) offer no consistent findings (Hanushek, 2004; Klees, 2016).
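To make the first of these three conditions concrete, here is a minimal simulated sketch in Python of what happens when a variable that affects the outcome is left out of a regression. Everything in it is a hypothetical assumption chosen for illustration, not data from any study cited here: the variable names, the "true" effect of 1.0, and the coefficients 0.8 and 2.0. It does not address the other two conditions (measurement error and functional form).

```python
import random


def mean(values):
    return sum(values) / len(values)


def cov(a, b):
    """Population covariance of two equal-length lists."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / len(a)


def simulate(n=20000, seed=0):
    """Generate hypothetical data in which z affects both x and y.

    z: an unobserved confounder (hypothetical, e.g. family background)
    x: the intervention or input being evaluated
    y: the outcome; the assumed true effect of x on y is 1.0
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [0.8 * zi + rng.gauss(0, 1) for zi in z]
    y = [1.0 * xi + 2.0 * zi + rng.gauss(0, 1) for xi, zi in zip(x, z)]
    return x, z, y


def slope_without_control(x, y):
    """OLS slope of y on x alone (z omitted)."""
    return cov(x, y) / cov(x, x)


def slope_with_control(x, z, y):
    """OLS coefficient on x when z is also included.

    Solves the 2x2 normal equations for y ~ x + z using centered moments.
    """
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    det = sxx * szz - sxz ** 2
    return (sxy * szz - szy * sxz) / det


if __name__ == "__main__":
    x, z, y = simulate()
    print("estimate with z omitted: ", round(slope_without_control(x, y), 2))
    print("estimate with z included:", round(slope_with_control(x, z, y), 2))
```

Under these assumed numbers, the regression that omits z attributes nearly twice the true effect to x, while the one that includes z recovers a value close to 1.0. In real studies the analyst cannot know which variables are missing, which is why the condition is so hard to satisfy.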

Experimental controls via RCTs have been touted as a better strategy for impact assessment, indeed as the “gold standard” of research methods. However, they have been strongly critiqued for their lack of generalizability because they do not account for context (Deacon and Cartwright, 2016; Edwards, 2018). The validity of their findings is also suspect as too often control groups are not comparable to treatment groups and effect sizes are small (Pogrow, 2017; Samoff, Leer and Reddy, 2016). In practice, RCTs very often come to inconsistent and divergent conclusions (Evans and Popova, 2016). What this all comes down to again is that the evidence supporting the impact of policies is cherry-picked and “best practice” and “what works” are in the eye of the beholder.

Before discussing further the implications of this, let me mention the Bank’s recent successor or complement (it’s not yet clear which) to SABER, its Global Education Policy Dashboard (GEPD). The GEPD aims to offer “timely, cost-effective, comprehensive, and contextualized new information on the main determinants of learning outcomes throughout an education system.” This seems to be more of the same – a cheaper, streamlined version of SABER perhaps, but like SABER and GEEAP another mechanism for selling the Bank’s version of what best education practice should be around the globe.

It should be noted that even if universality, rationality, and precision were not problems, all three of these mechanisms focus their best practice recommendations only on the impact of interventions on student test scores, particularly on literacy and numeracy. If we are interested in education furthering other goals, GEEAP, SABER, and the Dashboard may well move us in the wrong direction. Education policies should be chosen that promote the broader well-being of children.

But universality, rationality, and precision are fundamental problems. We do not know scientifically – and perhaps cannot ever know – what interventions are best for improving either narrowly conceived test scores or the broader well-being of children. Yet, every day, education policymakers have to make decisions with those outcomes in mind. So, what do we do?

First, we cannot place our trust in the recommendations of GEEAP, SABER, GEPD, and the like. Their recommendations are ideologically biased and necessarily idiosyncratic. Even if one wanted to focus education on narrow measures of cognitive achievement, there is much research that shows that what GEEAP labels “bad buys” are actually good investments. But it is absurd to base education policy decisions only on their impact on such narrow measures. For example, even if it were true that students can learn basic skills in large classes, we need much smaller classes to develop the whole child. The much ballyhooed idea of “alignment” of all education system components and decisions to “learning” that the Bank has been pushing is nonsensical when you are defining learning narrowly and are basing it on biased research.

Indeed, we already know a lot about what educational investments are needed. In addition to much smaller classes, we need well-educated and well-motivated teachers. Research trying to connect teacher salary increases to test scores is irrelevant. Today, in too many countries, teaching is an occupation of last resort. We need to pay teachers well to turn it into the respected profession it once was. We need schools and classrooms to be attractive places for learning to take place. We need school feeding programs to combat widespread malnutrition. Every child in the world should have such an education. It is barbaric that they don’t.

The main response to this argument is that we don’t have the resources. But clearly, we do; the world just chooses to spend them elsewhere. The virtuoso, mostly economist, technicians will say that absent sufficient resources, we still have to decide whether to put limited resources in, for example, teacher salaries, reduced class sizes, or feeding programs. I agree, but my point here and in other work is that technical empirical virtuosity cannot contribute a lot to what must be a political choice in a context where the impact of any intervention is always contested.

For me, this view has at least two implications. First, instead of incorrectly offering policymakers the “most cost-effective” options, we need policy analysts or agencies that summarize the debates for and against different interventions. Second, these debates need to be made available and feed into more participatory, democratic, decision processes at local, national, and global levels. Deciding “what works” must become a democratic political activity, not a technical search for some elusive truth (Klees, 2017).

GEEAP, SABER, and GEPD all happen to be directed by the World Bank, but this is far from a Bank problem alone. Fundamentally, it is a social science problem. The social sciences have long been captured by physics envy. Some observer once pointed out that physics would fall apart if particles had intentions. We live on a planet with 7 billion human beings, all with different intentions living in very complex contexts. Our empirical ability to find regularities is very rudimentary. We do better in fields that have a base in the physical sciences, and we can generally believe claims such as that wearing masks helps protect against virus transmission or that human activity is causing global warming. Of course, even these are contested by some. But agreed-upon findings in the social sciences are much rarer – if indeed there are any significant ones. The promise of the policy sciences – that social science could give us clear facts – is belied in theory and in practice as I have argued here and elsewhere (Klees, 2020). We need to recognize that and be much more modest in our claims and much more aggressive in ensuring that our policy choices are made with widespread debate and participation.

About the author: Steven Klees is Distinguished Scholar-Teacher and Professor of International Education Policy at the University of Maryland. He is an Honorary Fellow and former president of the Comparative and International Education Society. He is the author of the book and blog titled The Conscience of a Progressive. His research interests are broadly concerned with the political economy of and alternatives to education and development.
