12 Mar 2019
D. Brent Edwards Jr.

Big Data? Big Deal: The Inability of Big Data to Escape the Limitations of Impact Evaluation

This NORRAG Highlights post is published by D. Brent Edwards Jr., Assistant Professor of Theory and Methodology in the Study of Education at the University of Hawaii, Manoa. In this post, the author examines the quantitative methods, known as “impact evaluations,” that are used to process and transform education-related data. Big data in the field of global education policy relies on these methods, namely regression analysis and randomized controlled trials (RCTs), despite their serious limitations. The author cautions that although big data and the methods of impact evaluation can be useful inputs for policy discussions, their limits must be understood, and, in the context of policy reform and global governance, decision-makers should also draw on other, qualitative strategies.

We are living in the age of big data, where school systems around the world are “drowning” in information, thanks to initiatives such as PISA and other assessments of student achievement which gather an incredible amount of test score, student, family, and school data [1]. Yet there has not been sufficient attention given to the quantitative methods that are used to process and transform this data in order to arrive at findings related to “what works.” Although the idea of big data and the ability to process it is receiving more attention, the underlying point here is that these new initiatives and advances in data collection are still dependent on methods that have serious limitations. These methods fall under the label of “impact evaluations” and can be grouped into two categories: regression analysis and randomized controlled trials (RCTs), with the latter being seen as the current gold standard for revealing causal links between interventions and their outcomes.

The Problems of Regression Analysis

The basic idea of regression analysis is that a researcher can identify the impact of a given program or variable on outcomes of interest (e.g., student test scores). In order to do so, however, the researcher needs to control for all other variables that might affect the outcome. Yet, in reality, it is rarely, if ever, possible to include all of the relevant variables as statistical controls. Moreover, the included variables must be measured appropriately and their interrelationships correctly modeled [2]. Despite the inability to meet these requirements, researchers continue to employ regression analysis. And it’s not as if researchers are unaware of these limitations. Take, for example, a paper by a senior economist with the OECD on the impact of computer use on PISA scores. The author writes, “a comparison between those who use the computer … and those who do not … would be legitimate only if the students in these two groups had similar characteristics. Unfortunately, there are … major reasons to expect that this is not the case” [3]. The author then goes on to explain that “a positive correlation between computer use and student performance may simply capture the effects of a better family background” (p. 1), the implication being that this is a problem because it is not possible to control for all the ways that “a better family background” affects, first, computer use and, subsequently, PISA scores. Or, in the words of the author, “observed differences in computer use … may reflect unobservable differences in students’ characteristics” (p. 2). Yet, in typical fashion, after conceding these points in the introduction, the researcher continues with the analysis and offers implications for policy.
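The omitted-variable problem can be made concrete with a small simulation. The numbers below are entirely hypothetical (they are not drawn from PISA data), but they show how an unobserved “family background” variable that drives both computer use and test scores produces a spurious positive regression coefficient when it is left out:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data: family background drives both computer use and scores.
background = rng.normal(size=n)                  # unobserved confounder
computer_use = 0.8 * background + rng.normal(size=n)
score = 0.5 * background + 0.0 * computer_use + rng.normal(size=n)
# By construction, the true effect of computer use on scores is zero.

def ols(columns, y):
    """Least-squares coefficients, with an intercept prepended."""
    X = np.column_stack([np.ones(len(y)), *columns])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression omitting family background: a spurious positive "effect".
naive = ols([computer_use], score)[1]
# Regression controlling for the confounder: estimate near the true zero.
controlled = ols([computer_use, background], score)[1]

print(f"naive estimate:      {naive:.3f}")       # biased upward, roughly 0.24
print(f"controlled estimate: {controlled:.3f}")  # close to the true 0
```

In real studies the confounder is, by definition, unobserved, so the “controlled” regression is not available; the naive estimate is all the researcher has.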

The Limitations of RCTs

As for RCTs, the idea is that one can determine the effect of an intervention by randomly assigning participants to control and treatment groups and then looking at the difference in the averages of the outcomes for the two groups. If done correctly, the differential results in the treatment group are attributable to the intervention, since randomization should make the control and treatment groups equal in terms of all relevant characteristics. However, in practice, one is rarely able to achieve a balanced and representative sample from the population of interest through random selection, since there are always differences along some variable or another. For example, when communities are assigned to treatment and control groups randomly, they may be equivalent in terms of some background factors, but many other variables related to students, their families, their communities, and the contexts in which they are embedded also affect outcomes. The two groups must be equivalent along all of these dimensions, because when they are not, we can no longer attribute the observed results to the intervention under study, which is the whole point of RCTs in the first place.
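A quick simulation illustrates how easily chance imbalance arises. Assuming a hypothetical trial that randomizes 20 communities into two arms of 10, and a single background variable such as community wealth, randomization alone frequently leaves the arms visibly unbalanced:

```python
import numpy as np

rng = np.random.default_rng(1)
n_communities, n_trials = 20, 1_000

imbalances = []
for _ in range(n_trials):
    wealth = rng.normal(size=n_communities)      # one background variable
    # Randomly assign half of the communities to treatment.
    assign = rng.permutation(n_communities) < n_communities // 2
    # Standardized mean difference between treatment and control arms.
    smd = abs(wealth[assign].mean() - wealth[~assign].mean()) / wealth.std()
    imbalances.append(smd)

share_imbalanced = np.mean(np.array(imbalances) > 0.25)
print(f"share of randomizations with SMD > 0.25 on one covariate: "
      f"{share_imbalanced:.0%}")
```

With a sample this small, more than half of the simulated randomizations leave a standardized mean difference above 0.25 on this single covariate; with many covariates, imbalance on at least some of them is near-certain.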

Another problem has to do with the unbiasedness of estimates. If we assume that randomization has worked, then “the difference in means between the treatments and controls is an estimate of the average treatment effect among the treated,” as Princeton economist Angus Deaton has noted [4]. However, while this fact is often the basis for interest in RCTs, what this focus fails to highlight is that randomization does not necessarily mean that coefficients of included variables are unbiased in each experiment, only that, on average, across replicated experiments, they are. As Deaton has also noted: “Unbiasedness says that, if we were to repeat the trial many times, we would be right on average. Yet we are almost never in such a situation, and with only one trial (as is virtually always the case) unbiasedness does nothing to prevent our single estimate from being very far away from the truth” [5].
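Deaton’s point can be made concrete with a simulation of many replicated trials (all numbers hypothetical): the estimator is centered on the true effect across replications, yet any single trial can miss badly.

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect = 1.0
n_per_arm, n_trials = 30, 2_000

estimates = []
for _ in range(n_trials):
    # Noisy individual outcomes in each arm of a small trial.
    control = rng.normal(0.0, 3.0, size=n_per_arm)
    treated = rng.normal(true_effect, 3.0, size=n_per_arm)
    estimates.append(treated.mean() - control.mean())
estimates = np.array(estimates)

# Unbiased on average across many replications...
print(f"mean across {n_trials} trials: {estimates.mean():.2f}")  # near 1.0
# ...but any single trial can land far from the truth.
print(f"spread of single-trial estimates (std): {estimates.std():.2f}")
print(f"share of trials off by more than 100% of the true effect: "
      f"{np.mean(np.abs(estimates - true_effect) > true_effect):.0%}")
```

In practice a researcher runs one trial, not two thousand, and gets one draw from this distribution, which is exactly Deaton’s warning.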

A third key weakness of RCTs is that they only provide estimates of mean treatment effects, since the outcomes of the individuals in the intervention and control groups are averaged. The difference in these means is then taken to be the average effect (hopefully benefit) that results for those who received the treatment. However, this average could be masking a situation where “nearly all of the population is hurt with a few receiving very large benefits” (Deaton, 2010, p. 439). This situation clearly presents problems for decision-makers, whether it be a doctor who must decide what is best for an individual patient or a policymaker who must make choices about public policy. The presence of an outlier in the treatment or control group could skew the overall averages and can lead to the implementation (or not) of a program that could be beneficial for everyone (or that could do harm).
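A toy example (with hypothetical numbers) makes the masking problem explicit: an intervention that slightly harms 95% of participants while greatly benefiting the remaining 5% still shows a positive average treatment effect.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10_000

# Hypothetical individual treatment effects: most people are slightly hurt,
# a small minority benefits enormously.
effects = np.full(n, -1.0)                       # 95% lose a little
winners = rng.choice(n, size=n // 20, replace=False)
effects[winners] = 30.0                          # 5% gain a lot

ate = effects.mean()
share_harmed = np.mean(effects < 0)
print(f"average treatment effect: {ate:+.2f}")       # prints +0.55
print(f"share of people harmed:   {share_harmed:.0%}")  # prints 95%
```

Reported alone, the positive average would suggest scaling up a program that in fact harms nearly everyone it reaches.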

Fourth, and finally, there are problems when it comes to the generalizability of RCT results. While RCTs strive to ensure that their results are internally valid, we are not automatically able to say anything about the transferability of findings to other locations. Interestingly—and ironically—making the case for the applicability of results to other locales requires that scholars resort to the kinds of comparisons and qualitative methods that RCT advocates sought to avoid in the first place. RCTs are designed to identify the causal impact of interventions for particular samples; their design does not ensure that the results are transferable. When it comes to policy transfer, RCTs are no more helpful than other forms of research, and may be less so, since RCTs are not generalizable and since the results do not tell us anything about program implementation.

It is suggested here that both the producers and consumers of RCTs should strive to obtain a critical understanding of the evaluation context, meaning an understanding which goes beyond the surface level to include socio-cultural, structural, historical, and political aspects. An understanding of this nature is seen as necessary in order to perceive the ways that these (often ignored) aspects of context can creep into and can affect not only the behavior of study participants but also the outcomes of focus.

Big Data Depends on Impact Evaluation Methods

To be sure, the trend of big data in the field of global education policy is at odds with the critique presented above. A few examples not only make this point clear, but also emphasize the connection between impact evaluation and big data. The World Bank, for instance, has expanded on its history of offering policy advice based on test scores. After decades of promoting standardized testing, it made “learning for all” the core idea of its most recent education sector strategy paper, and then followed up on this idea by creating “the largest globally comparable panel database of education quality,” which covers the years 1965-2015 and includes 163 countries [6]. The OECD and the World Bank are now working together to expand PISA’s reach by adapting it for “developing” (i.e., middle- and low-income) countries. This new initiative, known as PISA for Development (or PISA-D), will allow education in poor and rich countries alike to be compared and governed by a single exam. But PISA-D is only one of the OECD’s initiatives to expand its influence via “governance by numbers” [7]. To that end, it has also developed tests for individual schools (as opposed to its traditional focus on country-level samples), for adult competencies, and for higher education.

While the data in the above examples require manipulation by regression analysis in order to produce findings about “what works,” there are other initiatives that also incorporate RCTs. For example, an initiative at Stanford University that seeks to make the war on poverty “smart” has brought together a multidisciplinary team of professors who will “use machine learning to cull through huge data sets to understand the many variables that lead to, perpetuate, and potentially even prevent impoverishment” [8]. As described in a recent article about this project, it seeks to bring computer scientists together to “apply data analytics and machine learning to the vast web of information collected,” with the ultimate goal being to apply “predictive models that suggest which interventions are likely to work best in a given context” (Walsh, 2019). Methodologically, to cull through the datasets and identify variables associated with poverty, the team will undoubtedly use some form of regression analysis; this phase will then be followed by another in which it designs and tests, via RCTs, interventions that could help individuals escape poverty.

Moving Beyond Impact Evaluation Methods

Although the idea of big data and the ability to process it is receiving more attention, the introductory point bears repeating: these new initiatives and advances in data collection still depend on methods that have serious limitations. Many new projects launched by international organizations together with governments are certainly ambitious and impressive in their scope, but they are presented with an air of objectivity and certainty that they do not deserve. Not only do proponents of big data avoid or downplay discussion of the methodological pitfalls of impact evaluation, they also fail to acknowledge the political and organizational dynamics that affect both the collection of data (for example, when politicians assign communities to participate in interventions rather than assigning them randomly) and the interpretation of findings, which is necessarily constrained by institutional agendas and the incentive structures in which researchers are embedded [9].

Without a doubt, big data and the methods of impact evaluation can be useful inputs for policy discussions in the context of policy reform and global governance. However, as such methods are increasingly used to guide public policy around the globe, it is essential that those who produce and use them, and stakeholders inside and outside education systems more broadly, understand their weaknesses, both methodologically and in terms of their inability to take the politics out of policymaking. For, indeed, while the promises of big data are seductive, they have not, by and large, replaced the human element of decision making. That is, while the data may construct a certain worldview, and while the interpretation of that data, whether by a researcher or a computer program, may suggest certain policy measures, it is unlikely that major policy decisions will be automated any time soon. The implication, then, is that policymakers and the methods on which they rely should be held to a higher standard, one that goes beyond the combination of big data and impact evaluation. Rather than leading to best or better practices, the intersection of big data and impact evaluation techniques can produce studies and, subsequently, reforms that are costly, detrimental, contextually irrelevant, and/or ineffective.

Attention should be directed at tempering the use of these methods, complementing them with other (qualitative) strategies, and changing the nature of decision-making from an exercise that seeks to be technocratic to one that is openly and unavoidably political. Big data does not deliver policy without politics; rather, it hides much of the bias and politics of the process behind the appearance of technicization. In the current context of global governance, under the influence of international institutions and cooperating national governments, this state of affairs is not likely to produce meaningful benefits for the most marginalized, since neither of these actors is driven by a political calculus that responds to the needs of the poor. In this sense, critically understanding the limitations of big data and impact evaluation is only a first step toward challenging the status quo of global governance more generally—that is, toward thinking beyond the methods, politics, and worldviews that currently permeate and constrain the field of global education policy.

About the Author

D. Brent Edwards Jr. is an Assistant Professor of Theory and Methodology in the Study of Education at the University of Hawaii, Manoa. His work focuses on the global governance of education, methodological critiques of best practice research, and the political economy of education reform.

This blog post stems from the book Global Education Policy, Impact Evaluations, and Alternatives: The Political Economy of Knowledge Production (Palgrave Macmillan, 2018).


[1] Gorur, R., Sellar, S., & Steiner-Khamsi, G. (2018). Big data and even bigger consequences. In R. Gorur, S. Sellar, & G. Steiner-Khamsi (Eds.), World Yearbook of Education 2019: Comparative methodology in the era of big data and global networks. New York: Routledge.

[2] Klees, S. (2016). Inferences from regression analysis: Are they valid? Real-World Economics Review, 74, 85-97.

[3] Spiezia, V. (n.d.). Does computer use increase educational achievements? Student-level evidence from PISA.

[4] Deaton, A. (2010). Instruments, randomization, and learning about development. Journal of Economic Literature, 48(2), 424-455.

[5] Deaton, A., & Cartwright, N. (2016). The limitations of randomised controlled trials. Vox.

[6] World Bank. (2018). Global data set on education quality. Washington, D.C.: World Bank.

[7] Grek, S. (2009). Governing by numbers: The PISA ‘effect’ in Europe. Journal of Education Policy, 24(1), 23-37.

[8] Walsh, D. (2019). Solving poverty using the tools of Silicon Valley. Stanford Graduate School of Business.

[9] Edwards Jr., D. B. (2018). Global education policy, impact evaluations, and alternatives: The political economy of knowledge production. New York: Palgrave Macmillan.


