Causal inference for program theory evaluation

Debates about ways of investigating cause and effect have been a feature of many international evaluation meetings in recent years. It’s something I’ve been wrestling with while finishing a forthcoming book on program theory with Sue Funnell (Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, to be published by Jossey-Bass).

Pic by ralphbijker

Jane raised the issue of causal inference in a post back in February, in a recent presentation, and in Chapter 5 (on causation) of her book Evaluation Methodology Basics. By causal inference we mean both causal attribution (working out what was THE cause) and causal contribution (identifying what was one or more of the causes that together produced the outcomes and impacts).

We both agree that it is an issue that needs to be tackled in evaluation, in ways that are commensurate with the available evaluation resources, and that don’t assume it’s simply a matter of using a particular research design (such as using Randomised Controlled Trials) or data collection method (such as stakeholder interviews).

Where we have taken a different tack is in strategies for causal inference.

Jane’s take on causal inference for evaluation

Jane recently (in her book, Evaluation Methodology Basics) shared a list of 8 strategies for causal inference that can be used for a ‘patchwork’ or bricolage approach to causal inference:

  1. Ask those who have observed or experienced the causation first-hand
  2. Check if the content of the intervention (or, supposed cause) matches the nature of the outcome
  3. Look for distinctive effect patterns (Scriven’s modus operandi method)
  4. Check whether the timing of the outcomes makes sense
  5. Look at the relationship between “dose” and “response”
  6. Use a comparison or control (RCTs or quasi-experimental designs)
  7. Control statistically for extraneous variables
  8. Identify and check the causal mechanisms

My take on causal inference for program theory evaluations

Looking at causal inference for program theory evaluations, I’ve been thinking more broadly about three components of causal inference:

  1. Congruence with the program theory – do the results match the program theory?
  2. Counterfactual comparisons – what would have happened without the intervention?
  3. Critical review – what are plausible alternative explanations for the results?

Combining the two frameworks

Combining these two framings of the issue has identified some additional techniques that should be added to the repertoire, such as comparing results with those predicted by statistical models of the theory or by experts. It also explains why experimental design by itself is inadequate for causal inference: it deals with the counterfactual but may not provide sufficient information on the factual (what actually happened?), and it needs to be complemented by critical review. Some techniques (such as asking participants) can provide all three types of evidence, although none by itself will be sufficient.

A bricolage approach, covering these three different components, with triangulation of sources, seems to be what is needed for really credible causal inference.

What do you think? (And, yes, the cogs in the picture show what happens when there is too much interaction between causal forces).

from: Purposeful Program Theory: Effective Use of Theories of Change and Logic Models, by Patricia J. Rogers and Sue C. Funnell, ISBN: 9780470478578, John Wiley/Jossey-Bass (in press).



Congruence with the program theory – Do the results match the program theory?

  • Comparing achievement of intermediate outcomes with achievement of final outcomes
  • Disaggregating results for complicated interventions
  • Statistically controlling for extraneous variables
  • Modus operandi
  • Comparing timing of outcomes with program theory
  • Comparing dose-response patterns with program theory
  • Comparing statistical model with actual results
  • Comparing expert predictions with actual results
  • Asking participants
  • Asking other key informants
  • Making comparisons across different cases

Counterfactual comparison – What would have happened without the intervention?

  • Control group or comparison group
  • Comparing the trajectory before and after the intervention
  • Thought experiments to develop plausible alternative scenarios
  • Asking participants
  • Asking other key informants
  • Making comparisons across different cases

Critical review – Are there other plausible explanations of the results?

  • Identifying alternative explanations and seeing if they can be ruled out
  • Identifying and explaining exceptions
  • Comparing expert predictions with actual results
  • Asking participants
  • Asking other key informants
  • Making comparisons across different cases
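One way to use this repertoire in planning: treat the techniques above as a mapping from each technique to the component(s) of causal inference it can inform, then check whether a proposed mix covers all three. A minimal sketch in Python (the mapping below is my own abbreviated reading of the lists above, not an authoritative encoding):

```python
# My own abbreviated encoding of the technique lists above: each technique
# mapped to the component(s) of causal inference it can inform.
COMPONENTS = {"congruence", "counterfactual", "critical review"}

EVIDENCE = {
    "control or comparison group": {"counterfactual"},
    "before/after trajectory": {"counterfactual"},
    "thought experiments": {"counterfactual"},
    "modus operandi": {"congruence"},
    "dose-response vs program theory": {"congruence"},
    "statistical model vs actual results": {"congruence"},
    "expert predictions vs actual results": {"congruence", "critical review"},
    "ruling out alternative explanations": {"critical review"},
    "asking participants": COMPONENTS,
    "asking other key informants": COMPONENTS,
    "comparisons across different cases": COMPONENTS,
}

def coverage(techniques):
    """Components of causal inference covered by a chosen mix of techniques."""
    covered = set()
    for t in techniques:
        covered |= EVIDENCE[t]
    return covered

# An RCT plus modus operandi covers the counterfactual and congruence,
# but still leaves critical review unaddressed.
plan = ["control or comparison group", "modus operandi"]
missing = COMPONENTS - coverage(plan)
print("not yet covered:", missing)
```

This makes concrete the point that an experimental design alone is inadequate: it addresses only one of the three components.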

14 comments to Causal inference for program theory evaluation

  • David Turner

    I’ve often thought that ruling out plausible rival hypotheses (your critical review) gets too little attention. This can require developing program theories for the rival explanations themselves, and such alternative theories can be useful. A dash of common sense is also required here, so you don’t get bogged down in exploring too wide a range of possible explanations (the key term is plausible).

  • Chad Green

    There are a number of different angles from which to address the question. The simplest place to start, perhaps, is to view your framework from the dualistic perspective of certainty/uncertainty.

    What’s more important, things or the space between things? The answer is the latter if you’re a systems thinker, of course. In other words, uncertainty is more valuable because deeper truths and influences are to be found there. For instance, what value do we expect to extract from the Large Hadron Collider, a gigantic (17 mile) instrument built in collaboration with over 10,000 scientists from over 100 countries?

    To apply uncertainty to your daily work requires a certain level of “permanent curiosity” (Chambers, 2008), or the view that the most important constructs are always hidden and have yet to be discovered. It becomes a causal force of its own. For example, with respect to your framework, who is the change agent that guides the development of the logic model, the program objectives, methods, instrumentation, fidelity of implementation, and documentation of results? Who empowers the program staff and stakeholders to improve their processes and outcomes? Who clarifies the integrity/identity of the organization by identifying and operationalizing its values, which then guides the organization to the resonant emergence of complex adaptive behaviors?

    What I’m trying to say is that we as professional evaluators, our tools (language, methods, instruments), and the evaluands alike all operate as an integrated circuit. (In physics it is called the observer effect.) Thus if we acknowledge and celebrate our role as a causal force through our publications (online/offline), trainings, conferences, principles, methods, instruments, and daily interactions, then I think we will move closer to another Copernican revolution of which Dr. Scriven spoke earlier. Of course, in order to come full circle again requires that we think outside of the current mental models to the extent that our state of permanent curiosity benefits our constituents, colleagues, the profession, and society as a whole.


  • Laurie Porima

    I would have to agree with David: “the key term is plausible.” And that’s where we get into the bricolage state Patricia mentions. Though when is enough, enough? At some stage the evaluator has to make that decision. Is the time to make the decision when resources have been expended? If so, is this genuine evaluation? Or evaluation determined by available resources? It’s the same for cultural aspects specific to New Zealand that have been mentioned in previous blogs on this site: when is tikanga Maori (Maori ways of doing things) sufficient to help explain something? Whose interpretation is valid? We could go deeper.

    Jane’s framework has 8 items. Patricia’s has three broad items that give flexibility to explore deeper. Both frameworks (JD & PR) indicate that a methodology is present and evidenced by the methods used, and they do help explain causal inference. Was the genesis of this thread around having more than just key informant information to make attribution to causal inference? When will this end? My conclusion is that the evaluator makes that decision (when enough is enough). So long as there is sufficient evidence of exploration, the stake must be put in the ground at some point! That for me is an aspect of “genuine evaluation” in action! Otherwise somebody will come along and add 2 more items to Jane’s framework, perhaps four more to Patricia’s. It’s happened before. First there was formative and summative evaluation, then someone came along and added process and then outcome evaluation. Then along came empowerment, kaupapa Maori and now developmental evaluation. What next? That is the beauty of our profession and a clear indication that genuine evaluation practice is taking place: when we are free to explore deeper the evidence that we have collected! Scrutiny is a given.

    It’s the ability to handle it, and to handle the evaluation critics who are not happy with subjective conclusions and prefer to revert to the RCT regime – that is the killer of genuine evaluation!

  • Michael Scriven

    The way this looks to me, it may still be worth trying for a single overarching model, into which one then fits the bits. Of course, the RCT model can’t fill that role, so I’ve suggested in a JMDE article what I call the General Elimination Model, which works by identifying a list of all possible causes (LOAPC) in this context, and then knocking them out where possible, using the modus operandi and other techniques listed in Jane and Patricia’s approach. This is close to Patricia’s critical review (and to the philosophy of science model called ‘inference to the best explanation’), and treats the useful details in Jane’s bricolage as hints for finding candidates for the LOAPC. Why not add the other two elements in Patricia’s model? Because the first one, the appeal to program theory, can’t be done without a program theory, and you don’t need one of those to do evaluation (though of course it’s nice to have one and adds to what you can do for the client), so we should not make the concept of cause – which you do need – depend on it; and because the second one, the counterfactual, is also not applicable in many cases (namely all cases of overdetermination) and hence also can’t be part of the meaning of causation that we need when talking about impacts as caused by the intervention. But both of these other elements often suggest useful entries for the LOAPC, so it’s good to keep them in the hints list.

    Of course, RCT and other goodies like interrupted time series are very nice ways to eliminate a bunch of possible causes, so they stay in the toolkit for use when needed and cost-feasible/cost-effective. But the GEM approach trumps the ‘gold standard’ simply because it’s the logical framework that is always there. Or so it seems to me…. at the moment……

    Michael Scriven

  • Patricia Rogers

    Thanks for these comments and insights.

    Chad has pointed to the issue of uncertainty, which seems to be at the heart of many problematic approaches to evaluation. Reducing uncertainty, or helping us to act or decide effectively under conditions of uncertainty, seem to me to be appropriate goals for evaluation – but sometimes evaluation is intended to remove apparent uncertainty, and this will be problematic. Efforts to find out ‘What works’ can do more harm than good if they misrepresent the variation (What works for whom in what circumstances) and the uncertainty (What happens in the long term, what are the contributions of particular factors) that exist.

    Laurie has raised the important questions of ‘How much investigation is enough?’ and (implicitly) ‘Who decides this?’. Like most aspects of evaluation design, this often requires negotiation between the evaluation commissioner and the evaluator (if they are separate entities). Going back to the three-part framework, I often see that evaluation designs pay attention to the first and second parts (congruence and counterfactual). Evaluation plans usually state which variables they will measure or describe, with reference to a program theory in many cases, and they often use a research design that addresses counterfactual issues. But how many include time and money for the sort of open-ended analysis required to identify and check out alternative explanations – which might require a quick re-analysis of the data, or some additional data collection, in order to rule out a plausible alternative explanation? It seems that evaluation budgets should include some time that is reserved for this purpose.

    Michael reminds us of the problem of overdetermination for the counterfactual approach to causal inference (for example, having multiple members of a firing squad, any one of whom would be sufficient). This seems to be completely ignored in discussions of causal attribution in development evaluation. I wonder if this is because it is assumed that if a development intervention had not happened, then nothing similar would have happened – and how reasonable is it to assume this? The other big challenge to the counterfactual approach is uncertainty. What would have happened if you had/hadn’t married your first girl/boyfriend? The array of possibilities is so vast that it is not meaningful to talk about THE counterfactual. Comparing plausible scenarios seems to be a better option.

  • David Earle

    The thing that bothers me most about causal inference is the unstated assumption of simple cause and effect, accompanied with an assumption that the important effects are substantively measurable. That is, programme A directly causes a shift from prior state x to post state y.

    As discussed above, there are a lot of good techniques to filter out ‘other things going on’ and extract the contribution specifically attributable to the programme (over and above all else). The danger is that the rest is then discarded under various headings, such as selection effects, placebo effects, etc.

    But “all else” can cover complex (and sometimes quite simple) interactions that are very important to understanding the full effect of the programme. By just trying to distill a linear causation story out of the evaluation, we lose the back and forth interactions that actually enable the programme to work.

    So what I would like to see more of is discussion of the net effects attributable to the programme (which the techniques above are very useful for) – with a richer discussion of the multiple interactions involved with the programme. And less of a tendency to impose simplistic cause and effect stories onto complex interventions.

  • Irene Guijt

    Thanks for the thought-provoking discussions. I recognise my own situation in David Earle’s comment. Am coordinating the M&E of a large (11 country, 170+ organization, diffuse – policy influencing) program with multiple donors (each with its own priorities and different time lines covering the 2007–2014 time frame). And causal inference is important in that context to know, obviously, what to continue, change, stop, add to the mix of interventions. But also, clearly, challenging – ahum.

    Also, with a 120,000 USD annual M&E budget (that’s for every single cost item on M&E), there is also the tension between what is ideal (see this discussion) and what is feasible. So I’m also keen to know what are the real non-negotiables – and what can causally be claimed on the basis of that, and what are the ideal add-ons, and what could be claimed in terms of causal inference ‘truth’ on that basis.

    I think that part of the problem is that many in the evaluation profession claim too much solidity of findings on the basis of supposedly impartial investigation. But such impartial inquiries are often infeasible given time, resources, etc. Over-investing in zooming in on ascertaining that single solid cause-effect story is very time-consuming and may mean that the richer discussion of multiple interactions (see Earle’s posting) does not get adequate time. Just pondering out loud on today’s challenges for me in relation to these discussions….

  • Chad Green

    I have two elements to add to the simplicity and utility of Patricia’s conceptual framework in the form of two self-reflection questions from the perspective of the evaluator as change agent:

    – Pre-evaluation: How do I visualize my role as a change agent given my level of involvement in the program? To what extent will my practices contribute to program outcomes and the overall organizational climate?
    – Post-evaluation: What outcomes and impacts could be attributed to my application of evaluation practices? What messages, both intentional and unintentional, did I convey in my capacity as a causal force for positive change?

    The purpose of these questions is to elicit double-loop learning (Argyris & Schön, 1974) on behalf of all change agents, not just the evaluator; in other words, for the evaluator to ensure that the evaluation is valued and worth his/her time, talent, and effort.

    As for the other comments, I would agree mostly with David Earle from the perspective that programs are merely reflections of the organizational context (i.e., social interactions). For example, alluding to my earlier post, let’s assume that rather than embracing uncertainty, an organization instead has an unspoken fear of it. Furthermore, let’s say this fear is sophisticated enough that the good intentions of the program designers are overshadowed by the unintentional messages communicated by program staff, stakeholders, and sponsors. (This is the sign of our times!) At some point it would behoove the evaluator, sooner rather than later, to switch gears and advocate instead for an organizational development intervention (reset).

    Or to look at it another way, let’s assume two organizational leaders with the same fidelity of program implementation. The first leader carefully created the conditions for success (purposefulness) whereas the second leader used sheer force of will (purposiveness). Which program is more effective from an RCT study approach?

    Speaking of causal wars, here’s a favorite quote of mine from gifted education that I think represents an elegant solution to the tension between behaviorism and cognitivism: “Intelligence is not how much you know or how fast you learn, but how you behave when you don’t know the answer” (Carol Morreale). What causes individuals to behave appropriately in times of uncertainty? Could it have something to do with our ongoing search for meaning in the human condition? How can organizations enable this process? Surely we should have some answers by now on how all these causal forces should converge. Daniel Pink elaborates in this animated video:

    Drive: The surprising truth about what motivates us


  • Chad Green

    Irene, I think it is possible to engage in rich discussion on the complexity of interactions even at your level of program implementation. In fact, I will be presenting on this topic at the next AEA conference. From my experience, it requires a combination of resources including: (a) a conceptual framework capturing the ideal state of the system, (b) data collection methods and tools based on the operationalized framework, (c) documentation of findings, (d) creation of a situation model (Kintsch, 1988) that captures the current state of the mental model of the system, (e) a visualization of the model, and (f) a decision-making process that facilitates sense-making with the situation model for continuous process improvement. It sounds complicated, but in reality the most challenging (and rewarding) component is the development of the framework.


  • Scott Bayley

    I like how Patricia is trying to disentangle this situation. I also want to propose another way of approaching this topic.

    It concerns me that popular textbooks and guidance papers on impact evaluation generally fall into the trap of considering the merits of particular research designs/methods without actually discussing the evidentiary criteria for making causal inferences.

    According to the British economist and philosopher John Stuart Mill (1843), in order to say that program (X) has caused impact (Y), three criteria need to be satisfied:

    1. ‘X’ occurs in time before ‘Y’ (temporal order)
    2. There is a relationship between participating in program ‘X’ and achieving impact ‘Y’ (covariation of X and Y)
    3. All non-program explanations for the relationship between ‘X’ and ‘Y’ can be ruled out (the relationship is then said to be non-spurious).

    In my view, decontextualized debates about the merits of particular research designs are a misleading distraction. Our discussions should focus on how best to gather evidence against criteria 1-3 in a particular evaluation context. Of course one may not agree with Mill’s criteria, and if that is the case philosophy offers alternative views. But in any case, let’s focus on evidentiary criteria first, and the merits of particular research designs second.

    Food for thought.

  • Chad Green

    Scott, I wish that Mill’s criteria could be applied to all research designs/methods, but in reality I would be hesitant to use them even if the focus was on an individual evaluand.

    Causality is just too complex. For example, Aristotle not only identified four kinds of causes, which he said could have reciprocal and contrary effects, but he also identified two modes of causation (prior and accidental) for each cause, which he further branched into three subcategories (potential, particular, generic). Given all these possibilities, does the temporal order between cause and effect maintain any significance?

    My research on text comprehension led to the same question. For example, Kintsch’s Construction-Integration Model posits that comprehension doesn’t result from interactive variables operating sequentially (like a computer) but rather simultaneously from both top-down and bottom-up processes. This theory doesn’t leave much room for temporal order either.

    It took a while, but at some point I gradually let go of the primacy of temporal order and replaced it with more useful ideas: construction and self-organization. So if we adapted Mill’s equation accordingly, it would look something like this:
    1: “X” (re)constructs “Y” and vice versa to physical ends.
    Steps 2 and 3 are unnecessary.

    Personally as an evaluator, I have found this simpler worldview to be quite liberating because you can then focus your effort on understanding the universal forms of change (i.e., holons) from which justice, truth, equality, beauty, and other objects of knowledge emerge. This new frame of reflection is perhaps why I see our role as key organizational change agents so clearly.

    BTW, has anyone noticed that, despite all of the paradigm changes to the contrary since Aristotle, our methods, knowledge, and tools reflect nothing but mere sophisticated anthropocentrism? I suppose it is only natural to view things as what they seem since we often base our data on the perceptions of others (i.e., the lowest of Plato’s “divided line”; see data sources listed above).


  • Scott Bayley

    Dear Chad

    Thank you for your thoughtful comments. My main point is not so much to advocate for Mill’s criteria. I’m trying to suggest that we should first focus on evidentiary criteria for making causal inferences and this will then help us to choose our research methods. It seems to me that starting our discussions with a focus on research methods is a dead end.

    Chad, could you tell us a bit more about your philosophical take on causality and the more useful ideas that you refer to, i.e. construction and self-organization?

  • Irene Guijt

    Thanks Chad. Very helpful. I have found (considerable) resistance to articulating the ideal state and existing mental models. So am working with a very ‘gappy’ representation of a hierarchy of inputs to goals. I hope to be at AEA this year so would welcome a talk on this.

  • Chad Green

    Irene: The path of least resistance to the ideal state, so I have found, is by identifying and empowering key boundary spanners who already reflect those values within the organization. These “superhubs” thrive at the emergent level of the organization, that is, at the informal level of social relations where events are driven primarily by shared purpose. This is where the ideal state should germinate. If you can develop a trusting relationship with these influential people and encourage them to work together informally toward a common purpose, then you’ll be on the path of least resistance.

    Scott: Regarding my philosophical take on evaluation, Patton captures the ideal state well in Chapter 1 (see right column on pp. 23-26). In all honesty, I see this framework in its entirety as the clearest example yet of the institutional integrity of our profession.

    As for causality, I subscribe to the Naturalistic/Constructivist Inquiry paradigm (e.g., see Lincoln & Guba, 1985, for an excellent critique) since it aligns well with situation model theory. If combined with the appreciative inquiry process, I can reformulate the identified strengths of the program (i.e., the situation model) to reveal a pattern of interrelationships in the mental model of the system as a whole.