Building on an earlier discussion Michael Scriven started about long-term effects (what to do with them and without them), I’m interested in people’s thoughts on a related issue.
In the medical profession in particular, there are some very rigid beliefs about what constitutes good enough “evidence of effectiveness” to justify offering, recommending, allowing patients to try, or even just not vehemently opposing a particular type of treatment for a patient. [There are obviously some parallels in other sectors, such as education, social services, international development, criminal justice, etc, but let’s start with some medical examples for now.]
There are some glimmers of hope in other sectors (e.g. in the Best Evidence Synthesis work here in New Zealand). But there are still three areas where there are very serious challenges in building a credible evidence base given the kinds of constraints and realities surrounding them. They are: (1) cutting-edge treatments; (2) treatments that are by their very nature tailored/individualized rather than standardized across patients or populations; and (3) learning what works for small sub-populations.
1. Cutting-edge treatments
Advancements are being made in medical practice all the time, and many of these are initially developed by clinicians (doctors, specialists, surgeons) trying a new approach on a limited number of patients, e.g. when the standard treatments are either not working, or when there’s a plausible idea about how to improve benefits for patients.
In order for a new idea to be trialled on a larger scale, it must be picked up by individuals with a research/evaluation agenda, rather than just an ongoing medical practice. From there, there’s a very long and slow process from writing a grant, through getting it funded, conducting the evaluation, writing it up, then submitting it to a peer-reviewed journal, going through the entire review process, before it is finally published and considered actual “evidence”. On top of this, top journals exhibit a strong preference for RCTs over other types of designs.
In a presentation entitled Ethical Conflicts in Randomized Controlled Trials, Dr. Robert Truog (Harvard professor of anaesthesia, pediatrics, and medical ethics, and chief of the Division of Critical Care Medicine at Boston Children’s Hospital) lists eight approaches to learning about what works in medicine, in ascending order of confidence:
- Anecdotal Case Reports
- Case Series without Controls
- Case Series with Literature Controls
- Case Series with Historical Controls
- Case / Control Observational Studies
- Randomized Controlled Trials
Truog argues that RCTs are not the only way to learn, even in the medical profession: “Phase I and II trials, which precede RCTs, often provide strong evidence for effectiveness.”
When should we think about alternatives to the RCT? Truog lists four conditions:
- When therapies are potentially life-saving
- When evaluating rapidly developing technologies (improvements in both experimental and control treatments may make the results of an RCT obsolete by the time it is published)
- When RCTs are not the most efficient way to acquire knowledge
- When the non-randomized data [are] compelling
Cutting-edge treatments often meet several of the above conditions, and the reality is that formal RCTs are always going to be way behind the technology. Because of the timeframes involved, the results of RCTs are often “old news” by the time they appear in print. In addition, there are often ethical dilemmas in the rigid use of RCTs. As Robert Truog asks …
“Who wants to be the last patient enrolled in the control arm of a positive randomized controlled trial?”
The same is equally true for an RCT of an educational, community health, international development, or business development intervention.
2. Tailored/individualized and adaptive treatments
In the medical and health professions, as in many other arenas, there are certain treatments (or programs/initiatives) that by their very nature must be completely tailored to the individual (or to the community, or to the organization) and/or that must be responsive to changing needs and need to be adapted over time.
One medical example of this is acupuncture and the use of Chinese herbs. For individuals with the same general Western diagnosis (e.g. depression, back pain, infertility), and even with the same basic underlying medical cause for that diagnosis (e.g. endometriosis, polycystic ovaries, diminished ovarian reserve), the Chinese medicine diagnosis of the underlying imbalances may differ substantially. A competent acupuncturist will proceed with a highly individualized treatment based on each person’s specific (Western and Eastern) diagnosis, and will reassess at each session and tweak the treatment accordingly.
Clearly, this individualization and constant tweaking of treatment are at odds with the usual approach to RCTs, which is to standardize treatment and have each practitioner deliver it in exactly the same way. [There are some exceptions to this problem, e.g. some RCTs have been conducted to evaluate specific acupuncture treatments before and after IVF transfer, with statistically and practically significant effects documented. In fertility treatment, this covers just one very specific short-term application, but not the kinds of longer-term treatments that are also commonly used by couples experiencing infertility.]
An additional complication for evaluating acupuncture treatment is that diagnosis requires skilled professional judgment and (given that treatment cannot be simplistically standardized) treatment efficacy is highly dependent on the competence of the practitioner. A large-scale RCT would need to use several practitioners whose competence may vary widely, and this cause of variance could easily wash out effects.
This challenge is not limited to healthcare and medicine. Think about organizational development or community development initiatives. We have all heard countless examples of programs that really only worked amazingly well because of the passion of one or two highly committed people at key locations. Or that needed to be adapted locally to respond to changing needs and aspirations (or because they were initially not well enough understood). If the intervention couldn’t be standardized across multiple locations, it doesn’t fit the mold very well for an RCT.
3. What works for small subpopulations?
A third major challenge in working out “what works for whom” in medicine is that some patient subgroups have very specific combinations of factors that may lend themselves to particular kinds of treatments, but these populations are too small in number to even develop an RCT or any other quantitative design with sufficient statistical power to meet the usual requirements for publication. Or, the “target audience” for the findings is considered too narrow.
A good example is looking at the effectiveness of IVF treatment. It’s very easy to find a substantial sample size of women in their 30s with, say, blocked fallopian tubes or endometriosis – they often have insurance coverage for infertility or are eligible for publicly funded treatment, so there are plenty trying various IVF protocols (large N) and there is quite good knowledge about what works for them.
But suppose we wanted to understand what works for women over 40, or (even harder) over 42, who have specific diagnoses? First, the numbers are naturally lower for this group because most couples have completed their families by this age. For those still trying, the woman’s age and/or her specific diagnoses often mean that she is not eligible for insurance coverage or publicly funded treatment. So, there are far fewer trying IVF, and even fewer again for the specific diagnoses that are likely to make one ineligible for insurance or publicly funded treatment.
The reality is that some specific sub-populations will never be large enough in numbers to allow the use of RCTs to learn what works. But at the same time, certain clinicians will refuse to allow the patient to try treatment approaches that have not been supported by what they consider to be “solid” clinical trials.
At the same time, there are certain clinicians around the world who are known to be at the top of their fields in dealing with specific types of case (such as women over 40). However, only some of them publish their findings, and often their work is sidelined by mainstream medicine as being “fringe” – and the limited sample sizes and only semi-standardized treatment protocols trigger further snorts of derision about the quality of their “evidence”.
The same is again true in education, community health, international development, business, and just about any other field one can name.
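To make the statistical power point concrete, here is a minimal sketch of the standard normal-approximation formula for the number of patients needed per arm when comparing two success rates. The function name `n_per_arm` and the success rates passed to it are hypothetical, chosen only for illustration; they are not drawn from any IVF study or from Truog’s presentation.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control: float, p_treatment: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per arm to compare two success rates.

    Uses the textbook normal-approximation formula for a two-sided
    test of two independent proportions.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical success rates, for illustration only (not real data).
print(n_per_arm(0.10, 0.15))  # modest improvement over the control rate
print(n_per_arm(0.10, 0.20))  # larger improvement needs fewer patients
```

Under these assumed rates, even the larger improvement calls for a couple of hundred patients per arm, which is the practical sense in which a small sub-population can be “too small” for a conventionally powered trial.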
Where does this leave us – and where to next?
Right now, in medicine (and to varying degrees elsewhere), it’s only a small exaggeration to say:
- If you are seeking a “tried and true” (as supported by RCTs, or by other studies published in peer-reviewed journals) approach, you will only have access to “old” treatments and initiatives – and (in the case of RCT evidence) only those that can be completely standardized.
- If you’re after something cutting-edge or that needs to be tailored or adapted mid-stream, you have to pin your hopes on anecdotal evidence (and hope your physician or funder will support you).
- If you’re a member of a relatively large or typical subgroup, your treatment can be informed by evidence from RCTs and other published studies with a decent sample size.
- But if you’re in a very small minority sub-population, all we have is “anecdotal case studies” and the whole exercise is basically a crap-shoot.
Here in Aotearoa New Zealand, we have seen some very high quality government-funded work integrating a range of qualitative, quantitative and mixed method evidence about what works in education – the Iterative Best Evidence Synthesis (BES). A short quote from the Guidelines for Generating a Best Evidence Synthesis Iteration explains how evidence is selected for inclusion:
The [New Zealand] Ministry of Education is using the term ‘best’ within the best evidence synthesis programme to describe a body of evidence that provides credible evidence, and explanations for, influences that have made, and can make a bigger difference to desirable learner outcomes for diverse learners simultaneously. The criterion for selection of evidence for a best evidence synthesis is that the research provides evidence about impacts on learner outcomes. …
This criterion for selection of evidence means that research from a wide range of methodological designs (including for example, action research studies, case studies, microgenetic studies of classroom processes, ethnographic-outcome focused studies, quasi-experimental research, multiple regression studies, longitudinal studies and experimental research) can make valued contributions to a best evidence synthesis. The point of synthesis is that a cumulative body of research, carefully interrogated, provides more explanatory power than findings from any one research study or design type. (p. 33)
This is in stark contrast to the U.S.-based What Works Clearinghouse (WWC) evidence standards:
The WWC reviews each study that passes eligibility screens to determine whether the study provides strong evidence (Meets Evidence Standards), weaker evidence (Meets Evidence Standards with Reservations), or insufficient evidence (Does Not Meet Evidence Standards) for an intervention’s effectiveness. Currently, only well-designed and well-implemented randomized controlled trials (RCTs) are considered strong evidence, while quasi-experimental designs (QEDs) with equating may only meet standards with reservations; evidence standards for regression discontinuity and single-case designs are under development.
As a humorous side note, Michael Scriven recently (on EVALTALK) nicknamed the WWC the “WWQNC, standing for What Works for Quantitative Nerds Clearinghouse (pronounced ‘WONKS’)”.
While it’s very heartening to see some more enlightened evidence synthesis work such as NZ’s BES, I am still not sure we yet have good evidence accumulation and synthesis solutions for:
- cutting-edge treatments where the technology and thinking are changing faster than RCTs (or even other large-scale long-term evaluation designs) can usefully inform
- individualized, tailored, and adapt-as-you-go initiatives
- small sub-populations that need to know what’s going to work for them
Are there ways, in medicine, to accumulate knowledge directly from clinicians and aggregate that to get approximate answers to these “what works for whom and under what conditions” questions? [I recently had a discussion with a medical academic who insisted it definitely was NOT possible!]
Are there ways in which outcome data and other learnings from localized small-scale initiatives can be meaningfully aggregated? I have been working on several projects that attempt to do just this (one in special education, one in primary school literacy, one for evaluating a nationwide strategy designed to help Māori (NZ indigenous) students enjoy education success as Māori) but would be interested in how others have gone about the same.
For more on RCTs, see also my short JMDE (2006) editorial: The RCTs-Only Doctrine: Brakes on the Acquisition of Knowledge?