What constitutes “evidence”? Implications for cutting-edge, tailored treatments, and small sub-populations

Building on an earlier discussion Michael Scriven started about long-term effects (what to do with them and without them), I’m interested in people’s thoughts on a related issue.

In the medical profession in particular, there are some very rigid beliefs about what constitutes good enough “evidence of effectiveness” to justify offering, recommending, allowing patients to try, or even just not vehemently opposing a particular type of treatment for a patient. [There are obviously some parallels in other sectors, such as education, social services, international development, criminal justice, etc, but let’s start with some medical examples for now.]

There are some glimmers of hope in other sectors (e.g. in the Best Evidence Synthesis work here in New Zealand). But there are still three areas where there are very serious challenges in building a credible evidence base given the kinds of constraints and realities surrounding them. They are: (1) cutting-edge treatments;  (2) treatments that are by their very nature tailored/individualized rather than standardized across patients or populations; and (3) learning what works for small sub-populations.

1. Cutting-edge treatments

Advancements are being made in medical practice all the time, and many of these are initially developed by clinicians (doctors, specialists, surgeons) trying a new approach on a limited number of patients, e.g. when the standard treatments are either not working, or when there’s a plausible idea about how to improve benefits for patients.

In order for a new idea to be trialled on a larger scale, it must be picked up by individuals with a research/evaluation agenda, rather than just an ongoing medical practice. From there, there’s a very long and slow process: writing a grant, getting it funded, conducting the evaluation, writing it up, submitting it to a peer-reviewed journal, and going through the entire review process, before it is finally published and considered actual “evidence”. On top of this, top journals exhibit a strong preference for RCTs over other types of designs.

Harvard professor of anaesthesia, pediatrics, and medical ethics and chief of the Division of Critical Care Medicine at Boston Children’s Hospital Dr. Robert Truog, in a presentation entitled Ethical Conflicts in Randomized Controlled Trials, lists eight approaches to learning about what works in medicine, in ascending order of confidence:

  1. Anecdotal Case Reports
  2. Case Series without Controls
  3. Case Series with Literature Controls
  4. Case Series with Historical Controls
  5. Databases
  6. Case / Control Observational Studies
  7. Randomized Controlled Trials
  8. Meta-analyses

Truog argues that RCTs are not the only way to learn, even in the medical profession: “Phase I and II trials, which precede RCTs, often provide strong evidence for effectiveness.”

When should we think about alternatives to the RCT? Truog lists four conditions:

  1. When therapies are potentially life-saving
  2. When evaluating rapidly developing technologies (improvements in both experimental and control treatments may make the results of an RCT obsolete by the time it is published)
  3. When RCTs are not the most efficient way to acquire knowledge
  4. When the non-randomized data [are] compelling

Cutting-edge treatments often meet several of the above conditions, and the reality is that formal RCTs are always going to be way behind the technology. Because of the timeframes involved, the results of RCTs are often “old news” by the time they appear in print. In addition, there are often ethical dilemmas in the rigid use of RCTs. As Robert Truog asks …

“Who wants to be the last patient enrolled in the control arm of a positive randomized controlled trial?”

The same is equally true for an RCT of an educational, community health, international development, or business development intervention.

2. Tailored/individualized and adaptive treatments

In the medical and health professions, as in many other arenas, there are certain treatments (or programs/initiatives) that by their very nature must be completely tailored to the individual (or to the community, or to the organization) and/or that must be responsive to changing needs and need to be adapted over time.

One medical example of this is acupuncture and the use of Chinese herbs. For individuals with the same general Western diagnosis (e.g. depression, back pain, infertility), and even with the same basic underlying medical cause for that diagnosis (e.g. endometriosis, polycystic ovaries, diminished ovarian reserve), the Chinese medicine diagnosis of the underlying imbalances may differ substantially. A competent acupuncturist will proceed with a highly individualized treatment based on each person’s specific (Western and Eastern) diagnosis, and will reassess at each session and tweak the treatment accordingly.

Clearly, this individualization and constant tweaking of treatment are at odds with the usual approach to RCTs, which is to standardize treatment and have each practitioner deliver it in exactly the same way. [There are some exceptions to this problem, e.g. some RCTs have been conducted to evaluate specific acupuncture treatments before and after IVF transfer, with statistically and practically significant effects documented. In fertility treatment, this covers just one very specific short-term application, but not the kinds of longer-term treatments that are also commonly used by couples experiencing infertility.]

An additional complication for evaluating acupuncture treatment is that diagnosis requires skilled professional judgment and (given that treatment cannot be simplistically standardized) treatment efficacy is highly dependent on the competence of the practitioner. A large-scale RCT would need to use several practitioners whose competence may vary widely, and this cause of variance could easily wash out effects.

This challenge is not limited to healthcare and medicine. Think about organizational development or community development initiatives. We have all heard countless examples of programs that really only worked amazingly well because of the passion of one or two highly committed people at key locations. Or that needed to be adapted locally to respond to changing needs and aspirations (or because they were initially not well enough understood). If the intervention couldn’t be standardized across multiple locations, it doesn’t fit the mold very well for an RCT.

3. What works for small subpopulations?

A third major challenge in working out “what works for whom” in medicine is that some patient subgroups have very specific combinations of factors that may lend themselves to particular kinds of treatments, but these populations are too small in number to even develop an RCT or any other quantitative design with sufficient statistical power to meet the usual requirements for publication. Or, the “target audience” for the findings is considered too narrow.
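To give a rough sense of scale, here is a minimal sketch of the usual normal-approximation sample-size formula for comparing two success rates in a two-arm trial. The function name `n_per_arm` and the success rates plugged into it are hypothetical, chosen only for illustration and not drawn from any study.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate patients needed in EACH arm of a two-arm trial
    comparing two success rates (normal approximation, two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical success rates, for illustration only (not from any study).
needed = n_per_arm(p_control=0.10, p_treatment=0.15)
print(f"Patients needed per arm: {needed}")
```

With these made-up rates the requirement runs to several hundred patients per arm, and a smaller expected improvement pushes the number up sharply. That is the arithmetic behind the problem for small subgroups.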

A good example is looking at the effectiveness of IVF treatment. It’s very easy to find a substantial sample size of women in their 30s with, say, blocked fallopian tubes or endometriosis – they often have insurance coverage for infertility or are eligible for publicly funded treatment, so there are plenty trying various IVF protocols (large N) and there is quite good knowledge about what works for them.

But suppose we wanted to understand what works for women over 40, or (even harder) over 42, who have specific diagnoses? First, the numbers are naturally lower for this group because most couples have completed their families by this age. For those still trying, the woman’s age and/or her specific diagnoses often mean that she is not eligible for insurance coverage or publicly funded treatment. So, there are far fewer trying IVF, and even fewer again for the specific diagnoses that are likely to make one ineligible for insurance or publicly funded treatment.

The reality is that some specific sub-populations will never be large enough in numbers to allow the use of RCTs to learn what works. But at the same time, certain clinicians will refuse to allow the patient to try treatment approaches that have not been supported by what they consider to be “solid” clinical trials.

At the same time, there are certain clinicians around the world who are known as top of their fields in dealing with specific types of case (such as women over 40). However, only some of them publish their findings, and often their work is sidelined by mainstream medicine as being “fringe” – and the limited sample sizes and only semi-standardized treatment protocols trigger further snorts of derision about the quality of their “evidence”.

The same is again true in education, community health, international development, business, and just about any other field one can name.

Where does this leave us – and where to next?

Right now, in medicine (and to varying degrees elsewhere), it’s only a small exaggeration to say:

  • If you are seeking a “tried and true” (as supported by RCTs, or by other studies published in peer-reviewed journals) approach, you will only have access to “old” treatments and initiatives – and (in the case of RCT evidence) only those that can be completely standardized.
  • If you’re after something cutting-edge or that needs to be tailored or adapted mid-stream, you have to pin your hopes on anecdotal evidence (and hope your physician or funder will support you).
  • If you’re a member of a relatively large or typical subgroup, your treatment can be informed by evidence from RCTs and other published studies with a decent sample size.
  • But if you’re in a very small minority sub-population, all we have is “anecdotal case studies” and the whole exercise is basically a crap-shoot.

Here in Aotearoa New Zealand, we have seen some very high quality government-funded work integrating a range of qualitative, quantitative and mixed method evidence about what works in education – the Iterative Best Evidence Synthesis (BES). A short quote from the Guidelines for Generating a Best Evidence Synthesis Iteration explains how evidence is selected for inclusion:

The [New Zealand] Ministry of Education is using the term ‘best’ within the best evidence synthesis programme to describe a body of evidence that provides credible evidence, and explanations for, influences that have made, and can make a bigger difference to desirable learner outcomes for diverse learners simultaneously. The criterion for selection of evidence for a best evidence synthesis is that the research provides evidence about impacts on learner outcomes. …

This criterion for selection of evidence means that research from a wide range of methodological designs (including for example, action research studies, case studies, microgenetic studies of classroom processes, ethnographic-outcome focused studies, quasi-experimental research, multiple regression studies, longitudinal studies and experimental research) can make valued contributions to a best evidence synthesis. The point of synthesis is that a cumulative body of research, carefully interrogated, provides more explanatory power than findings from any one research study or design type. (p. 33)

This is in stark contrast to the U.S.-based What Works Clearinghouse (WWC) evidence standards:

The WWC reviews each study that passes eligibility screens to determine whether the study provides strong evidence (Meets Evidence Standards), weaker evidence (Meets Evidence Standards with Reservations), or insufficient evidence (Does Not Meet Evidence Standards) for an intervention’s effectiveness. Currently, only well-designed and well-implemented randomized controlled trials (RCTs) are considered strong evidence, while quasi-experimental designs (QEDs) with equating may only meet standards with reservations; evidence standards for regression discontinuity and single-case designs are under development.

As a humorous side note, Michael Scriven recently (on EVALTALK) nicknamed the WWC the “WWQNC, standing for What Works for Quantitative Nerds Clearinghouse (pronounced ‘WONKS’)”.

While it’s very heartening to see some more enlightened evidence synthesis work such as NZ’s BES, I am still not sure we yet have good evidence accumulation and synthesis solutions for:

  1. cutting-edge treatments where the technology and thinking are changing faster than RCTs (or even other large-scale long-term evaluation designs) can usefully inform
  2. individualized, tailored, and adapt-as-you-go initiatives
  3. small sub-populations that need to know what’s going to work for them

Are there ways, in medicine, to accumulate knowledge directly from clinicians and aggregate that to get approximate answers to these “what works for whom and under what conditions” questions? [I recently had a discussion with a medical academic who insisted it definitely was NOT possible!]

Are there ways in which outcome data and other learnings from localized small-scale initiatives can be meaningfully aggregated? I have been working on several projects that attempt to do just this (one in special education, one in primary school literacy, one for evaluating a nationwide strategy designed to help Māori (NZ indigenous) students enjoy education success as Māori) but would be interested in how others have gone about the same.

For more on RCTs, see also my short JMDE (2006) editorial: The RCTs-Only Doctrine: Brakes on the Acquisition of Knowledge?

12 comments to What constitutes “evidence”? Implications for cutting-edge, tailored treatments, and small sub-populations

  • As a wee aside, on the topic of ‘what is the job of evidence’ it was rather interesting to hear [University of Michigan-based kiwi political scientist] Rob Salmond talking on National Radio this afternoon [topic starts at about 5:30 on the audio file]. He argued that politicians in New Zealand, from all sides of the political spectrum, see the job of evidence as supporting a position they already hold, rather than as a tool to help make a decision. He believes that once they’ve found evidence consistent with their position, they stop looking. The consequence is that we end up with lower quality policy decisions, because they are not based on an analysis of large amounts of evidence. To illustrate his point, he noted that in this week’s budget, the Prime Minister is correct when he argues that the US top tax rate is 35c and that this doesn’t kick in until someone earns $370,000. However, the conclusion that the Prime Minister invites NZers to draw, that anyone earning less than this pays less tax, is wrong because Americans pay a whole range of other taxes, such as State Tax and Social Security Tax. He also illustrated his point with the comparisons that are made between Australia’s lower headline company tax rate and NZ’s. This is factually correct; however, once again the implication made is that NZ needs to reduce its company tax to stop companies going across the Tasman. What is not presented is that Australian companies pay more than 2 times the Accident Insurance levies that NZ companies do, and employee retirement costs are more than 4 times those of NZ. A case of selective use of evidence to come to decisions about what’s needed. Seems uncannily like the selective use of only RCT evidence to make important decisions about what might be ‘best’ in particular circumstances to me!

  • Jane Davidson

    Interesting piece, Kate (I’ve added links to the audio). I do think we have come to expect this sort of thing from politicians (unfortunately), but what gets me is the total lack of media vigilance to this sort of misleading information – see the earlier post entitled: The media and evaluation reporting – clueless or unscrupulous?

    I am not quite convinced, though, that the deliberate use of factually correct but misleadingly presented evidence by politicians is quite the same species of selective evidence use as the RCTs-only doctrine. I suppose they are both selective for ideological reasons – in one case political ideology; the other methodological ideology.

    Robert Truog (who I mentioned above) included a nice quote from Benjamin Freedman: “The use of statistics in medical research has been compared to a religion: it has its high priests (statisticians), supplicants (journal editors and researchers), and orthodoxy (for example, p<.05 is “significant”)”

  • David Earle

    Lots to digest there – but one immediate thought from the medical world; and that is the difference between the researchers’ notion of evidence and the diagnosticians’ notion (e.g. nurse, GP, specialist).

    The latter is dealing with real individual cases that may or may not fit the pre-established rules from the evidence. For them evidence is reading the symptoms as best as possible and trying a best-fit treatment. It is a much more heuristic approach to ‘what works for whom’. I think a lot of our social evaluation is the equivalent of the diagnostic end of the work – and the literature of diagnostics may be a more fruitful area for comparison.

    To put this into an example – I present to my doctor with an ongoing dry cough. He does some non-invasive checks. Decides there is not a lot going on. He says it’s most likely asthma, could be a virus and just might be an infection. Asthma is the easiest to treat and the treatment will also assist with clearing minor viruses or infections. So let’s treat this as though it is asthma, but if you get any of the following symptoms or it doesn’t improve within a week, come back to me.

    Behind this is a whole lot of RCT drug trials – but also a lot of practical experience and decision-making paths to come up with the most appropriate treatment for the case.

  • Jane Davidson

    Thanks, David, for your thoughts. Yes, a key issue here is that clinicians draw on both research evidence and their own clinical knowledge/experience to try and identify the appropriate treatment.

    If I understand you correctly, the case you describe is a clinician identifying several possible causes and prescribing treatments that have presumably been trialled and are known to be effective for treating asthma or viruses or bacterial infections, respectively. So, the key issue is which causes of the condition apply, and how to treat the condition in the face of uncertainty about which causes are in play.

    But what if your condition, or the possible causes of your condition, is/are quite rare, and/or what if a possible treatment is very new or needs to be tailored carefully for each patient, so that the research evidence about what works is very hard to find and may consist of (say) a few case studies with ‘self as control’ and perhaps some stronger designs (larger samples, better comparisons) but with a much more diverse group of patients (most of whom are not like you), thereby making it hard to figure out whether the approach would work in your specific case?

    If your dr is working on the cutting edge or specializes in treating patients with your specific combination of symptoms and other characteristics, then they will have their own clinical knowledge and experience to draw on. Other clinicians, if they only value RCT-type evidence, may refuse to treat you with anything other than standard approaches that you have already tried and that are supported with what they consider to be “evidence”.

    So, what I’m wondering is: Do we have any good mechanisms for picking up that smaller-scale, local practice-based clinical knowledge, experience and know-how and aggregating it up so that it’s “as a set” stronger than “anecdotal” and can then be used to inform other practitioners about what works?

    In medicine, it seems to me, there appears to be strong resistance to this idea. And so valuable knowledge and know-how remains trapped within the clinical practices of those few who tinker away with the difficult cases or the experimental approaches, which (it seems to me) impedes patients’ ability to access treatment that is truly tailored to their condition unless they are lucky enough to be working with certain specific drs.

    [OK, voting with one’s feet is one option, but as we know, publicly funded health systems like ours often don’t allow it, and “in network”/”out of network” rules often prevent it in health insurance-based systems.]

    Just extrapolating back into the domains where more of us work, it seems to me that similar barriers to gathering and aggregating on-the-ground small-scale findings slow the advancement of good ideas in community change, international development, education, business, the not-for-profit sector, and many more …

    I’ve been doing some work recently with the problem of how to make small-scale evaluations that don’t use the exact same measures “aggregable” – a sort of qualitative/mixed method meta-analysis idea, if you like, something that is a step up from systematic reviews but that doesn’t exclude findings based on qualitative or mixed method data the way meta-analysis does.


  • Dear Dr Davidson

    You wrote:
    “In the medical profession in particular, there are some very rigid beliefs about what constitutes good enough “evidence of effectiveness” to justify offering, recommending, allowing patients to try, or even just not vehemently opposing a particular type of treatment for a patient. [There are obviously some parallels in other sectors, such as education, social services, international development, criminal justice, etc, but let’s start with some medical examples for now.]”
    And continued:
    “I suppose they are both selective for ideological reasons – in one case political ideology; the other methodological ideology.”
    OK, the debate has started. Please let me extend it!
    I suppose we encounter with both political ideology & methodological ideology simultaneity in the case of RCTs.

    To widen the debate in this way, we must ask ourselves: What are the relationships between policy, profit, propriety & power in our profession & scope?

    Some explanation for this question can be found at this address:


    And my response to these explanations is at this one




  • Jane Davidson

    Thanks for your contribution to the discussion, and the links, Moein.

    I think it’s helpful to make a distinction between the politics, profit and power perspectives of RCTs as an ideologically-driven preference and the political motivations raised by Kate.

    The ‘randomistas’ (RCT evangelists) have, as you imply, a vested interest in ensuring that their chosen approach is privileged above all others:

    1. RCT specialists win more lucrative contracts when RCTs are privileged over other designs – especially (and this is the problem) in those cases when RCTs are not in fact the best design to infer causation or answer the key evaluation questions.

    2. RCT specialists’ ‘mana’ (prestige, or standing as respected professionals) is enhanced when their approach is privileged above all others and when alternative designs are derogated as being inferior, sloppy, or unscientific.

    There is a lot of ‘turf protection’ wrapped up in all this.

    I do, however, think there is a contrast (but yes, a parallel as well) between (a) politicians’ selective and misleading use of evidence and (b) the refusal by the randomistas to concede that other forms of evidence for causal attribution are at all acceptable.

    In the latter case, RCT advocates do not usually have a vested interest in what the findings are, just in the methodology that is used – or believed, respected, privileged – in the creation of those findings.

    On the other hand, politicians don’t much care what methodology produced the results (except perhaps a preference for the simple and easy to explain), as long as it supports their predetermined position.

    So, with regard to your contention that “we encounter with both political ideology & methodological ideology simultaneity in the case of RCTs.” – yes, it’s the politics and power of what methodology to use, but that’s not the same as the political ideology behind supporting a specific program or policy.

    The other thing I draw from reflecting on the power and politics (and, let’s face it, considerable political success in the States) of privileging one methodology over another is that evidence from non-RCT sources needs to be not just “aggregable” but also “sound-biteable”, i.e. able to be communicated succinctly to politicians, media, and the public, who can’t be swayed with explanations that are overcomplicated …


  • David Earle

    Jane, you said …

    “So, what I’m wondering is: Do we have any good mechanisms for picking up that smaller-scale, local practice-based clinical knowledge, experience and know-how and aggregating it up so that it’s “as a set” stronger than “anecdotal” and can then be used to inform other practitioners about what works? ”

    Again, without being able to answer the question – but may shed more light on the nature of the problem:

    Our local GP is very supportive of having medical students observing his consultations. On one occasion, my wife took our daughter in for something (I forget what) – and our Dr said to the student, “Now this is something you won’t have come across before – because it is generally only seen and treated by GPs”

    What is interesting about this is the idea that there is knowledge and practice within particular practitioner groups which exists independently of the overall professional body of knowledge – because only that group has to deal with it.

    So the RCT-based knowledge in health is dealing with a very narrow and specific part of health treatment. I would hazard a guess that less than 20% of the knowledge base comes from this kind of source and more than 60% is the accumulated experience of practitioners.

  • Patricia Rogers

    In terms of an approach for “picking up that smaller-scale, local practice-based clinical knowledge, experience and know-how and aggregating it up so that it’s “as a set” stronger than “anecdotal” and can then be used to inform other practitioners about what works?” (or what works for whom in what circumstances) I think realist synthesis is a very promising approach.

    A realist synthesis can include all evidence where the specific claims it is making are credible – including credible case studies and other non-experimental data – and focuses on identifying contexts within which specific causal mechanisms operate.

    It’s different to a narrative review of the literature because it is more analytic in its approach to individual studies and to the synthesis process. A lit review will report that ‘some studies found this and some studies found that’, but a realist synthesis will try to explain varying results within and across studies.

    For further information about this approach, a good place to start is

  • “Do we have any good mechanisms for picking up that smaller-scale, local practice-based clinical knowledge, experience and know-how and aggregating it up so that it’s “as a set” stronger than “anecdotal” and can then be used to inform other practitioners about what works?”

    The first step needed, “picking up smaller-scale, local practice-based…knowledge…”, may be worth completing without worrying too much initially about how this local knowledge (what I call “practice wisdom”) might be generalised or aggregated.

    In my experience, after the practice wisdom has been identified, it is often possible to identify ways to assess its validity within a different sample. Not always, but sometimes.

    I have taught a qualitative technique, the Donovan Technique, for decades: it provides one way for practitioners to attempt to clarify and begin to confirm their own “practice wisdom”.

    Unfortunately, this technique is only designed to identify and clarify certain types of “practice wisdom” that are based on having observed the same pattern in 3 to 3 situations.

    A blog is not the place to discuss the details of this technique: my aim here is to suggest that, based on my limited experience, it may be worth seeking answers to separate questions, the first about approaches to identifying and documenting local “practice wisdom” or “knowledge” and the second about how this knowledge might then be generalised or validated.

  • I don’t know if this is relevant, but somewhere … long ago … I read that many if not most surgical interventions, especially the blood-and-guts, ego-driven surgery such as orthopedic surgery, have never been subjected to any independent verification. They get developed by one surgeon and picked up by another and then become standard practice.

  • Patricia Rogers

    Interesting point, Bob. For a report of evidence from RCTs showing no benefit of arthroscopy for osteoarthritis see http://www.jfponline.com/Pages.asp?AID=7419&issue=March_2009&UID=

  • The process for developing and implementing established standards is the formal way evidence is disseminated. In the U.S., the medical profession is a strong community of practice segregated by specialty area, e.g. pediatrics. When there is a new practice, it gets disseminated through email and conferences, and as more physicians implement these new or modified practices, they share information on how things went. This process can disseminate the new practice much faster than you can imagine (diffusion of innovation). You might worry about how doctors decide what to believe and what not to believe, and whether we are their guinea pigs, but there you have it.

    Another process used in the U.S. for “identifying evidence” in settings where there might not be time or resources to do an independent study is to aggregate information on practices across doctors in a hospital, e.g. the rate of cesareans, and then show each doctor’s stats against the average. Then, doctors come together to talk about the variation, and the choices and practices that underlie the standard and deviations. In this way, they create agreement on desired and target practices.
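As a rough illustration of that kind of comparison, the sketch below sets each doctor’s rate against the hospital-wide rate. Everything in it is an assumption made for the example: the doctor labels and counts are invented, the helper name `rate_report` is hypothetical, and “the average” is taken to mean the pooled rate across all deliveries rather than the mean of the individual doctors’ rates.

```python
# Hypothetical counts for illustration only: doctor -> (cesareans, deliveries)
counts = {
    "Dr A": (30, 120),
    "Dr B": (18, 90),
    "Dr C": (45, 110),
}

# Pooled hospital rate: all cesareans divided by all deliveries.
total_cesareans = sum(c for c, _ in counts.values())
total_deliveries = sum(d for _, d in counts.values())
hospital_rate = total_cesareans / total_deliveries


def rate_report(counts, hospital_rate):
    """Return (doctor, rate, gap from hospital rate) rows, largest gap first."""
    rows = [(doc, c / d, c / d - hospital_rate) for doc, (c, d) in counts.items()]
    return sorted(rows, key=lambda row: abs(row[2]), reverse=True)


report = rate_report(counts, hospital_rate)
for doctor, rate, gap in report:
    print(f"{doctor}: {rate:.1%} ({gap:+.1%} vs hospital {hospital_rate:.1%})")
```

A table like this only starts the conversation. It shows who differs from the pooled rate, not why, which is what the follow-up discussion among the doctors is for.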

    What is challenging in developing countries is how frequently there might be two sets of evidence-based standards that co-exist and are pushed by different donors. This really gets confusing in the field! A lot of good work is done in the area of quality improvement to get a handle on this issue, and to help local experts and health professionals take charge of their own “evidence.”