With apologies to all for our little bit of downtime over the weekend while we changed servers …

Here’s an interesting snippet that came through on a listserv recently from industrial/organizational psychologist Paul Barrett, who spotted a recent review by Olle Häggström of Ziliak and McCloskey’s (2008) book **The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives**. The review appeared in the Notices of the American Mathematical Society, which Paul quoted:

A major point in *The Cult of Statistical Significance* is the observation that many researchers are so obsessed with statistical significance that they neglect to ask themselves whether the detected discrepancies are large enough to be of any subject-matter significance. Ziliak and McCloskey call this neglect sizeless science. They exemplify and discuss instances of sizeless science in, among other disciplines, medicine and psychology, but for obvious reasons they focus most of their attention on economics. In one study, they have gone over all of the 369 papers published in the prestigious journal American Economic Review during the 1980s and 1990s that involve regression analysis. In the 1980s, 70 percent of the studied papers committed sizeless science, and in the 1990s this alarming figure had increased to a stunning 79 percent. A number of other kinds of misuse of statistics are considered in the same study, with mostly equally depressing results.

Not all statisticians have been convinced by these arguments, of course. Olli Miettinen, reviewing the book in the European Journal of Epidemiology, agreed that quantification was needed but argued this was only relevant AFTER statistical significance had been established.

I haven’t read the original book, so won’t leap into a debate about which review has got it right with respect to the book’s claims, but I did think that the points raised in Häggström’s review were worth pondering as they might apply to evaluation.

In genuine evaluation, ascertaining the practical significance of outcomes – and being able to answer the all-important “are they sizeable enough?” question – are absolutely central. **How good is a “good” outcome for a program, policy, initiative, product or service?** I’ve been puzzling over this one a lot recently, and will post on this later in the week. One thing we do know is that standard applied social science research methods *don’t*, by themselves, provide the answer.

Paul Barrett had some interesting reflections on the state of applied psychology and how much “sizeless science” it produces. What if we pondered these same comments with respect to evaluation?

As the “quantitative psychology” discipline quietly continues its decline out of sheer boredom and indifference to its paltry return on investment, its proponents might like to reflect on why they persist in producing Cargo Cult and “sizeless science”.

Investigative science and learning how to answer really awkward problems and questions can be really exciting; the mindless rituals served up in undergraduate and graduate psychology degree methods courses are inexcusable.

It’s time for a new university degree in psychology, one that educates in both content and, just as importantly, in how to approach answering interesting but challenging questions in a non-quantitative science. Rituals just don’t cut it in the 21st century.

One would hope that a key element of evaluation is grappling with what Paul calls “awkward problems and questions” and what we might call the “value” questions – but how well do we really do it? Are we, as a profession, plagued with some of the same problems, particularly when we limit our methods to those that have come out of the social sciences? Is the same true in both qualitative and quantitative approaches to evaluation? Does mixed methods evaluation hit the mark any better?

Thoughts?

Related posts:

- Why genuine evaluation must be value-based (Jane Davidson)
- The importance of values for substantiating evaluative conclusions (Tererai Trent)
- How good is a “good” outcome? (coming later this week, from Jane)

I believe my purported disagreement with Miettinen may be less dramatic than you suggest, as evidenced by the following passage a little further into my review:

Sometimes the authors push their position a bit far, such as when they ask themselves: “If null-hypothesis significance testing is as idiotic as we and its other critics have so long believed, how on earth has it survived?” (p 240). Granted, the single-minded focus on statistical significance that they label sizeless science is bad practice. Still, to throw out the use of significance tests would be a mistake, considering how often it is a crucial tool for concluding with confidence that what we see really is a pattern, as opposed to just noise. For a data set to provide reasonable evidence of an important deviation from the null hypothesis, we typically need both statistical and subject-matter significance.

Many thanks, Olle, for chiming in on this – and for sparking a good discussion topic with your review. Thanks, too, for the clarification on your position.

A related discussion that pops up a lot in evaluation is the appropriateness in most evaluation work of the default levels of certainty for significance testing that are generally applied in academic research. Depending on the context, most decision makers I run into are looking for something closer to a ‘balance of evidence’ standard of proof that an effect or pattern is real, rather than a ‘beyond reasonable doubt’ standard (p < .05).

The other challenge with applying statistical significance in our discipline is that sound evaluation practice generally uses a range of evidence to draw evaluative conclusions about key outcomes (or other aspects of quality/performance). This often includes a mixture of both qualitative and quantitative evidence, and not all of that evidence lends itself to statistical significance testing – not even all of the quantitative data. Even if the quantitative data on a particular measure don’t quite attain statistical significance (at whatever level of certainty is appropriate for the decision-making context), the additional evidence from other sources may reduce that uncertainty and lead to a conclusion that the effect in question is very probably real.

If it’s real, is it also substantial, valuable, well worth the investment devoted to producing it? That’s the more challenging question in most cases … These are topics of great interest to evaluators, particularly the question of how we draw systematic and transparent conclusions about practical significance, sometimes referred to as ‘interocular’ significance (the kind that hits one right between the eyes). More on that in another post tomorrow.
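The decision-standard point can be made concrete with a toy calculation. In this sketch, all numbers are hypothetical, and ‘balance of evidence’ is loosely mapped to a more permissive alpha. The same data fail the conventional p < .05 bar but clear a looser one:

```python
import math

def two_sided_p_for_proportion(successes, n, p0=0.5):
    """Two-sided normal-approximation z-test for a single proportion."""
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se
    # two-sided p-value from the standard normal CDF (via the error function)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical program: 57 of 100 participants improved, vs. a 50% benchmark.
p = two_sided_p_for_proportion(57, 100)

print(round(p, 4))                  # ~0.1615
print("reject at .05?", p < 0.05)   # False: fails 'beyond reasonable doubt'
print("reject at .20?", p < 0.20)   # True: passes a looser 'balance of evidence' bar
```

The verdict flips purely on the choice of threshold, which is exactly why that choice should be driven by the decision-making context rather than by convention.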

It might come as a bit of a surprise to some, but I spent my formative years as a statistician. Indeed, I think I was one of the first students ever to get a Statistics A level at high school in Britain. I absolutely loved statistics until one day, working as a research ecologist on a gnarly problem to do with the likely impact of a large coastal development on migratory birds, I asked myself “what would happen to my conclusion if truth happened at the 80% confidence limits, not 95%?”. That set me off on a journey that ultimately led me to the position of a statistical cynic. Remember, I had been trained to death to believe that science combined with statistics led to “truth”. But it dawned on me that the 95% – the one-in-20 probability – that determined “truth” was entirely arbitrary. Someone, somewhere invented it. Truth was a convention. Yes, I know that over the water, across the Channel, German and French philosophers had nailed that a decade before, but I was a lad in a lab in Norfolk peering through a microscope at bird food, not at the feet of Derrida at the École Normale. I started to ponder what aspects of life commonly and scientifically considered “untrue” or “unproven” would be considered otherwise if the line was moved one way or the other. Would we have had someone on the moon earlier, or later? Dangerous musings for a young scientist. Too dangerous: I became a community worker and environmental activist – areas where truth is much more negotiable.

But in the three decades since, I’ve never been able to find out just how that statistical convention arose and how it became the dominant proxy for “truth” in science.

Bob, I think this is the main problem with the usual way in which statistical significance testing is used in science and social science as a black-and-white accept/reject decision about the null hypothesis. To me, it’s just one piece of information in the evidence jigsaw to try and figure out whether an effect or aspect of performance/quality is worth writing home about. It’s useful to have a rough idea how likely it is that the apparent effect is just “noise” in the data – but all this depends on how certain we need to be and what other evidence points to the effect being real and not an illusion. And all that only answers the relatively simple question of whether the effect is “there”, not whether it was valuable or worthwhile …

Perhaps it would help to formally link analysis of statistical significance to risk management in developing responses to evaluation data.

In some situations, it is more important to avoid a Type I error (a False Alarm – thinking there is a real difference when it’s just sampling error). This might be particularly important when analyzing large data sets where even a small difference can be statistically significant but not clinically significant.

In other situations, when we are searching for some sites which are performing better so we can learn from them, it might be more important to avoid a Type II error (a Miss – dismissing something as probably due to random error and not realizing it is a real difference) – at the very least, identifying possible sites for further investigation and possible documentation of better practices.
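The trade-off described above can be sketched with a small Monte Carlo simulation (all numbers hypothetical: a true effect of 0.60 vs. a null of 0.50, with n = 100 per study). Loosening alpha buys fewer misses (higher power) at the price of more false alarms:

```python
import math
import random

random.seed(0)

def p_value(successes, n, p0=0.5):
    """Two-sided normal-approximation test for a single proportion."""
    z = (successes / n - p0) / math.sqrt(p0 * (1 - p0) / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def rejection_rate(alpha, true_p, n=100, trials=2000):
    """Fraction of simulated studies in which H0 (p = .5) is rejected."""
    rejections = 0
    for _ in range(trials):
        successes = sum(random.random() < true_p for _ in range(n))
        if p_value(successes, n) < alpha:
            rejections += 1
    return rejections / trials

for alpha in (0.05, 0.20):
    type1 = rejection_rate(alpha, true_p=0.50)  # false alarms: no real effect
    power = rejection_rate(alpha, true_p=0.60)  # 1 - Type II rate: real effect
    print(f"alpha={alpha}: false-alarm rate ~{type1:.2f}, power ~{power:.2f}")
```

At alpha = .20 the simulation catches substantially more of the real effects than at .05, but also raises the alarm about four times as often when nothing is going on – which error matters more is a risk-management question, not a statistical one.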

The other area where significance testing is often misapplied is when it’s used to analyze data from a census (either a formal government census, or any survey covering an entire population). Does it make sense to use statistical significance tests – or should the results simply be reported as they are (assuming the measurement and response rates have been adequate)? The counter-argument is that even census data can be thought of as sampling a particular time frame, so the statistical significance test is looking at the chance that a random fluctuation at that time could misrepresent the real situation. Since many of the comparisons we’re looking at vary over time, I’m not entirely convinced by that argument. How do you handle reporting census data?
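For what it’s worth, the “census as a sample of a time frame” counter-argument can be illustrated with a toy simulation (population size and rate entirely hypothetical). Even complete enumeration fluctuates year to year when nothing real has changed:

```python
import random

random.seed(3)

# Hypothetical: a full population of 50,000 is censused every year, with a
# constant 2% underlying incident rate (i.e., no real change over time).
def annual_census_count(population=50_000, rate=0.02):
    return sum(random.random() < rate for _ in range(population))

counts = [annual_census_count() for _ in range(5)]
print(counts)  # counts differ year to year despite an unchanged underlying rate
```

The year-to-year spread here is pure process noise, which is what a significance test on census-to-census comparisons would implicitly be assessing.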

Hello Jane/Olle/Bob

For me, there are two issues here:

1. The useful practical role of data model statistics

2. The not-useful role of data model statistics in investigation of cause.

That #2 point really comes down to whether you view causality as deterministic, or probabilistic. I’m very definitely arguing the former as the only sensible view of causality.

The problem is that many who are embedded in the “statistics is all” world of the social sciences misunderstand my position – as evidenced in this pithy reply on SEMNET (the structural equation modeling listserv), whose proponents treat their statistical models as “worldly causal” models.

“In your numerous critiques and dismissals of statistical practice in the social sciences, you seem to directly blame the methods and software more so than their misapplication. Yet, the author of the review you cite seems to make this important distinction, and some of the comments suggest Haggstrom to be a Bayesian himself. However, your posts often bring the words “baby” and “bathwater” to mind. :)”

My response …

“Ah Cam, you are like John Antonakis – so truly embedded in the realm of statistical data models that you cannot see what I’m driving at .. it is not the methods, the software, nor their misapplication – but the application ‘in toto’ of methods designed to describe aggregate ‘behaviors’ being used to model “cause”. The very point made by the late David Freedman.

Using aggregates is fine and dandy when you want to make claims about aggregates (as I do many times in the commercial and decision-theoretic world), but hopeless when you want to make claims which are meant to apply to all members of a specified population or species (such as a causal model). But then, we do not even agree on what constitutes a “causal model”.

And that’s just how big the difference is between us – and why I agree with Joel Michell (1999) that psychometrics is a pathology of science. Quibbling over methods or their misapplication is the very least interpretation of that particular word “pathology”.

One way I tried to bring home the futility of aggregate data model statistics was to use an example from actuarial risk prediction of sex-offence recidivism, within forensic psychology:

There is a very fine paragraph in a paper by Stephen Hart (2003), in a response to a target article by Berlin et al (2003) …

“Probability estimates based on group data may not reflect the “true” probabilities for any individuals in the group – the same way that the mean test score for a group of people may be a score obtained by no one in the group. The only times when group data are clearly applicable to individual cases is when the probabilities approach either null or unity (i.e., 0 or 100%). Consider a group of 100 offenders, 50 of whom recidivate within 5 years. Does this mean that every member of the group had a 50% chance of recidivism? Or that half had a 100% chance and half a 0% chance? Or perhaps 25 had a 100% chance, 25 had a 75% chance, 25 had a 25% chance, and 25 had a 0% chance…. There is simply no way to determine the answer at this time.”

While not wishing to dwell on the sensible counter-arguments for the practical use of actuarial (statistical) evidence in the context of jurisprudence – see Harris (2003) – the statement by Hart does bring home that cause and probability do not sit well together, unless the probabilities are very high and justified by powerful explanatory theory.
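Hart’s observation, that a single aggregate rate is compatible with radically different individual probability structures, can be seen in a small simulation (cohort sizes and probabilities hypothetical, following his 100-offender example). Two cohorts with very different individual risks produce indistinguishable aggregate recidivism rates:

```python
import random

random.seed(1)

def observed_rate(individual_probs, trials=10_000):
    """Average observed recidivism rate over many simulated runs of one cohort."""
    total = 0
    for _ in range(trials):
        total += sum(random.random() < p for p in individual_probs)
    return total / (trials * len(individual_probs))

# Hart's hypothetical: 100 offenders, a 50% aggregate rate, and two very
# different underlying structures that could have produced it.
uniform = [0.5] * 100              # everyone has a 50% chance
split   = [1.0] * 50 + [0.0] * 50  # half certain to recidivate, half never

print(round(observed_rate(uniform), 3))  # ~0.5
print(round(observed_rate(split), 3))    # 0.5 exactly
```

Both cohorts yield the same aggregate rate, so the aggregate data alone cannot tell us which structure generated them – which is precisely Hart’s point about applying group probabilities to individuals.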

-references-

Berlin, F.S., Galbreath, N.W., Geary, B., McGlone, G. (2003) The use of actuarials at civil commitment hearings to predict the likelihood of future sexual violence. Sexual Abuse: A Journal of Research and Treatment, 15, 4, 377-382.

Harris, G. (2003) Men in his category have a 50% likelihood, but which half is he in? Comments on Berlin, Galbreath, Geary, and McGlone. Sexual Abuse: A Journal of Research and Treatment, 15, 4, 383-388.

Hart, S.D. (2003) Actuarial Risk Assessment: Commentary on Berlin et al. Sexual Abuse: A Journal of Research and Treatment, 15, 4, 383-388.

The big issue in SEM (structural equation modeling) is that the use of a single model-fit statistic (chi-square, P > 0.05) is rather forcefully argued as evidence that a fitting model is an explanatory causal model.

My point was that if the following statement might be considered valid, as Judea Pearl has been paraphrased … “Causality, in Pearl’s view, is a matter of individuals–a manipulation of variable x produces a change in variable y for every individual; otherwise, the model is wrong.”, then utilizing any statistical model which “explained” aggregate parameterized outcomes rather than individual outcomes was a pointless exercise IF the goal was to make claims about causality. This is simply not accepted by SEM modelers.

Anyway, it was in that context that the “sizeless” science seemed to be an interesting additional statement on how model-fit/statistical significance has now become the gold-standard, with predictive accuracy and substantive effects as “optional”. It is as Leo Breiman argued in 2001 (Breiman, L. (2001) Statistical Modeling: the two cultures. Statistical Science, 16, 3, 199-231).

The reality is that for most practical work where you would like to make generalizations or predictions, you don’t need data-model statistics at all – just some idea of what you are looking for, what the generating mechanisms might be, and a healthy dose of bootstrapping/randomization and cross-validation/replication to let you gauge the likely magnitude range of “chance effects”. Sometimes a hypothetical sampling distribution is useful, relevant, and efficient. But in the social sciences, dealing with attributes which are not even quantitatively structured .. and whose generating systems are complex-interactive in nature?
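As one illustration of the randomization approach Paul describes, a minimal permutation test needs no sampling-distribution theory at all: just shuffle the group labels and see how often chance alone reproduces a difference as large as the one observed. All data below are hypothetical:

```python
import random

random.seed(2)

def permutation_p(a, b, trials=5000):
    """Randomization test: the share of label-shuffles producing a mean
    difference at least as large as the one actually observed."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    extreme = 0
    for _ in range(trials):
        random.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    return extreme / trials

# Hypothetical outcome scores for two small groups:
treatment = [12, 15, 14, 16, 13, 17, 15, 14]
control   = [11, 12, 13, 11, 12, 14, 12, 13]

print(permutation_p(treatment, control))  # small: the gap is unlikely to be noise
```

The same shuffling machinery doubles as a rough bootstrap for gauging the magnitude range of chance effects, with no appeal to a hypothetical sampling distribution.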

Anyway, just a bit of background!

Regards .. Paul

It depends on the purpose of your inquiry. If you wish to explore the contextual complexity from which values, principles, and truths naturally emerge, then I suspect that you will want to look at the overlapping “hybrid” space located between binary oppositions (e.g., the Third Space in Gutiérrez, 2008). That is where I discovered them. However, if you focus exclusively on the binary oppositions themselves as separate entities, then I suspect that you will eventually find yourself on the path to sizeless science where perception eventually becomes reality. I’m afraid that is the state of public education in the U.S. today.

Source: Gutiérrez, K. (2008). Developing a sociocritical literacy in the Third Space. Reading Research Quarterly, 43(2), 148–164.

Jane, you are absolutely right that the appropriate choice of significance level is context-dependent. However, it is very rarely the case that translating p < .05 into “beyond reasonable doubt” is appropriate. As a general translation, something like “data suggest something might be going on here, worth investigating further” would be better. At what point I’d be prepared to use language like “beyond reasonable doubt” again depends on circumstances (how much is at stake, what we have prior reasons to expect, etc.), but typically perhaps around p < 0.0001.

Olle

To determine a truth “beyond reasonable doubt” will always be relative to the perspective of the observer. It is only a half-truth because the truth of the observer is fundamentally imprecise and uncertain. I am, of course, referring to the uncertainty principle, which refers not to the uncertainty of physical properties but rather to the nature of systems themselves. Sizeless science can be boiled down to one proposition: we have become so focused on the particle that we have neglected the wave.

Chad

I did a search on Amazon for the book, and the reviews expose significant biases in the writing and approach – they are definitely worth reading before deciding to buy the book:

http://www.amazon.com/Cult-Statistical-Significance-Economics-Cognition/dp/0472050079/ref=sr_1_1?ie=UTF8&s=books&qid=1288144385&sr=8-1

I replied to the second part of this post. My main point, as raised above, is that 95% is not fixed. The appropriate confidence level depends on both the data and the context. If you are working with small data sets and contexts where the “truth” can be approximate, then 80% or 90% may be appropriate. If you are working with larger data sets and contexts where certainty is important (e.g. drug trials), then 99.9% may be more appropriate.

Also, I would be alarmed at giving up significance testing. Particularly with small samples, it can be very important for discounting apparently plausible and important patterns in the data, and it prevents conclusions from being drawn that are not supported. The ‘null hypothesis’ question is: do we have enough evidence to disprove the ‘status quo’ or ‘no change’ assumption? And in evaluation we do need to be able to say we don’t know whether there was change or not – rather than base conclusions on weak data.

The important issue raised here is that we should look at the whole picture – not just the p-values. So in reading ANOVA or regression results, we should be looking at both the size and the significance, and be able to interpret both the importance and the certainty of the change.

And beyond this are much deeper issues of model specification and misspecification; misinterpretation of odds and probabilities and use of questionable constructs!

Whilst on the subject of books, I came across a review today of “Proofiness: The Dark Arts of Mathematical Deception”. An obvious take on Stephen Colbert’s notion of truthiness, it seems to cover similar ground to the Ziliak and McCloskey book, but is more focused on our reverence for anything with a number attached. I have to say that I still cling to my battered copy of Reichmann’s Use and Abuse of Statistics; a little gem that I bought back in the late 60s during my early encounters with the trade.