Breaking out of the Likert scale trap

A recent conversation with a colleague has reminded me of how traditional social science training has managed to hardwire our brains into some default thinking that needs to be questioned.

Obviously, there are a lot of places one could go with this as an opening statement, but for now, let’s look at the design of survey questions.

Are we unknowingly being caught in a Likert scale trap?

I mentioned a while ago, in a discussion following a Friday Funny post (Data as “the truth”), that I generally avoid the use of Likert scales because they are evaluatively uninterpretable. Andrew Hawkins asked me why. So, belatedly, here’s why.

Real, genuine evaluation is evaluative. In other words, it doesn’t just report descriptive evidence for others to interpret; it combines this evidence with appropriate definitions of ‘quality’ and ‘value’ and draws conclusions about such things as:

  • the quality of program/policy/product/etc design and implementation
  • the value and practical significance of outcomes
  • whether the entire evaluand was a good (or, the best possible) use of time/money/resources or not

Now, as I’ve mentioned before (in the ‘No Value-Free’ post), many so-called evaluations are what we call ‘value-free’, a.k.a. “evaluations NOT”! They skip this whole evaluative inference step. My view: This is not acceptable. JMHO.

The usual alternative is to take descriptive evidence, often gathered using traditional social science methods, and attempt to interpret it relative to the relevant definitions of quality and value.

This sounds straightforward enough, but it’s actually quite tricky. Over the years, I have worked on some ways of making it easier.

Like, why not build evaluative elements into survey questions themselves?

Building evaluative elements into survey questions

A typical survey/questionnaire might ask questions like:

To what extent do you agree or disagree with the following:

The course was well organized
1 = strongly disagree, 2 = disagree, 3 = neutral, 4 = agree, 5 = strongly agree

OK, so what exactly does a mean of 3.8 (for example) mean? Is that well organized? Excellently? Mediocrely?

In some cases, the mean score (and distribution of scores) for a program is presented alongside the means and standard deviations for a range of programs (this is very common with training programs). This does give an inkling of relative merit; in other words, you can see whether you are generally doing better or worse than ‘the rest of the pack’.

But what if we want to know how good something was in some absolute sense? We might be better than average, but is that actually any good?

Most ‘evaluations’ seem to use the so-called Rorschach inkblot approach (a.k.a. the value-free, or “you work it out” approach). Basically, this means presenting descriptive data such as the above (possibly including comparison data) and letting clients and stakeholders draw their own conclusions about how good the findings are.

If we do want to take it that one step further, to say something explicit about the quality or value of something, how do we do that? Invent cut-offs, e.g. saying that 3.5-3.8 is good, 3.8-4.2 is very good, etc? What would be the basis for these?

Or, what if we ditch the agree-disagree response scale and opt for something that has evaluative terms built right in? Like, for example …

How would you rate the following:

How well the course was organized
1 = poor/inadequate, 2 = barely adequate, 3 = good, 4 = very good, 5 = excellent

Now, it has to be said, simply reporting summaries of participants’ ratings is not, by itself, “doing” an evaluation. The evaluator still needs to draw an overall conclusion.

However, by using evaluative terms right in the questionnaire, the participant ratings become a lot easier to interpret in terms of quality or value.

I use item designs like these primarily for process evaluation – getting a handle on the quality of content, design, and implementation. They are far more evaluatively interpretable than the traditional Likert scale-type items.

Later this week, stand by for ideas for outcome/impact evaluation that build causation right into the items.

12 comments to Breaking out of the Likert scale trap

  • Chris Giffard

    A few years back I read an article by someone who had changed the Likert options considerably. The two extremes would be, for example, “I wish I’d stayed at home”, and “I could happily attend a workshop like this every week”. I’ve been looking for the article for a while now, and can’t find it. Can anyone assist?

  • gilles mireault

    Hi Jane,

    Nice posting. I very much appreciate this kind of simple, straight to the heart, evaluative thinking.

    Recently, I introduced the notion of evaluative questions in a conversation with a colleague and she “clicked” about the difference between a “research” question and an “evaluation” question. I am eager to read more from you on how to be more and more evaluative in our work.

    Thanks for sharing your experience and knowledge.

  • K Fisher

    I think there are good arguments for asking and using systematised stakeholder judgements on the value of elements of the evaluand in question as part of an evaluation. This is not least because meeting stakeholder political/value interests and their contentment is often an intrinsically valued or critical element for success in itself.

    But I’m not convinced that using ‘value’ questions as opposed to ‘descriptive’ questions will make value criteria sufficiently transparent or clear, particularly if used without also asking descriptive questions or doing other data collection. In effect, what you are doing is getting each stakeholder’s overall judgement on a particular question (e.g. sufficient organization) without knowing what criteria each person used to make that judgement. In other words, you don’t know what they think comprises their judgement, nor how diverse the criteria and standards were behind judgements that ended up rated the same. This difference may be politically overt, or it could be an as-yet-unrecognised difference among stakeholders. In some cases knowing about these differences may not matter, but in others it could be rather important – particularly if improvements are to be made.

    I think one of the valuable things about developing ranking scales (say for organizational purposes) is that it forces people to consider what elements comprise their judgements of amorphous, ambiguous or contentious ‘values’ such as ‘quality’. It can be valuable to do this in a participatory way because the process itself could identify many of the areas of contention – different perspectives and priorities. But if standardisation of judgements is what is wanted then a good aid for this in organizations is to use case exemplars assigned to the levels of each criterion. This helps to discipline and harmonise the Likert Scale judgements and over time create a social practice norm.

  • Catherine Nelson

    Thanks Jane for this thoughtful, practical and very timely posting. Causing me to re-think a survey I am developing right now!

  • David Earle

    I tend to agree with K Fisher on this. Ranking scales have their place, if used wisely, within the evidence that supports an evaluation.

    Some years ago, I was involved with a series of projects eliciting customer feedback on service delivery. We used interviews and focus groups, and added a short survey with ranking scales to the last of the projects. We were surprised at the high rankings that people gave to the level of service they received – which contrasted with their rather harrowing descriptions of difficulties they had encountered at times. This had a big impact on our evaluative conclusions.

    I do have a problem with ranking scales being used to get to single, summary judgements about a programme or service, and with some of the ways they are analysed statistically – which is what Jane is getting at.

    However, they can be a good way of getting standardised responses on specific aspects of the programme or service from a large number of people. In my view, they work best if the questions are descriptive, cover important aspects of the programme or service, and have been tested with respondents beforehand – and if they are used as one of several sources of information.

    Jane also raised the issue of quantifying the results. There are (at least) two ways of going about this. One is using a short scale (5 or fewer response levels) – in which case the responses should be analysed as ordinal, categorical data. Assigning numbers and averaging them is not particularly valid. The other is using a longer scale, with numbers (e.g. 1 to 10, or 0% to 100%) and descriptors at either end. These can then be treated as continuous variables. The choice really depends on the questions, the circumstances and the analysis to be undertaken.
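    The two approaches can be sketched in a few lines of Python (the response data and choice of summary statistics here are mine, purely for illustration):

    ```python
    from collections import Counter
    from statistics import mean, median

    # Short 5-point scale: treat as ordinal, categorical data.
    # Report the distribution and median rather than an arithmetic mean.
    short_scale = [4, 5, 3, 4, 2, 4, 5, 4]
    print(Counter(short_scale))   # how many respondents chose each level
    print(median(short_scale))    # a defensible "typical" response

    # Longer 0-100 scale with descriptors at either end:
    # close enough to continuous that a mean is reasonable.
    long_scale = [70, 85, 60, 90, 75]
    print(mean(long_scale))
    ```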

    Another thing to consider is that these scales have been so overused in some places (such as customer service and training) that there is a high level of respondent cynicism about them. So finding fresh ways of collecting data is likely to give much better insights.

  • Many thanks to all for your responses!

    Chris, I always love a response scale with a bit of entertainment value, and very much like the sound of that one! See also a previous Friday Funny: ‘No Value-Free’ in college course evaluation forms.

    K Fisher and David correctly point out that global judgments from participants and other stakeholders may not always dig deeply enough under the evaluative terms to really get a sense of what drives the judgements.

    How far one needs to dig under the criteria (e.g., what ‘well organized’ means for a training workshop) depends on what is being evaluated and how much the details matter. For a simple workshop evaluation, one question will often suffice (after all, there are many more to ask, and space is at a premium); for a more extensive program or policy, it may be very important to get a clear sense of what lies underneath people’s thinking about quality. [My hunch is that qualitative methods will be important if that’s the case.]

    Just to clarify something I wasn’t clear on, I’m not suggesting that global ratings from stakeholders are all that should be used; just that they can often be an improvement on the traditional (and IMHO often deathly boring) agree/disagree items, which I find very hard to interpret evaluatively. [Perhaps I am alone in this? I suspect not.]

    Also not clear in hindsight was what I mean by the term ‘descriptive’. I mean non-evaluative, i.e., not asking directly about quality or value. It’s perfectly possible (and often very useful) to have evaluative survey items that contain a lot more information/detail, that flesh out what is meant by the evaluative terms used. Evaluative rubrics are a good (if extreme) example of this – and you all remind me I probably need to do a few posts on that topic!

    As a side note, I call the example I gave a ‘rating’ scale (where ‘rating’ is a synonym for ‘grading’) because it uses ‘absolute’ rather than ‘relative’ quality/value terms.

    A ‘ranking’ scale is IMHO quite different, using terms that refer to relative value (e.g. well below average, below average, average, above average, outstanding).

    I realize these terms are often used interchangeably by lay people, but think it’s worth keeping the distinction clear, as evaluators – in the same way as we shouldn’t confuse complicated and complex evaluands, outputs and outcomes, or formative and process evaluation.

  • David Earle

    You are quite right Jane, we are talking about rating scales here. I had misused the term ranking scale. Although my take is that a rating scale is any ordered scale that can be represented as grades from “good” to “bad” and is applied to one item at a time, while a ranking scale involves asking respondents to assign rank values to a set of items.

    On the main point of the blog, I am still not convinced that asking for an overall evaluative rating of a programme gives any more interpretable information. But then I am more interested in why people come to their judgement than in the judgements they come to. So that may be an inquiry bias on my part.

  • Thanks for the post – you articulate the need for ‘valuing’ so well. I have been trying to help some folks stay away from the Likert scale because it does not embed the ‘quality’ and ‘value’ realm of evaluation, thereby skewing how conclusions are drawn. The indigenous populations I work with like to evaluate their programs based on their values. Hence, the ‘strongly dislike to strongly like’ thinking seems alien and individualist, as it does not consider the inclusiveness of the community and what they value as a ‘whole’. Depending on what I am evaluating, I now use value-added criteria such as the ones you propose. I find it useful to define together with the group what they mean by ‘poor to excellent’ and ‘impact to substantial impact’. While these processes take time, they are worth every penny spent by the program.

  • Very nicely explained. I now have second thoughts about using Likert scales.

  • Laurie Gavrin

    I have for five years now wanted to use an affect / activation measurement as is commonly found in mood research: do you feel positive or negative about an experience and do you feel strongly about it or not: a strong positive feeling would be elation, a weak positive feeling would be contentment. A strong negative feeling would be anger; a weak negative feeling would be sadness. If one is running a training session, and the goal is to move people to action, the assessment should attempt to capture whether the participants are energized to go forth and apply what they’ve learned.

  • Thank you for this great contribution. However, the suggested label categories are not balanced: with three positive and two negative categories, the likelihood of a positive result tends to be higher. To prevent that bias, I would suggest changing the proposed labels from:

    – “poor / inadequate”
    – “barely adequate”
    – “good”
    – “very good”
    – “excellent”

    to:

    – “very poor/very low”
    – “poor/low”
    – “regular”
    – “good”
    – “very good”.

    One should also include the categories “Not sure, I don’t know”, “Not applicable” and sometimes “I do not want to answer”, depending on the questions. The numeric scale can integrate these new categories depending on the question (e.g., answering not applicable or reporting not to know the action under evaluation can also indicate the quality of its outreach and impact).

    The numeric intervals should also be balanced. For a scale from 1 to 5 (1 being the worst case, as in the article, or the other way round), the intervals produced by the function “cut” in R (a statistical computing language) are:

    (0.996,1.8] (1.8,2.6] (2.6,3.4] (3.4,4.2] (4.2,5]

    This would be equivalent to:

    from 1 to 1.80: “very poor/very low”
    from 1.81 to 2.60: “poor/low”
    from 2.61 to 3.40: “regular”
    from 3.41 to 4.20: “good”
    from 4.21 to 5.00: “very good”
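    That mapping is easy to reproduce outside R as well; here is a minimal Python sketch of the same binning (the function name is mine, and scores are assumed to lie in [1, 5]):

    ```python
    from bisect import bisect_left

    # Upper edges and labels of the five equal-width bins listed above.
    edges = [1.80, 2.60, 3.40, 4.20, 5.00]
    labels = ["very poor/very low", "poor/low", "regular", "good", "very good"]

    def label_for(score):
        """Map a mean score in [1, 5] to its label (upper edges inclusive)."""
        return labels[bisect_left(edges, score)]

    print(label_for(3.8))   # "good" -- one answer to the article's 3.8 question
    ```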

    In the survey introduction, it can help to make survey participants aware about the risk of providing biased answers. I usually use the following paragraph for that:

    “Respondents in such questionnaires sometimes repeat the same answers for different questions, mark extreme answers trying to be polite or as form of calling attention to a specific aspect, or even rate items in the middle categories in order to keep neutrality when they are actually thinking something else. Please avoid this as much as you can, as it prevents us from understanding the real situation.”

    Thank you again for the contribution! I benefited a lot from it and I thought it would be good to try to contribute as well.

    Best regards from Bremen