Usability testing myths
Rolf Molich debunks some commonly held misconceptions about the efficacy and appropriate application of usability testing
Usability testing is by far the most widely used usability evaluation method. Nonetheless, it’s often conducted with poor or unsystematic methodology and so doesn't always live up to its full potential. This article presents five controversial beliefs about usability testing and discusses whether they are myths or whether there is some useful truth to them. The discussion leads to practical advice on how to conduct better, faster and cheaper usability tests.
I have listed the beliefs in the table below. Before you read on, I suggest that you pause and deliberate. Please mark your opinion about each of these beliefs. Which are correct, and which are myths?
| Beliefs | Correct | Partly correct | Myth |
|---|---|---|---|
| 1. Five users are enough to catch 85% of the usability problems in practically any product | | | |
| 2. The main goal of a usability test is to discover usability problems | | | |
| 3. Usability tests provide results that are more reliable than those from expert reviews | | | |
| 4. Positive comments in a usability test report are useless because they are not actionable | | | |
| 5. Usability testing can be conducted by anyone | | | |
The Comparative Usability Evaluation (CUE) studies
The results reported in this article are based on Comparative Usability Evaluation studies and the author's experience from conducting quality assurance of professional, commercial usability tests.
In a Comparative Usability Evaluation study, teams of experienced usability professionals independently and simultaneously conduct a usability study using their preferred usability evaluation method (most often usability testing or expert review). Their anonymous test reports are distributed to all participants, compared and discussed at a one-day workshop. Websites that we've tested include Hotmail.com, Avis.com and the website for the Hotel Pennsylvania in New York.
The first CUE study took place in 1998. So far, nine CUE studies have taken place with more than 120 participating usability professionals. The number of participating teams has grown from four in CUE-1 to 35 in CUE-9.
The key purpose of the CUE studies is to determine whether results of usability tests are reproducible. Another purpose is to cast light on how usability professionals actually carry out user evaluations. After the first studies clearly showed that usability test results are not reproducible, we started to investigate why and what could be done to make results more similar.
More information about the CUE studies is available at DialogDesign.
1. Five users are enough to catch 85% of the usability problems in practically any product
Myth. The CUE studies consistently show that the number of usability problems in most real-world websites is huge. Most CUE studies found more than 200 different usability problems for a single state-of-the-art website. About half of them were rated serious or critical.
Few teams reported more than 50 problems, simply because they knew that a report listing more than 50 problems is unusable. Many teams tested eight or more users and reported 30 or fewer problems, just a fraction of the actual number of problems.
Even in CUE studies with 15 or more participating teams, about 60 per cent of the problems were uniquely reported. Chances are that if we had deployed 100 or even 1,000 professional teams to test a website, the number of usability problems found would have increased from 200 problems to perhaps 1,000 or more.
So when you conduct a usability evaluation of a non-trivial website or product, most likely you will only find and report the tip of the iceberg – some 30 random problems out of hundreds. Even though you can't find or report all problems, usability testing is still highly useful and worthwhile, as I will explain later in this article.
This myth is hard to defeat, partly because the usability guru Jakob Nielsen kept promoting it until a few years ago. His website still says it, at least indirectly, in a graph.
I agree with Jakob that you only need to test with five users. But the correct reason is that five users are enough to drive a useful iterative cycle. In other words, once you have conducted five test sessions, stop testing and correct the serious problems you have found. Then conduct additional test sessions if time and money permit.
Never claim that testing will reveal all usability problems in a non-trivial product.
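The 85 per cent figure is usually traced to a simple discovery model in which each test user independently reveals any given problem with a fixed probability p, so that n users reveal a share 1 - (1 - p)^n of the problems. The sketch below assumes that model and the commonly quoted p = 0.31. The lower values of p are hypothetical and are not CUE measurements; they only illustrate how quickly the share drops when a problem is seen by few users.

```python
def share_found(p: float, n: int) -> float:
    """Expected share of problems seen at least once by n users,
    assuming each user independently reveals a problem with probability p."""
    return 1.0 - (1.0 - p) ** n


# p = 0.31 is the commonly quoted value; 0.10 and 0.05 are hypothetical
# values for problems that only a few users run into.
for p in (0.31, 0.10, 0.05):
    print(f"p = {p:.2f}: five users find {share_found(p, 5):.0%}")
```

Under these assumptions the model gives roughly 84 per cent for p = 0.31 but well under half for the smaller values, which is one way to see why the claim does not carry over to "practically any product".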
2. The main goal of a usability test is to discover usability problems
Myth. The primary reason for conducting usability tests should be to raise awareness among stakeholders, programmers and designers that serious usability problems exist in their own product. Some development teams believe that usability problems only exist in other people's products.
Of course, we also conduct usability tests to discover usability problems so they can be corrected. But usability testing is an expensive way of finding usability problems, so finding problems should not be the only purpose for conducting a usability test.
Use usability tests to motivate your co-workers to take action to prevent usability problems.
3. Usability tests provide results that are more reliable than those from expert reviews
Myth. The CUE-4, 5 and 6 studies compared results from usability tests of a website to results from expert reviews of the same website. The studies clearly showed that there were no significant differences in result quality. Actually, results from expert reviews were slightly better and cheaper to obtain than usability test results.
It's a myth that usability testing is the gold standard against which all other methods should be compared. The CUE studies have shown that usability tests overlook problems, even serious or critical ones, just like all other usability evaluation methods. There are several reasons why usability tests are not perfect. One of them is that they're often conducted with poor or unsystematic methodology. Another one is that the test tasks often do not adequately cover all important user tasks. A third reason is that test participants frequently are not representative.
This CUE result applies only to reviews conducted by true usability experts. It takes many years (some say 10 or more) and hundreds of usability tests to gain the experience and humility necessary to conduct fully valid expert reviews.
In an immature organisation, inconvenient expert review results may be brushed aside by the question, "These problems are your opinion. In my opinion, users would not have any difficulties with this. Why are your opinions better than mine?" Usability tests are much better than expert reviews at convincing skeptical stakeholders that usability problems exist. It's easy to question opinions. It's hard to observe one representative user after the other fail a task completely without admitting there are serious usability problems.
Expert reviews are valuable, but they are also politically challenging.
4. Positive comments in a usability test report are useless because they are not actionable
Myth. Do you appreciate occasional compliments about your work? Sure you do. That's why usability test reports should contain a balanced list of both positive findings and problems.
Usability test reports most often tell inconvenient truths. Substantial, positive findings serve at least two purposes: they prevent features that users actually like from being removed and they make it easier to accept the problems. Remember: even developers have feelings.
At least 25 per cent of the comments in a usability test report should be positive.
5. Usability testing can be conducted by anyone
Correct. Anyone can sit with a user and ask the user to carry out some tasks on a product. So, in principle, anyone can do usability testing.
But quality usability testing that delivers reliable results is different. Quality usability testing requires skills like empathy and curiosity. It requires profound knowledge of recruiting, creating good test tasks, moderating test sessions, coming up with great recommendations for solving usability problems, communicating test results well, and more. The CUE studies have shown that not every usability professional masters these skills.
We need to focus more on quality in usability testing.
Better, faster, cheaper!
Project managers tell me that, just like everyone else, usability evaluators need to improve their efficiency. My suggestions are:
Better
Today, a usability test should be considered an industrial process. Gone are the days when a usability test was a work of art and beyond criticism. Industrial processes are controlled by strict rules that are written down, reviewed and observed.
Strict rules enable quality assessment of our work. As responsible professionals we should welcome this.
Faster
Test with between four and six test participants. More test participants are a waste of time since it's an elusive goal to find 'all' problems – or even all critical problems.
Focus on essential results. Write short reports that can be released quickly, ideally within 24 hours after the final test session.
Cheaper
You can't control what you don't measure. Measure the cost and productivity of your usability tests. Keep a timesheet so you always know exactly how much a usability test costs your company. Compare your productivity and quality with those of your peers.
It's hardly ever cost-justified to have two or more specialists working on a usability test. You could argue that a usability specialist may overlook important problems that a co-worker would notice, but the CUE studies show that, even if you deploy 10 experienced usability specialists, important problems will be overlooked.
Consider carefully whether an expensive usability lab is cost-justified when two ordinary meeting rooms with inexpensive TV equipment will make it possible to observe usability tests equally well.
Consider remote or even unattended usability testing to reduce costs. Recent CUE studies indicate that these methods work almost as well as traditional usability testing.
Prevention is better than cure
My personal experience is that about half of the problems I find from usability testing are violations of simple usability heuristics that we've known for more than 20 years such as 'Speak the users' language', 'Provide feedback' and 'Write constructive and comprehensible error messages'.
Usability tests are expensive. They are an inefficient way of discovering usability problems. Many of the problems uncovered by a usability test should never have occurred in the first place – they should've been prevented by the designers' or programmers' knowledge of basic usability rules.
The main lesson from myth two is: use usability tests to motivate your co-workers to take action to prevent usability problems. In other words, consider designers and programmers as the primary users of your usability test results. Let's use usability testing to motivate our primary users to learn about common usability pitfalls. Let's focus on preventing problems rather than curing them.
Main image used courtesy of illustir via Flickr under the Creative Commons License




9 comments
Comment: 1
Usability Tests are as much for the observer as the user. This is important. Getting to know your user and think like them is just as or more important than compiling statistics on error rates.
Usability Test Reports are a good way to get management involved in the process and to think about actual users, not just lists of features and marketing statistics.
Giant cumbersome bureaucratic six month long statistical analysis type usability tests are rarely appropriate for modern agile based development teams. But that doesn't mean there should be no testing at all.
Some serious flaws in this article:
Expert Reviewers (e.g., UI Designers or Subject Matter Experts) are better than usability testing at finding design problems. This might be true for obvious problems. But if you want to be an innovative software designer and find out what trips your users up when they use your software, there is no substitute for spending time with the people whose lives you actually affect with your work.
If your user testing is uncovering 200 - 1000 SERIOUS usability problems you should maybe consider a career in fast food and get out of the software business as fast as possible. I think the point of contention here might be the definition of serious. There are always many subjective and objective problems with a particular software design. Finding out what is important to the user's success is my goal.
Strangely at the end of the article the author seems to reverse himself and say how important testing is. So I'm not sure what the point overall is except maybe to bash Jacob Neilsen?
Comment: 2
I agree that the findings in this article are controversial. They may also be inconvenient for some people. The reported findings are based on 9 serious, large scale studies carried out since 1998. They are called CUE for Comparative Usability Evaluation. The results may be hard to believe. That's why we have run repeated studies to double and triple check the results. Our results and analyses have been peer reviewed and published in respected magazines like BIT - Behaviour and Information Technology and JUS - Journal of Usability Studies. The article contains a link to my company's website where you can find details.
I am arguing based on facts and results of 9 serious studies with 120+ highly respected and experienced usability professionals. What are your objections based on?
I understand that you have more objections against the article. I would be grateful if you would bring them forward - either in this discussion forum or directly by email to me.
From your LinkedIn profile I understand that you are directly connected to Jared Spool and Louis Rosenfeld. I respectfully suggest that you ask these two people about the credibility of the CUE-studies.
One final point: My friend and co-worker Jakob Nielsen appreciates when people spell his name correctly.
Thank you for taking the time to voice your concerns. I appreciate it.
Best regards, Rolf Molich.
Comment: 3
Comment: 4
I have a question, though, about No. 1.
I would be interested in learning more about the particulars of the research methodology used. The results discussed are quite different from other studies I’ve seen.
Have you read Faulkner (2003)? (The full reference is below.)
She did a very nice empirical evaluation of Nielsen’s 5-user claim and found that, though it’s true that you typically find ON AVERAGE 85% of the usability issues in a system by running 5 users, the RANGE is 55 to 100%.
In her study, running 10 users found 95% of issues and upping that to 15 users only increased the average percentage to 97%.
Faulkner, L. (2003). Beyond the five-user assumption: Benefits of increased sample sizes in usability testing. Behavior Research Methods, Instruments, & Computers, Vol 35, No. 3, 379-383.
Comment: 5
Comment: 6
Thanks for your interest in our studies.
The section "The Comparative Usability Evaluation (CUE) studies" in the article provides a link to further information about the CUE-studies on which the article is based.
Laura Faulkner's study tested 60 participants with one task. It's not clear from the article who moderated the test sessions and who analyzed the results. I assume that one person did all this.
Our studies clearly show that many more problems show up as soon as you start varying the moderator, the analyzer, and the usability test tasks. As far as I can see, Laura did not do this. So her results are probably correct and they do not contradict our results. If one person does a usability study with the same task set throughout the study, then 5-10 users are probably enough to find most of the problems that this moderator is able to find.
Our CUE-4 study tested the website for the Hotel Pennsylvania in New York. 17 teams independently and simultaneously evaluated the website. The maximum overlap between any two teams was 30 issues. It occurred between team H, which reported 67 issues, and team M, which reported 56 issues. The minimum overlap was just one issue. Team J, which reported 19 issues, had only one issue in common with team S, which reported 40 issues. Team J had only two issues in common with each of teams D, P and R, which reported 40, 32 and 26 issues, respectively.
In other words: If you do a usability study of a product, and a competent colleague of yours independently and simultaneously does a similar study, chances are that you will report rather different findings. Of course, that does not mean that any of the findings are invalid. It simply means that different people see different problems. This is known as "the evaluator effect".
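One way to put numbers on the overlap is to divide the issues two teams share by all the distinct issues either of them reported. The sketch below applies that ratio to the CUE-4 figures quoted above. The ratio (the Jaccard index) and the function name are assumptions made for illustration and are not necessarily the measure used in the CUE papers.

```python
def overlap_ratio(issues_a: int, issues_b: int, shared: int) -> float:
    """Shared issues divided by all distinct issues reported by either team."""
    return shared / (issues_a + issues_b - shared)


# Figures quoted above for CUE-4
best = overlap_ratio(67, 56, 30)   # teams H and M, the largest overlap
worst = overlap_ratio(19, 40, 1)   # teams J and S, the smallest overlap
print(f"H and M: {best:.0%}, J and S: {worst:.0%}")
```

Even the best-matching pair of teams agrees on only about a third of the issues that the two of them reported between them.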
Even though few of the participating teams reported more than 70 findings, the overlap was so limited that the total number of findings consistently accumulated to 200+ findings. Many of these findings were reported by single teams only. Even though they were uniquely reported, few of these findings appeared invalid. And yes, some of the uniquely reported findings were classified as "critical" by the team that reported them, and I have no reason to doubt their classification.
In our latest study, CUE-9, we asked 35 teams to independently evaluate the same 5 video tapes of test sessions of the U-Haul.com website. Again, reported findings varied wildly, although not as wildly as when each team created their own tasks and moderated the sessions themselves.
Best regards, Rolf
Comment: 7
Thank you for your excellent and very helpful reply. I write a popular UX blog on the Intel Corporation intranet that’s read by thousands of people. I’ll definitely be discussing and referencing your post.
The results you discuss raise a fascinating question.
First, I know you’re well aware there are studies going back 2 decades showing that heuristic evaluation often outperforms usability studies at identifying usability issues. (In fact I think you did some of them!)
This is given the caveat that they’re performed correctly. One paper I’ve read, for which I can’t find the reference, discussed a meta-analysis of studies comparing different UIMs and found that the most effective is heuristic evaluation, given that the evaluators are experts, not novices, that you have several expert evaluators independently conduct their HEs, that their HEs are done in an unstructured way (not using a template), etc.
So, the question….
If, as your data suggest, usability study results vary so greatly due to differences in the tasks tested, who the moderator was, who the analyzer was etc., does this, in and of itself, speak to the relative ineffectiveness of usability studies?
To me it would, especially if the results of expert HEs are less variable, that is, if the results of HEs conducted independently by experts have a greater degree of overlap than the overlap found in usability tests.
Do you have any data on this?
Thanks,
Charles
Comment: 8
Thanks for your continued interest in our studies and my opinions.
I distinguish sharply between expert reviews and heuristic inspection. Expert reviews are carried out by experts based on their usability or domain expertise. Heuristic inspections can be carried out by anyone based on a limited number of usability heuristics, most often about 10 heuristics. Many people base their heuristic inspections on the heuristics that Jakob Nielsen and I came up with around 1990, and which Jakob later improved.
We found that none of the professionals in the CUE-studies did heuristic inspection. Most of the participating professionals said up front that they had done an expert review. We found that the few that claimed that they had done a heuristic inspection actually had done reviews or expert reviews. We concluded this after realizing that many of their fully valid usability findings could not have been found by using any of the 10 heuristics that they claimed that they had been using.
Results from true heuristic inspections in my opinion are of little value because an orthodox heuristic inspection limits you to the 10 or so heuristics that you have chosen. Problems that cannot be attributed to any of the heuristics should not be reported because they could be false positives. The original idea behind the heuristic inspection method was that it would allow laymen to discover a limited number of usability problems, and not necessarily the most serious problems.
Expert review results are as good as usability test results. I have no data to show that they are better, and I don't believe they are.
The reason why different moderators and different tasks find different usability problems is that there are so many different usability problems in today's complex products. You could compare a usability test or an expert review to a drawing of 30 balls from an urn that contains 300+ balls. No one would be surprised if two people came up with different balls from separate drawings. So results from two independent usability tests differ even when both are executed completely correctly. It's not a quality problem.
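The urn comparison can be worked through with a little arithmetic. The sketch below assumes exactly 300 balls, 30 balls per drawing, and that every ball is equally likely to be drawn. That is an idealisation: real problems are not all equally easy to find, and the text says only "300+".

```python
from math import comb

URN = 300   # problems in the product (the text says 300+)
DRAW = 30   # problems reported by one evaluation

# Expected number of balls that two independent drawings have in common
expected_shared = DRAW * DRAW / URN


def p_shared(k: int) -> float:
    """Probability that two drawings share exactly k balls (hypergeometric)."""
    return comb(DRAW, k) * comb(URN - DRAW, DRAW - k) / comb(URN, DRAW)


print(f"expected shared problems: {expected_shared:.1f}")
print(f"chance of sharing 5 or fewer: {sum(p_shared(k) for k in range(6)):.0%}")
```

Under these assumptions two correct, independent evaluations would be expected to share only about three of their 30 reported problems.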
CUE-4 showed that results from independently conducted expert reviews differ as much as results from usability tests.
You may want to take a look at our CUE-4 paper, which explains this in much more detail: Comparative usability evaluation (CUE-4), Rolf Molich & Joseph S. Dumas, Behaviour & Information Technology, Volume 27 Issue 3, May 2008, Pages 263-281. Please contact me by email if you want me to send you a copy.
Best regards, Rolf
Comment: 9
Charles