Measuring usability with a single score
Like most authority figures, senior managers are often subject to a certain degree of sarcasm. A popular depiction among programmers is the pointy-haired boss in the Dilbert cartoons. But every now and again we come across a manager with a sharp mind who points us in a direction we hadn’t thought to go.
In a recent client meeting, for example, we were discussing usability changes to a warehousing database system. Predictably, we encountered a list of reasons why our suggested design changes couldn’t be put into practice. In fact, there were almost knee-jerk objections: “We can’t change the back-end process,” “We don’t have the budget to make these changes,” and “That won’t be a problem for our users because they’ll have documentation.”
So we were taken a bit by surprise when the boss asked, “On a scale of 1 to 100, how usable is our system?”
We knew what he meant, of course. He wasn’t after a precise number, like 63 or 87. He wanted a qualitative feel for how usable we felt the system to be. After all, most people perceive usability as a qualitative discipline and characterize it as fuzzy and opinion-based. “About a 70” would probably have satisfied him. But it got us thinking: Could we specify the usability of his system precisely and characterize it with a single number? Could usability be measured and tracked just like any other engineering attribute?
The answer is yes. But to measure and track an attribute, we must first define it, and the definition we prefer appears in the international standard BS EN ISO 9241. This ISO standard defines usability as:
Extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use.
Note the three components in this definition:
- Effectiveness: The accuracy and completeness with which customers achieve specified goals.
- Efficiency: The accuracy and completeness of goals achieved in relation to resources.
- Satisfaction: Freedom from discomfort and positive attitudes toward the use of the system.
The term satisfaction in the definition of usability has been criticized for conveying the idea of being adequate or just good enough, which is hardly a design goal worthy of aspiration. This is more an artifact of current usage: The dictionary defines satisfaction as “the feeling of pleasure that comes when a need or desire is fulfilled.”
This definition provides us with precise and objective terms to which we can assign numerical values. With this approach to evaluating usability, the boss’s system can be assessed strictly in terms of customer performance and satisfaction, where the customer is the critical part of the system. This is in contrast to evaluation by functional testing, which usually tests the system without end-user involvement.
It is important to note that performance and attitude measures are often independent. That is, a system may be effective and efficient, but the customer may have a poor opinion of it. Therefore, all three aspects of usability must be addressed in the design, development and evaluation of the system to optimize the customer experience.
Now that we have defined usability, we can start to measure it. Let’s begin with effectiveness.
Effectiveness is the accuracy and completeness with which customers achieve their goals. Imagine that we are conducting a usability test with 10 participants. We give each of the participants a task to carry out. For a warehouse, this might be to use the database to track the location of an item in the supply chain. (Note: Tasks must be realistic. Also, test subjects may be offered various incentives for their participation.)
Let’s assume that two of our participants slip up and are unable to find the item they need to track, even though it exists in the system. Or perhaps they get bogged down trying to understand the terms in the supply chain. The result is that there are eight successful participants and two unsuccessful participants, or a task completion rate of 80 percent.
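The completion-rate arithmetic above is straightforward to express in code. The sketch below uses hypothetical test results (function and variable names are our own):

```python
# Effectiveness: task completion rate from a usability test.
# Hypothetical outcomes for 10 participants: True = task completed.
results = [True, True, False, True, True, True, False, True, True, True]

def effectiveness(outcomes):
    """Percentage of participants who completed the task."""
    return 100.0 * sum(outcomes) / len(outcomes)

print(effectiveness(results))  # 80.0
```

With eight successes out of ten participants, the function returns the 80 percent completion rate from the example.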
Efficiency is the accuracy and completeness of goals achieved in relation to resources. To come up with an efficiency rating, we need to measure how long people take to complete the task. The obvious way to do this is to measure elapsed time. But our experience shows that time measurements can be variable and affected by so-called test anxiety. That is, some participants work more slowly than they normally would because they feel they are under scrutiny.
So we prefer to measure the number of screens that the user needs to load to complete the task. We can then express this relative to a “perfect” user: one who completes the task exactly as the developers intended, without being sidetracked by usability problems. If, for example, the perfect user loads 10 different screens to locate the item in the warehouse as opposed to the test participant who loads 20 screens, we can say that the test participant has an efficiency score of 50 percent.
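The screen-count ratio can be sketched as follows; the perfect-path figure of 10 screens comes from the example, and the function name is our own:

```python
# Efficiency: screens loaded on the "perfect" path divided by the
# screens a test participant actually loaded, as a percentage.
PERFECT_SCREENS = 10  # developers' intended path for the example task

def efficiency(screens_loaded, perfect=PERFECT_SCREENS):
    """Efficiency score relative to the perfect user's path."""
    return 100.0 * perfect / screens_loaded

print(efficiency(20))  # 50.0: participant needed twice the screens
```

A participant who follows the intended path exactly scores 100 percent; every extra screen loaded pulls the score down.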
Measures of satisfaction are then taken using a questionnaire, examples of which can be found online. In our example, upon completion of the task the participants answer a questionnaire containing statements about the system. They are asked to circle a number between 1 and 5, where 1 represents “strongly disagree” and 5 represents “strongly agree.”
To keep people from developing a response set (circling all the 1s or 5s, for example), half of the questions should be positively phrased (“I thought the system was easy to use”) and half should be negatively phrased (“I need to learn a lot more about this system before I can use it effectively”). A percentage is calculated by reverse scoring the negatively phrased questions, calculating each participant’s average score across all the questions, and then dividing that average by the maximum score of 5 and multiplying by 100.
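The reverse scoring and averaging can be sketched as below. The four-item questionnaire and its answers are hypothetical; we assume a 1–5 scale where, after reverse scoring, 5 always represents the most satisfied response:

```python
# Satisfaction: convert 1-5 questionnaire answers to a percentage.
def satisfaction(ratings, negative_items):
    """ratings: one 1-5 answer per statement (5 = strongly agree).
    negative_items: indices of negatively phrased statements, which
    are reverse-scored (1 <-> 5, 2 <-> 4) so 5 is always best."""
    scored = [(6 - r) if i in negative_items else r
              for i, r in enumerate(ratings)]
    average = sum(scored) / len(scored)
    return 100.0 * average / 5  # express as a percentage of the maximum

# Four statements, alternating positive and negative phrasing:
print(satisfaction([5, 1, 4, 2], negative_items={1, 3}))  # 90.0
```

Here the participant strongly agrees with the first positive statement and strongly disagrees with the first negative one, producing a high satisfaction score.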
Dealing with variability
How robust is our measure of usability? Does it need 10 or 50 participants?
If we run a test with 10 participants one day and the same test the next day with 10 different participants, it’s unlikely that we will get exactly the same results. On the effectiveness score, for example, we might find only six participants who are successful, or perhaps all 10 complete the task. This variation is natural in behavioral testing, but it doesn’t mean that huge samples are needed to get robust data. We simply need to provide a confidence interval for our usability measure, in much the same way that pollsters provide a margin of error when predicting the result of a general election based on a sample of interviews.
With usability testing, we can achieve this result with some simple statistics. We can express our measured value and provide the 95 percent confidence intervals around this mean value. For more information on how to achieve this in practice and still use relatively small sample sizes, see “You don’t need a large sample of users to obtain meaningful data” (http://www.measuringusability.com/sample.htm).
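One common way to compute such an interval for a completion rate with a small sample is the adjusted Wald (Agresti-Coull) interval, sketched below; the article does not name a specific method, so this choice is our assumption:

```python
import math

# 95% confidence interval for a task completion rate, using the
# adjusted Wald (Agresti-Coull) interval, which behaves reasonably
# for the small samples typical of usability tests.
def adjusted_wald(successes, n, z=1.96):
    """Return (low, high) bounds on the true completion rate."""
    n_adj = n + z * z                      # adjusted sample size
    p_adj = (successes + z * z / 2) / n_adj  # adjusted proportion
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald(8, 10)  # 8 of 10 participants succeeded
print(f"{low:.0%} to {high:.0%}")
```

For the example's 8-of-10 completion rate, the interval is wide, which is exactly the point: a small sample can still yield a defensible, bounded estimate.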
A single metric for usability
We now have our three components of usability, each expressed as a percentage. By averaging these three scores, we can define the usability of the product with a number between 0 and 100. As well as satisfying the manager’s curiosity, we can use this score to answer some fundamental questions:
- How usable is this system?
- How do we rate against the competition?
- What will qualify as “usable enough”?
- How much more usable must the system be?
- How will we know if our system is more usable?
- How much will it cost to make it more usable?
- How much additional revenue will this added usability yield?
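The final averaging step is trivial; the sketch below reuses the effectiveness and efficiency figures from the running example, with a hypothetical satisfaction score:

```python
# Combine the three component percentages into a single usability score.
effectiveness = 80.0  # task completion rate from the example
efficiency = 50.0     # perfect path vs. actual screens loaded
satisfaction = 90.0   # questionnaire percentage (illustrative figure)

usability_score = (effectiveness + efficiency + satisfaction) / 3
print(round(usability_score, 1))  # 73.3
```

So the boss gets his answer: on a scale of 0 to 100, this hypothetical system scores about a 73.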
Developers are working in a culture that values measurement. Things that are measured get done, and things that can’t be measured tend to be ignored. If you want your project to be given priority, consider calculating its usability score. Your users (and perhaps your pointy-haired boss) will thank you.