What to do if SUS scores contradict qualitative feedback?
TL;DR: qualitative data collected in a usability experiment seems to contradict the quantitative results of the SUS questionnaire. How can this discrepancy be reconciled?
The following experiment is conducted to evaluate the usability of a web-interface:
- Observe participants as they think aloud while using the interface to accomplish 8 tasks (the task order is randomized; this takes around 30 minutes)
- Give them a SUS form to fill out (scored in the standard way; see the sketch after this list)
- After they complete the survey, ask several follow-up questions to get more feedback (another 30 minutes)
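For clarity, each completed form is converted to a 0-100 score using the standard SUS formula: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5. A minimal scoring sketch (the example answers are made up, not a participant's actual form):

```python
# Standard SUS scoring: 10 Likert items rated 1-5.
# Odd-numbered items contribute (response - 1), even-numbered items (5 - response);
# the sum is multiplied by 2.5 to give a 0-100 score.

def sus_score(responses):
    """responses: the 10 answers, in questionnaire order, each in 1..5."""
    assert len(responses) == 10
    contributions = [
        (r - 1) if i % 2 == 0 else (5 - r)  # i is 0-based, so even i = odd-numbered item
        for i, r in enumerate(responses)
    ]
    return 2.5 * sum(contributions)

# Made-up example form, not an actual participant's answers:
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 2]))  # -> 80.0
```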
So far, the experiment has been conducted with 5 participants, after which the UI was adjusted to address the issues found. A second round of 5 participants was then invited to go through the same steps.
It is planned to perform another round, with at least 5 participants (to obtain a sufficiently large sample). The current results are summarized below:
You can see that the mean v2 score is lower than that of v1.
These findings are puzzling, because:
- the qualitative feedback I got from participants was more positive in v2
- the changes between v1 and v2 were not ground-breaking, e.g.:
  - added tooltips to widgets
  - increased the contrast to make the active tab more prominent
  - changed wording to avoid technical jargon
  - shortened text
- nevertheless, these tweaks did polish the "rough edges" of v1, as it was clear from the observations that there was less friction while participants used the site
In other words, the changes were small incremental steps that should have yielded small improvements. The qualitative results match the expectations, while the quantitative data do not.
Since the overall average of 69 falls in line with the average SUS score of 68, it seems that nothing unusual has happened and we're testing "just an average interface". However, I am not sure how to reconcile the fact that the numbers contradict the human feedback.
Nielsen says that qualitative feedback is more valuable and that numbers can lead you astray. On the other hand, Sauro reports SUS scores based on samples of 5 users (and, looking at the history of sample sizes, concludes that a minimum of 5 is reasonable).
At the same time, a t-test shows that the difference between the v1 and v2 scores is not statistically significant.
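For reference, the test is just a two-sample comparison of the per-participant scores; a minimal sketch of it, with placeholder scores standing in for the actual data:

```python
from scipy import stats

# Placeholder per-participant SUS scores (NOT the real study data).
v1 = [72.5, 65.0, 80.0, 67.5, 77.5]
v2 = [70.0, 72.5, 37.5, 75.0, 67.5]

# Welch's t-test: two independent samples, unequal variances allowed,
# which is the safer choice with tiny groups and a possible outlier.
t, p = stats.ttest_ind(v1, v2, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")  # a large p means no detectable difference
```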
How could one make sense of these results?
Thank you all for your comments, answers, and time. Although there is only one accepted answer, all the input is helpful. It enabled me to take a sober look at the data and reduce the "jumptoconclusionness" factor.
A note for future archaeologists: the question was edited to include details and statistics mentioned in the comments. It might help to look at the edit history to see the starting point and understand how it ended up like this.
usability-testing interview sus thinkaloud
Besides what the other people have already said, it's quite obvious that the v2 result is heavily skewed by the 37.5 outlier. This is a common issue when applying quantitative methods to a sample that's too small: if even one person produces extreme values, it shifts the whole distribution.
– Big_Chair
May 11 at 18:52
Indeed, Nielsen wrote about this matter too, stating "For now, my best conclusion is that slow outliers are caused by bad luck rather than by a persistent property of the users in question." So this is not uncommon; still, I can't just remove this result, as that would be akin to omitting "inconvenient evidence". What would be an objective way to present these results?
– ralien
May 11 at 19:17
I am not 100% sure, but one approach could be to omit (or put less focus on) the calculated average value and instead treat each result individually, in conjunction with its qualitative feedback, e.g. "P1's feedback was ... with an accompanying score of 72.5", and only show the average at the end.
– Big_Chair
May 11 at 21:45
Or present them normally as they are but show them in a diagram as well and put strong focus on the outlier, while also presenting an alternative average value without it. I'd go with this type of diagram over a bar chart since the difference is more noticeable (might also just be my personal preference).
– Big_Chair
May 11 at 21:48
It may be more instructive to look at the median, especially when dealing with small samples of humans. Here you have 67.5 vs 70 for the medians, which are essentially the same.
– Chris H
2 days ago
asked May 11 at 13:23 by ralien, edited yesterday
5 Answers
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p-values, etc.?)
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
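As a rough illustration of that last question, a standard two-sample power calculation (sketched here with statsmodels and arbitrary effect sizes) shows how fast the required n grows as the effect shrinks:

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per group for 80% power at alpha = 0.05,
# for a range of illustrative (arbitrary) Cohen's d effect sizes.
analysis = TTestIndPower()
for d in (0.8, 0.5, 0.2, 0.1):
    n = analysis.solve_power(effect_size=d, power=0.8, alpha=0.05)
    print(f"d = {d}: about {n:.0f} participants per group")
```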
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as they are a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.
That is as "expected".
I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?
– ralien
May 12 at 0:51
3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.
– ralien
May 12 at 1:13
I'd like to tap into your knowledge a bit more, as my use of "experiment" is more relaxed than yours. 1) Have you got a favourite book about experiment design that you can recommend? [one that would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?
– ralien
yesterday
4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?
– ralien
yesterday
I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".
– xiota
yesterday
How to reconcile the discrepancy? That I can't tell, but here is why it might have happened.
The "5 users will find everything wrong with your system" refers to usability problems test subjects will find in your usability tests. Sauro has a great article that goes quite deep in to this "5 users is enough" thingie.
The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%.
Jeff Sauro - Why you only need to test with five users (explained)
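The arithmetic behind the five-user number is the standard problem-discovery formula, P = 1 - (1 - p)^n; with p ≈ 0.31, five users gets you to roughly 85%:

```python
# Problem-discovery formula: chance a problem is seen at least once by n users,
# if each user independently encounters it with probability p.
p = 0.31  # the per-user probability quoted above

for n in (1, 3, 5, 8, 15):
    found = 1 - (1 - p) ** n
    print(f"n = {n:2d}: {found:.0%} of such problems expected to surface")
```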
SUS, on the other hand, needs a bigger sample size to yield meaningful results. In their 2010 paper A Comparison of Questionnaires for Assessing Website Usability, Tullis & Stetson came to the conclusion that you need a sample size of 12 or more for SUS to yield a 100% accurate analysis of whatever you are analysing.
As one would expect, the accuracy of the analysis increases as the sample size gets larger. With a sample size of only 6, all of the questionnaires yield accuracy of only 30-40%, meaning that 60-70% of the time, at that sample size, you would fail to find a significant difference between the two sites.
So with a sample size of 5 you'll get something meaningful less than 30% of the time.
For example, SUS jumps up to about 75% accuracy at a sample size of 8, while the others stay down in the 40-55% range. It’s also interesting to note that most of the questionnaires appear to reach an asymptote at a sample size of 12.
Then again, what is an adequate sample size for qualitative analysis? I haven't done any qualitative analysis in the UX field, so I can't tell what a sufficient sample size would be. This is what I found on the internet:
Qualitative sample sizes should be large enough to obtain enough data to sufficiently describe the phenomenon of interest and address the research questions.
Oh, really.
For an ethnography, Morse (1994) suggested approximately 30 – 50 participants. For grounded theory, Morse (1994) suggested 30 – 50 interviews, while Creswell (1998) suggested only 20 – 30. For phenomenological studies, Creswell (1998) recommends 5 – 25 and Morse (1994) suggests at least six.
Qualitative Sample Size
+1 Some really good references here. It's a pretty big call to say that SUS will yield a 100% accurate conclusion, since there is always some degree of uncertainty about what you are doing unless you can sample every single user. And even then they are not guaranteed to tell you exactly what they think...
– Michael Lai♦
May 12 at 21:58
Look at your confidence intervals: the "real" score for v1 is somewhere between 58 and 88, while that score for v2 is somewhere between 51 and 79. Most notably, the confidence interval for each score contains the mean for the other score.
What this is telling you is that your sample size is too small. Based on the data you've collected so far, you cannot tell anything about the relative merits of the two versions of the user interface. There are various statistical tests you can use to quantify just how thoroughly you don't know anything, but checking for overlapping confidence intervals is fast and reasonably accurate.
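If you want to reproduce that check, a t-based 95% confidence interval for a five-person mean is easy to compute; a sketch, with placeholder scores standing in for the real data:

```python
import numpy as np
from scipy import stats

def ci95(scores):
    """t-based 95% confidence interval for the mean of a small sample."""
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    sem = scores.std(ddof=1) / np.sqrt(n)          # standard error of the mean
    half = stats.t.ppf(0.975, df=n - 1) * sem      # t critical value for 95%
    return scores.mean() - half, scores.mean() + half

# Placeholder scores, not the real study data.
v1 = [72.5, 65.0, 80.0, 67.5, 77.5]
v2 = [70.0, 72.5, 37.5, 75.0, 67.5]

(lo1, hi1), (lo2, hi2) = ci95(v1), ci95(v2)
print(f"v1: [{lo1:.1f}, {hi1:.1f}], v2: [{lo2:.1f}, {hi2:.1f}]")
print("intervals overlap:", max(lo1, lo2) <= min(hi1, hi2))  # overlap -> inconclusive
```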
I think the most important thing is to realize that using a combination of qualitative and quantitative analysis methods is going to give you the most complete picture of what the users are thinking and doing, rather than relying on one or the other (which, as you can see, led to different conclusions). The idea is to use quantitative analysis when you have larger datasets to look for large trends or patterns, and use qualitative methods to drill in on the specific issues. It is also possible to combine various observations in qualitative methods and look for evidence/support of a general trend in the quantitative analysis.
Something that I think is missing in the analysis (in addition to what has already been raised in the other answers) is that there is potentially other contextual information that you haven't used to split some of the responses.
For example, I can imagine that there are different types of user groups mixed together in the analysis and that some of the users have a particular mental model or preferences that influence the more subjective elements of the SUS questionnaire.
However, when you look at the quantitative data (which should probably also be based on some of the analytics collected while they are completing the task rather than just observations) you will probably find less bias or subjectivity as you build up the number of users/participants.
Combining methods is fine if they are used appropriately. The problem is OP is trying to use a screening methodology for confirmation. For subgroup analysis to be worthwhile, more participants are needed. With n=5, you're going to be looking at subgroups of n=1-2.
– xiota
2 days ago
@xiota I agree that the methodology chosen needs to be appropriate for the task. In general the assessment of usability can be impacted by some well-known factors (e.g. computer literacy, familiarity with similar technology, etc.) and it would have been interesting to see if these factors were taken into account, and whether that might help to explain some of the discrepancies or not.
– Michael Lai♦
2 days ago
It's "interesting" to speculate, but the underlying issue is n is far too small to "explain" anything. There might be a "reason" or it could be entirely random. That is what it means when the confidence intervals are so large. To support any explanations for the "discrepancies" would require running experiments with far more participants than the original "studies". It's just not worthwhile.
– xiota
2 days ago
Three scores are higher and two are lower at first. One outlier (voting 37.5 instead of 90) would be enough for that. Not to mention that the very close totals mean there is not a lot of difference. It could be that the mood or weather, or your participants, were different.
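To see how much a single vote can move a five-person average while barely touching the median, here is a toy calculation with made-up scores:

```python
from statistics import mean, median

# Made-up scores: identical except one participant votes 37.5 instead of 90.
calm_round    = [70, 72.5, 75, 67.5, 90]
outlier_round = [70, 72.5, 75, 67.5, 37.5]

print(mean(calm_round), median(calm_round))        # 75.0  72.5
print(mean(outlier_round), median(outlier_round))  # 64.5  70.0
# One vote moves the mean by 10.5 points but the median by only 2.5.
```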
Though there might actually be another issue. If we take the results at face value (even if the sample size is too small), the newer version seems to lack the ability to make people love it.
For instance, while tool tips can be helpful and actually make someone find an option faster, I personally find them very annoying, usually. They just distract unnecessarily, especially if the option would be quite easy to find without it. And they make one feel pushed in a certain direction - similar to bad advertisement.
Also, if you omitted information which would be expected, or dumbed down things too much, people would not always be happy about that - especially if they feel that usefulness is reduced. Some people might also dislike the lack of jargon if it belongs there. Or the new colors. And so on.
As I don't know your UI in detail, I can't say more. But if you look at both versions side by side, you might see some of the annoying parts yourself, and can try to find alternatives - small hover texts instead of tooltips one has to acknowledge, for instance.
You're trying to explain a statistically insignificant difference. Even if you're correct, it's basically unverifiable guessing. The goal of iterative design is to make incremental improvements, not prove one iteration better than another.
– xiota
2 days ago
I said all that already in my answer...
– Carl Dombrowski
2 days ago
Your Answer
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "102"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
noCode: true, onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fux.stackexchange.com%2fquestions%2f125597%2fwhat-to-do-if-sus-scores-contradict-qualitative-feedback%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
5 Answers
5
active
oldest
votes
5 Answers
5
active
oldest
votes
active
oldest
votes
active
oldest
votes
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.
That is as "expected".
New contributor
I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?
– ralien
May 12 at 0:51
3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.
– ralien
May 12 at 1:13
I'd like to tap into your knowledge a bit more, as my use of "experiment" more relaxed, compared to yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?
– ralien
yesterday
4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?
– ralien
yesterday
I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".
– xiota
yesterday
|
show 4 more comments
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.
That is as "expected".
New contributor
I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?
– ralien
May 12 at 0:51
3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.
– ralien
May 12 at 1:13
I'd like to tap into your knowledge a bit more, as my use of "experiment" more relaxed, compared to yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?
– ralien
yesterday
4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?
– ralien
yesterday
I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".
– xiota
yesterday
|
show 4 more comments
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.
That is as "expected".
New contributor
How can this discrepancy be reconciled?
You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)
Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.
In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.
How could one make sense of these results?
You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.
The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?
Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
What to do... ?
Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.
Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.
... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.
That is as "expected".
New contributor
edited May 12 at 20:53
New contributor
answered May 11 at 16:16
xiotaxiota
831213
831213
New contributor
New contributor
I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?
– ralien
May 12 at 0:51
3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.
– ralien
May 12 at 1:13
I'd like to tap into your knowledge a bit more, as my use of "experiment" is more relaxed compared to yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?
– ralien
yesterday
4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?
– ralien
yesterday
I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".
– xiota
yesterday
How to reconcile the discrepancy? That I can't tell, but here is why it might have happened.
The "5 users will find everything wrong with your system" refers to usability problems test subjects will find in your usability tests. Sauro has a great article that goes quite deep in to this "5 users is enough" thingie.
The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%.
Jeff Sauro - Why you only need to test with five users (explained)
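The five-user figure falls out of a simple discovery model: the probability that at least one of n users encounters a problem that each user hits independently with probability p is 1 - (1 - p)^n. A quick sketch:

```python
# Problem-discovery model behind the "five users" heuristic:
# chance that at least one of n users encounters a problem that each
# user hits independently with probability p.
p = 0.31  # per-user probability of encountering a given problem (the figure quoted above)

for n in range(1, 9):
    found = 1 - (1 - p) ** n
    print(f"n = {n}: {found:.0%} of such problems expected to be seen")
# n = 5 gives roughly 84%, which is where "approximately 85% of the problems" comes from.
```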
SUS, on the other hand, needs a bigger sample size to yield meaningful results. In their 2010 paper A Comparison of Questionnaires for Assessing Website Usability, Tullis & Stetson came to the conclusion that you need a sample size of 12 or more for SUS to yield a 100% accurate analysis of whatever you are analysing.
As one would expect, the accuracy of the analysis increases as the sample size gets larger. With a sample size of only 6, all of the questionnaires yield accuracy of only 30-40%, meaning that 60-70% of the time, at that sample size, you would fail to find a significant difference between the two sites.
So with a sample size of 5 you'll get something meaningful less than 30% of the time.
For example, SUS jumps up to about 75% accuracy at a sample size of 8, while the others stay down in the 40-55% range. It’s also interesting to note that most of the questionnaires appear to reach an asymptote at a sample size of 12.
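Those accuracy percentages come from repeatedly drawing subsamples of a given size from a larger dataset and checking how often the subsample reaches the "correct" (significant) conclusion. A rough sketch of that kind of procedure, using made-up score populations rather than the paper's data:

```python
# Rough sketch of an accuracy-vs-sample-size simulation in the spirit of the
# Tullis & Stetson comparison. All scores here are made up, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
site_a = rng.normal(72, 15, size=200).clip(0, 100)  # hypothetical SUS scores, site A
site_b = rng.normal(62, 15, size=200).clip(0, 100)  # hypothetical SUS scores, site B

def detection_rate(n, trials=2000):
    """How often a random subsample of size n per site finds a significant difference."""
    hits = 0
    for _ in range(trials):
        a = rng.choice(site_a, n, replace=False)
        b = rng.choice(site_b, n, replace=False)
        if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:
            hits += 1
    return hits / trials

for n in (5, 6, 8, 12, 20):
    print(f"n = {n}: difference detected in {detection_rate(n):.0%} of subsamples")
```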
Then again, what is an adequate sample size for qualitative analysis? I haven't done any qualitative analysis in the UX field, so I can't tell what the sufficient sample size would be. This is what I found on the internet:
Qualitative sample sizes should be large enough to obtain enough data to sufficiently describe the phenomenon of interest and address the research questions.
Oh, really.
For an ethnography, Morse (1994) suggested approximately 30 – 50 participants. For grounded theory, Morse (1994) suggested 30 – 50 interviews, while Creswell (1998) suggested only 20 – 30. For phenomenological studies, Creswell (1998) recommends 5 – 25 and Morse (1994) suggests at least six.
Qualitative Sample Size
– locationunknown, answered May 11 at 15:55
+1 Some really good references here. It's a pretty big call to say that SUS will yield a 100% accurate conclusion, since there is always some degree of uncertainty about what you are doing unless you can sample every single user. And even then they are not guaranteed to tell you exactly what they think...
– Michael Lai♦
May 12 at 21:58
Look at your confidence intervals: the "real" score for v1 is somewhere between 58 and 88, while that score for v2 is somewhere between 51 and 79. Most notably, the confidence interval for each score contains the mean for the other score.
What this is telling you is that your sample size is too small. Based on the data you've collected so far, you cannot tell anything about the relative merits of the two versions of the user interface. There are various statistical tests you can use to quantify just how thoroughly you don't know anything, but checking for overlapping confidence intervals is fast and reasonably accurate.
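Reproducing that quick check is straightforward. A minimal sketch using placeholder scores (the question's actual v1/v2 responses are not reproduced here):

```python
# Minimal sketch of the "do the confidence intervals overlap?" check.
# The scores below are placeholders, not the question's actual data.
import numpy as np
from scipy import stats

def mean_ci(scores, confidence=0.95):
    """Return the sample mean and a t-based confidence interval for it."""
    scores = np.asarray(scores, dtype=float)
    m = scores.mean()
    half = stats.sem(scores) * stats.t.ppf((1 + confidence) / 2, len(scores) - 1)
    return m, m - half, m + half

v1 = [90.0, 72.5, 67.5, 60.0, 75.0]   # placeholder round-1 SUS scores
v2 = [37.5, 70.0, 72.5, 65.0, 80.0]   # placeholder round-2 SUS scores

for name, scores in (("v1", v1), ("v2", v2)):
    m, lo, hi = mean_ci(scores)
    print(f"{name}: mean = {m:.1f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
# If each interval contains the other group's mean, the comparison tells you very little.
```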
– Mark, answered May 12 at 4:32
I think the most important thing is to realize that using a combination of qualitative and quantitative analysis methods is going to give you the most complete picture of what the users are thinking and doing, rather than relying on one or the other (which, as you can see, led to different conclusions). The idea is to use quantitative analysis when you have larger datasets to look for large trends or patterns, and use qualitative methods to drill in on the specific issues. It is also possible to combine various observations in qualitative methods and look for evidence/support of a general trend in the quantitative analysis.
Something that I think is missing in the analysis (in addition to what has already been raised in the other answers) is that there is potentially other contextual information that you haven't used to split some of the responses.
For example, I can imagine that there are different types of user groups mixed together in the analysis and that some of the users have a particular mental model or preferences that influence the more subjective elements of the SUS questionnaire.
However, when you look at the quantitative data (which should probably also be based on some of the analytics collected while they are completing the task rather than just observations) you will probably find less bias or subjectivity as you build up the number of users/participants.
– Michael Lai♦, answered May 12 at 22:07
Combining methods is fine if they are used appropriately. The problem is OP is trying to use a screening methodology for confirmation. For subgroup analysis to be worthwhile, more participants are needed. With n=5, you're going to be looking at subgroups of n=1-2.
– xiota
2 days ago
@xiota I agree that the methodology chosen needs to be appropriate for the task. In general the assessment of usability can be impacted by some well-known factors (e.g. computer literacy, familiarity with similar technology, etc.) and it would have been interesting to see if things were taken into account, and whether it might help to explain some of the discrepancies or not.
– Michael Lai♦
2 days ago
It's "interesting" to speculate, but the underlying issue is n is far too small to "explain" anything. There might be a "reason" or it could be entirely random. That is what it means when the confidence intervals are so large. To support any explanations for the "discrepancies" would require running experiments with far more participants than the original "studies". It's just not worthwhile.
– xiota
2 days ago
At first glance, three scores are higher and two are lower. One outlier (a vote of 37 instead of 90) would be enough to explain that. Not to mention that the very close totals mean there is not a lot of difference. It could be that the mood, the weather, or your participants were different.
Though there might actually be another issue. If we take the results at face value (even if the sample size is too small), the newer version seems to lack the ability to make people love it.
For instance, while tool tips can be helpful and actually make someone find an option faster, I personally find them very annoying, usually. They just distract unnecessarily, especially if the option would be quite easy to find without it. And they make one feel pushed in a certain direction - similar to bad advertisement.
Also, if you omitted information which would be expected, or dumbed down things too much, people would not always be happy about that - especially if they feel that usefulness is reduced. Some people might also dislike the lack of jargon if it belongs there. Or the new colors. And so on.
As I don't know your UI in detail, I can't say more. But if you look at both versions side by side, you might see some of the annoying parts yourself, and can try to find alternatives - small hover texts instead of tool tips one has to acknowledge, for instance.
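The outlier point is easy to quantify: one low response moves the mean by several points but barely moves the median. A small sketch with illustrative scores (not the question's actual data):

```python
# Small sketch: one low SUS response shifts the mean noticeably
# but leaves the median almost unchanged. Illustrative scores only.
import statistics

with_outlier = [37.5, 65.0, 70.0, 72.5, 80.0]
without_outlier = [s for s in with_outlier if s != 37.5]

print("mean with outlier:     ", statistics.mean(with_outlier))
print("mean without outlier:  ", statistics.mean(without_outlier))
print("median with outlier:   ", statistics.median(with_outlier))
print("median without outlier:", statistics.median(without_outlier))
```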
– Carl Dombrowski, answered May 12 at 22:58
You're trying to explain a statistically insignificant difference. Even if you're correct, it's basically unverifiable guessing. The goal of iterative design is to make incremental improvements, not prove one iteration better than another.
– xiota
2 days ago
I said all that already in my answer...
– Carl Dombrowski
2 days ago
Besides what the other people have already said, it's quite obvious that the result of v2 is massively skewed by the outlier of 37.5. This is a common issue when applying quantitative methods to a sample that's too small: if even one person produces extreme values, it shifts the whole distribution.
– Big_Chair
May 11 at 18:52
Indeed, Nielsen wrote about this matter too, stating "For now, my best conclusion is that slow outliers are caused by bad luck rather than by a persistent property of the users in question." So this is not uncommon; still, I can't just remove this result; that would be akin to omitting "inconvenient evidence". What would be an objective way to present these results?
– ralien
May 11 at 19:17
I am not 100% sure, but one approach could be to omit (or put less focus on) the calculated average value and instead treat each result individually, in conjunction with their qualitative results. I.e. "P1's feedback was ... with an accompanying score of 72.5" and only show the average at the end.
– Big_Chair
May 11 at 21:45
Or present them normally as they are but show them in a diagram as well and put strong focus on the outlier, while also presenting an alternative average value without it. I'd go with this type of diagram over a bar chart since the difference is more noticeable (might also just be my personal preference).
– Big_Chair
May 11 at 21:48
It may be more instructive to look at the median, especially when dealing with small samples of humans. Here you have 67.5 vs 70 for the medians, which are essentially the same.
– Chris H
2 days ago