What to do if SUS scores contradict qualitative feedback?


TL;DR: qualitative data collected in a usability experiment seems to contradict the quantitative results of the SUS questionnaire. How can this discrepancy be reconciled?



The following experiment is conducted to evaluate the usability of a web-interface:



  1. Observe participants as they think aloud while using the interface to accomplish 8 tasks (the task order is randomized; this takes around 30 minutes)

  2. Give them a SUS form to fill out (the standard 10-item questionnaire; scoring sketched right after this list)

  3. After they complete the survey, ask several follow-up questions to gather more feedback (another 30 minutes)
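
For context, a minimal sketch of how a single filled-in SUS form turns into the 0-100 scores discussed below (this is the standard Brooke scoring rule; the example response at the end is made up):

```python
# Standard SUS scoring: for odd-numbered items the contribution is (answer - 1),
# for even-numbered items it is (5 - answer); the sum of the ten contributions
# is multiplied by 2.5 to land on the 0-100 scale.
def sus_score(answers):                  # answers: ten Likert responses, each 1..5
    assert len(answers) == 10
    total = sum(a - 1 if i % 2 == 0 else 5 - a   # 0-based index: even i = odd item
                for i, a in enumerate(answers))
    return total * 2.5

print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # 77.5 for this made-up response
```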

So far, the experiment has been conducted with 5 participants, after which the UI was adjusted to address the issues found. A second round of 5 participants was then invited to go through the same steps.



It is planned to perform another round, with at least 5 participants (to obtain a sufficiently large sample). The current results are summarized below:



[Figure: SUS survey results with 95% confidence intervals]



You can see that the v2 score is lower than v1.



These findings are puzzling, because:



  • the qualitative feedback I got from participants was more positive in v2


  • the changes between v1 and v2 were not ground-breaking, e.g.:



    • added tooltips to widgets

    • increased the contrast to make the active tab more prominent

    • changed wording to avoid technical jargon

    • shortened text


  • nevertheless, these tweaks did polish the "rough edges" of v1, as it was clear from the observations that there was less friction while participants used the site


In other words, the changes were small incremental steps that should have yielded small improvements. The qualitative results match the expectations, while the quantitative data do not.



Since the overall average of 69 falls in line with the average SUS score of 68, it seems that nothing unusual has happened and we're testing "just an average interface". However, I am not sure how to reconcile the fact that the numbers contradict the human feedback.



Nielsen says that qualitative feedback is more valuable and numbers can lead you astray. Sauro, on the other hand, says that they do report SUS scores based on a sample of 5 users (and reviews the history of sample sizes, concluding that a minimum of 5 is reasonable).



At the same time, a t-test says that the differences between the scores of v1 and v2 are not statistically significant.
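
For reference, a minimal sketch of the kind of check involved, in Python; the per-participant scores in the two arrays are made-up placeholders, not the actual data:

```python
import numpy as np
from scipy import stats

v1 = np.array([62.5, 70.0, 67.5, 90.0, 75.0])   # hypothetical v1 SUS scores (n = 5)
v2 = np.array([37.5, 72.5, 70.0, 77.5, 67.5])   # hypothetical v2 SUS scores (n = 5)

# Welch's t-test (does not assume equal variances)
t, p = stats.ttest_ind(v1, v2, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3f}")

# 95% confidence interval for each mean, based on the t distribution
for name, x in (("v1", v1), ("v2", v2)):
    lo, hi = stats.t.interval(0.95, len(x) - 1, loc=x.mean(), scale=stats.sem(x))
    print(f"{name}: mean = {x.mean():.1f}, 95% CI = ({lo:.1f}, {hi:.1f})")
```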



How could one make sense of these results?




Thank you all for your comments, answers, and time. Although there is only one accepted answer, all the input is helpful. It enabled me to take a sober look at the data and dial down the "jump-to-conclusions" factor.



A note for future archaeologists: the question was edited to include details and statistics mentioned in the comments. It might help to look at the edit history to see the starting point and understand how it ended up like this.










Tags: usability-testing, interview, sus, thinkaloud






asked May 11 at 13:23 by ralien (edited yesterday)







  • Besides what the other people have already said, it's quite obvious that the result of v2 is massively set off by the outlier 37.5. This is a common issue when applying qualitative methods to a sample that's too small: if even one person produces extreme values, it automatically shifts the whole distribution.

    – Big_Chair
    May 11 at 18:52











  • Indeed, Nielsen wrote about this matter too, stating "For now, my best conclusion is that slow outliers are caused by bad luck rather than by a persistent property of the users in question." So this is not uncommon; still, I can't just remove this result, as that would be akin to omitting "inconvenient evidence". What would be an objective way to present these results?

    – ralien
    May 11 at 19:17











  • I am not 100% sure, but one approach could be to omit (or put less focus on) the calculated average value and instead treat each result individually, in conjunction with their qualitative results. I.e. "P1's feedback was ... with an accompanying score of 72.5" and only show the average at the end.

    – Big_Chair
    May 11 at 21:45











  • Or present them normally as they are but show them in a diagram as well and put strong focus on the outlier, while also presenting an alternative average value without it. I'd go with this type of diagram over a bar chart since the difference is more noticeable (might also just be my personal preference).

    – Big_Chair
    May 11 at 21:48












  • It may be more instructive to look at the median, especially when dealing with small samples of humans. Here you have 67.5 vs 70 for the medians, which are essentially the same.

    – Chris H
    2 days ago






















5 Answers

































How can this discrepancy be reconciled?




You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)



Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.



In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.




How could one make sense of these results?




You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.



The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?



Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)
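
To put a rough number on that last question, here is a back-of-the-envelope power calculation (a normal-approximation sketch in Python, not tied to this particular study):

```python
# Approximate sample size per group for a two-sample t-test, using the normal
# approximation. d is the standardized effect size (difference between means
# divided by the standard deviation).
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_alpha = norm.ppf(1 - alpha / 2)    # two-sided test
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.2, 0.5, 0.8):                # Cohen's "small", "medium", "large"
    print(f"d = {d}: about {n_per_group(d):.0f} participants per group")
# d = 0.2 -> ~393, d = 0.5 -> ~63, d = 0.8 -> ~25
```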




What to do... ?




Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.



Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as they are a source of bias that can lead you astray. Redesign experiments to reduce bias.




... it seems the confidence intervals are so wide that the intermediate results I got should not be a reason for concern.




That is as "expected".

























  • I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?

    – ralien
    May 12 at 0:51











  • 3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias; that is why we have peer review, statistical tests, and forums such as this one to keep us in check. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide that the intermediate results I got should not be a reason for concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.

    – ralien
    May 12 at 1:13











  • I'd like to tap into your knowledge a bit more, as my use of "experiment" is more relaxed than yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?

    – ralien
    yesterday











  • 4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?

    – ralien
    yesterday











  • I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".

    – xiota
    yesterday
































How to reconcile the discrepancy? That I can't tell, but here is why it might have happened.



The "5 users will find everything wrong with your system" refers to usability problems test subjects will find in your usability tests. Sauro has a great article that goes quite deep in to this "5 users is enough" thingie.




The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%.




Jeff Sauro - Why you only need to test with five users (explained)



SUS, on the other hand, needs a bigger sample size to yield meaningful results. In their 2010 paper A Comparison of Questionnaires for Assessing Website Usability, Tullis & Stetson came to the conclusion that you need a sample size of 12 or more for SUS to yield a 100% accurate analysis of whatever you are analysing.




As one would expect, the accuracy of the analysis increases as the sample size gets larger. With a sample size of only 6, all of the questionnaires yield accuracy of only 30-40%, meaning that 60-70% of the time, at that sample size, you would fail to find a significant difference between the two sites.




So with a sample size of 5 you'll get something meaningful less than 30% of the time.




For example, SUS jumps up to about 75% accuracy at a sample size of 8, while the others stay down in the 40-55% range. It’s also interesting to note that most of the questionnaires appear to reach an asymptote at a sample size of 12.




[Figure: accuracy of the usability questionnaires as a function of sample size, from Tullis & Stetson]



Then again, what is an adequate sample size for qualitative analysis? I haven't done any qualitative analysis in the UX field, so I can't tell what a sufficient sample size would be. This is what I found on the internet:




Qualitative sample sizes should be large enough to obtain enough data to sufficiently describe the phenomenon of interest and address the research questions.




Oh, really.




For an ethnography, Morse (1994) suggested approximately 30 – 50 participants. For grounded theory, Morse (1994) suggested 30 – 50 interviews, while Creswell (1998) suggested only 20 – 30. For phenomenological studies, Creswell (1998) recommends 5 – 25 and Morse (1994) suggests at least six.




Qualitative Sample Size





























  • +1 Some really good references here. It's a pretty big call to say that SUS will yield a 100% accurate conclusion, since there is always some degree of uncertainty about what you are doing unless you can sample every single user. And even then they are not guaranteed to tell you exactly what they think...

    – Michael Lai
    May 12 at 21:58
































Look at your confidence intervals: the "real" score for v1 is somewhere between 58 and 88, while that score for v2 is somewhere between 51 and 79. Most notably, the confidence interval for each score contains the mean for the other score.



What this is telling you is that your sample size is too small. Based on the data you've collected so far, you cannot tell anything about the relative merits of the two versions of the user interface. There are various statistical tests you can use to quantify just how thoroughly you don't know anything, but checking for overlapping confidence intervals is fast and reasonably accurate.
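
A trivial sketch of that check in Python, using the interval endpoints quoted above (the means are taken as the interval midpoints, purely for illustration):

```python
# Overlap check on the intervals quoted above; means are the interval midpoints.
v1_ci, v2_ci = (58.0, 88.0), (51.0, 79.0)
v1_mean = sum(v1_ci) / 2                 # 73.0
v2_mean = sum(v2_ci) / 2                 # 65.0

intervals_overlap = v1_ci[0] <= v2_ci[1] and v2_ci[0] <= v1_ci[1]
v1_mean_in_v2_ci = v2_ci[0] <= v1_mean <= v2_ci[1]
v2_mean_in_v1_ci = v1_ci[0] <= v2_mean <= v1_ci[1]
print(intervals_overlap, v1_mean_in_v2_ci, v2_mean_in_v1_ci)   # True True True
```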


















































    I think the most important thing is to realize that using a combination of qualitative and quantitative analysis methods is going to give you a more complete picture of what the users are thinking and doing than relying on one or the other (which, as you can see, led to different conclusions). The idea is to use quantitative analysis when you have larger datasets to look for large trends or patterns, and to use qualitative methods to drill in on specific issues. It is also possible to combine various observations from qualitative methods and look for evidence of a general trend in the quantitative analysis.



    Something that I think is missing in the analysis (in addition to what has already been raised in the other answers) is that there is potentially other contextual information that you haven't used to split some of the responses.



    For example, I can imagine that there are different types of user groups mixed together in the analysis and that some of the users have a particular mental model or preferences that influence the more subjective elements of the SUS questionnaire.



    However, when you look at the quantitative data (which should probably also be based on some of the analytics collected while they are completing the task rather than just observations) you will probably find less bias or subjectivity as you build up the number of users/participants.
























    • Combining methods is fine if they are used appropriately. The problem is the OP is trying to use a screening methodology for confirmation. For subgroup analysis to be worthwhile, more participants are needed. With n=5, you're going to be looking at subgroups of n=1-2.

      – xiota
      2 days ago












    • @xiota I agree that the methodology chosen needs to be appropriate for the task. In general, the assessment of usability can be impacted by some well-known factors (e.g. computer literacy, familiarity with similar technology, etc.) and it would have been interesting to see if these things were taken into account, and whether that might help to explain some of the discrepancies or not.

      – Michael Lai
      2 days ago











    • It's "interesting" to speculate, but the underlying issue is n is far too small to "explain" anything. There might be a "reason" or it could be entirely random. That is what it means when the confidence intervals are so large. To support any explanations for the "discrepancies" would require running experiments with far more participants than the original "studies". It's just not worthwhile.

      – xiota
      2 days ago

































    At first glance, three scores are higher and two are lower. One outlier (voting 37 instead of, say, 90) would be enough for that. Not to mention that the very close totals mean there is not a lot of difference. It could be that the mood or the weather was different, or that your participants were.



    Though there might actually be another issue. If we take the results at face value (even if the sample size is too small), the newer version seems to lack the ability to make people love it.



    For instance, while tooltips can be helpful and actually make someone find an option faster, I personally find them rather annoying. They tend to distract unnecessarily, especially if the option would be quite easy to find without them. And they make one feel pushed in a certain direction, similar to bad advertising.



    Also, if you omitted information that would be expected, or dumbed things down too much, people would not always be happy about that, especially if they feel that usefulness is reduced. Some people might also dislike the lack of jargon if it belongs there. Or the new colors. And so on.



    As I don't know your UI in detail, I can't say more. But if you look at both versions side by side, you might see some of the annoying parts yourself and can try to find alternatives, such as small hover texts instead of tooltips one has to acknowledge.
























    • You're trying to explain a statistically insignificant difference. Even if you're correct, it's basically unverifiable guessing. The goal of iterative design is to make incremental improvements, not to prove one iteration better than another.

      – xiota
      2 days ago












    • I said all that already in my answer...

      – Carl Dombrowski
      2 days ago











    Your Answer








    StackExchange.ready(function()
    var channelOptions =
    tags: "".split(" "),
    id: "102"
    ;
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function()
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled)
    StackExchange.using("snippets", function()
    createEditor();
    );

    else
    createEditor();

    );

    function createEditor()
    StackExchange.prepareEditor(
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader:
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    ,
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    );



    );













    draft saved

    draft discarded


















    StackExchange.ready(
    function ()
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fux.stackexchange.com%2fquestions%2f125597%2fwhat-to-do-if-sus-scores-contradict-qualitative-feedback%23new-answer', 'question_page');

    );

    Post as a guest















    Required, but never shown

























    5 Answers
    5






    active

    oldest

    votes








    5 Answers
    5






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    20















    How can this discrepancy be reconciled?




    You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)



    Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.



    In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.




    How could one make sense of these results?




    You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.



    The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?



    Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)




    What to do... ?




    Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.



    Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.




    ... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.




    That is as "expected".






    share|improve this answer










    New contributor



    xiota is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.



















    • I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?

      – ralien
      May 12 at 0:51











    • 3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.

      – ralien
      May 12 at 1:13











    • I'd like to tap into your knowledge a bit more, as my use of "experiment" more relaxed, compared to yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?

      – ralien
      yesterday











    • 4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?

      – ralien
      yesterday











    • I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".

      – xiota
      yesterday















    20















    How can this discrepancy be reconciled?




    You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)



    Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.



    In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.




    How could one make sense of these results?




    You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.



    The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?



    Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)




    What to do... ?




    Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.



    Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.




    ... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.




    That is as "expected".






    share|improve this answer










    New contributor



    xiota is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
    Check out our Code of Conduct.



















    • I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?

      – ralien
      May 12 at 0:51











    • 3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.

      – ralien
      May 12 at 1:13











    • I'd like to tap into your knowledge a bit more, as my use of "experiment" more relaxed, compared to yours. 1) Have you got a favourite book about experiment design that you can recommend? [which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?

      – ralien
      yesterday











    • 4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?

      – ralien
      yesterday











    • I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".

      – xiota
      yesterday













    20












    20








    20








    How can this discrepancy be reconciled?




    You have divergent results because the number of participants is small and not representative. There is no randomization or blinding to prevent bias. You're also not calculating the relevant stats. (What are the standard deviation, margin of error, confidence intervals, odds ratios, p values, etc?)



    Further, you appear to be doing iterative design, not "experiments". There is nothing wrong with iterative design, but the data you collect are likely irrelevant beyond the current design. They cannot be used to meaningfully compare designs against each other. Even if they could, there aren't enough participants to measure the effect of small changes. But you don't need large numbers of users for iterative design. Just enough to identify improvements for the next iteration.



    In an experiment, you'd have multiple designs A/B/C... tested in parallel. Participants would be randomized to the designs (as well as task order). Experimenters would not know which design individual participants were using. Experimenters would not observe participants directly. Experimenters would pre-decide what statistical tests are appropriate. They would not begin processing data until after it had all been collected. Etc. If you were testing drugs, your methodology (as well as insufficient participants) would likely prevent FDA approval.




    How could one make sense of these results?




    You did a t-test and found no significant difference. The "study" is likely underpowered with only five subjects in each group. Even if you had enough numbers to demonstrate significance, the study needs to be redesigned, and the survey has to be checked for reliability and validity.



    The System Usability Scale (SUS) is described by its original developer as "quick and dirty". It appears to have been validated as a global assessment, but it's probably not appropriate for comparison. Imagine there were something known as Global Assessment of Functioning that physicians used to evaluate health. Is someone with condition A and GAF 85 "healthier" than someone with condition B and GAF of 80? Does it even make sense to compare A and B this way?



    Even if these problems were all addressed, you are still doing iterative design. I would expect differences between successive iterations to be non-significant. Suppose you were testing drugs. Would you expect significantly different results between 100mg and 101mg doses? What about 101mg and 102mg? Etc. (How massive would n need to be to detect such minute differences?)




    What to do... ?




    Understand that iterative design is not experimentation. The value of small usability reviews is to screen for problems, not confirm success or produce stats.



    Stop collecting (or "misusing") quantitative data when you know you won't have the numbers to demonstrate significance. Stop having "expectations", as it is a source of bias that can lead you astray. Redesign experiments to reduce bias.




    ... it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern.




    That is as "expected".






    answered May 11 at 16:16, edited May 12 at 20:53 – xiota

    • I updated the question with some details that I omitted initially: the task order was randomized, and the experiment is still in progress; the intention is to cover more participants in the last round, after considering the obtained feedback. I took some time to reflect on the intermediate results and posted my concerns here. 1. What do you mean by "blind" study in this case? Having the experiment conducted by different examiners? 2. The SUS is a standard survey, can I not trust its reliability and validity if I use it "as is"?

      – ralien
      May 12 at 0:51











    • 3. Expectations, how can we not have them? What's wrong with expecting that once a button is made bigger, participants will be more likely to notice it? There is bias, that is why we have peer review, statistical tests to keep us in check & forums such as this one. You pointed out that I didn't compute the relevant stats; after taking a few steps in that direction, it seems the confidence intervals are so wide, that the intermediate results I got should not be a reason of concern. I think having expectations is not an issue, as long as you're aware of the traps and take steps to avoid them.

      – ralien
      May 12 at 1:13











    • I'd like to tap into your knowledge a bit more, as my use of "experiment" is more relaxed than yours. 1) Have you got a favourite book about experiment design that you can recommend? [one which would treat it with your rigour] 2) Same thing, but about statistics? 3) As in your medicine example, the changes between iterations are small - would it then break the universe if one computed a rough score by putting 3 batches (15 participants) together? The confidence interval would be wide, but wouldn't that at least indicate if the UI "totally sucks", is "okayish" or is "wowamazingtakemymoney!"?

      – ralien
      yesterday











    • 4) As one of the references cited in the question says, some would report SUS results for N=5; another answer cites a paper that concludes N=12 is an acceptable sample size. Before I started the tests, I had the [overly optimistic] view that if the 1st round turns out well, I'd just continue with it until N=15. However, subsequent changes were made in the UI. So the reasonable course of action is to take the UI of the 3rd iteration and test it with 12 people (assuming Tullis and Stetson know what they're talking about). Is there something obviously wrong with this approach?

      – ralien
      yesterday











    • I know enough to be dangerous... I can read and evaluate papers to decide whether I should care about them. But running a study is much different than critiquing a study. It's like the difference between recognizing a good novel vs writing one. I'm guessing for usability studies, you don't have to go through IRB because if you did, you'd be forced to be more "rigorous".

      – xiota
      yesterday



















    How to reconcile the discrepancy? That I can't tell, but here is why it might have happened.


    The "5 users will find everything wrong with your system" claim refers to the usability problems test subjects will find in your usability tests. Sauro has a great article that goes quite deep into this "5 users is enough" thingie.




    The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%.




    Jeff Sauro - Why you only need to test with five users (explained)
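
    The arithmetic behind that quote is just the chance of seeing a problem at least once across n participants, 1 - (1 - p)^n. A quick sanity check with p = 0.31 as in the quote:

        # Probability that a problem affecting a fraction p of users shows up
        # at least once in a test with n participants.
        p = 0.31
        for n in (1, 3, 5, 8, 12):
            print(n, round(1 - (1 - p) ** n, 2))
        # n = 5 gives ~0.84, i.e. the "approximately 85% of the problems" figure.

    Note that this is about discovering problems, not about estimating a score, which is why "5 is enough" and "12 or more" can both be right.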



    SUS, on the other hand, needs a bigger sample size to yield meaningful results. In their 2004 paper A Comparison of Questionnaires for Assessing Website Usability, Tullis & Stetson came to the conclusion that you need a sample size of 12 or more for SUS to yield a 100% accurate analysis of whatever you are analysing.




    As one would expect, the accuracy of the analysis increases as the sample size gets larger. With a sample size of only 6, all of the questionnaires yield accuracy of only 30-40%, meaning that 60-70% of the time, at that sample size, you would fail to find a significant difference between the two sites.




    So with a sample size of 5 you'll get something meaningful less than 30% of the time.




    For example, SUS jumps up to about 75% accuracy at a sample size of 8, while the others stay down in the 40-55% range. It’s also interesting to note that most of the questionnaires appear to reach an asymptote at a sample size of 12.




    [Chart from Tullis & Stetson: accuracy of each questionnaire as a function of sample size]



    Then again, what is an adequate sample size for qualitative analysis? I haven't done any qualitative analysis in the UX field, so I can't tell what a sufficient sample size would be. This is what I found on the internet:




    Qualitative sample sizes should be large enough to obtain enough data to sufficiently describe the phenomenon of interest and address the research questions.




    Oh, really.




    For an ethnography, Morse (1994) suggested approximately 30 – 50 participants. For grounded theory, Morse (1994) suggested 30 – 50 interviews, while Creswell (1998) suggested only 20 – 30. For phenomenological studies, Creswell (1998) recommends 5 – 25 and Morse (1994) suggests at least six.




    Qualitative Sample Size






    answered May 11 at 15:55 – locationunknown


    • +1 Some really good references here. It's a pretty big call to say that SUS will yield a 100% accurate conclusion, since there is always some degree of uncertainty about what you are doing unless you can sample every single user. And even then they are not guaranteed to tell you exactly what they think...

      – Michael Lai
      May 12 at 21:58

















    Look at your confidence intervals: the "real" score for v1 is somewhere between 58 and 88, while that score for v2 is somewhere between 51 and 79. Most notably, the confidence interval for each score contains the mean for the other score.



    What this is telling you is that your sample size is too small. Based on the data you've collected so far, you cannot tell anything about the relative merits of the two versions of the user interface. There are various statistical tests you can use to quantify just how thoroughly you don't know anything, but checking for overlapping confidence intervals is fast and reasonably accurate.
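
    To see roughly how wide such intervals have to be at small sample sizes, here is a quick sketch (Python with scipy; the standard deviation of 17 is an assumption for illustration, not the value from your data):

        # Half-width of a 95% confidence interval for a mean SUS score,
        # assuming a sample standard deviation of about 17 (assumed, not measured).
        from scipy import stats

        sd = 17.0
        for n in (5, 10, 12, 20, 50):
            half = stats.t.ppf(0.975, n - 1) * sd / n ** 0.5
            print(f"n = {n:2d}: mean +/- {half:4.1f} SUS points")

    Under that assumption, at n = 5 the interval spans roughly 40 points; it takes around 20 participants before it narrows to about +/- 8.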






    answered May 12 at 4:32 – Mark














            I think the most important thing is to realize that using a combination of qualitative and quantitative analysis methods is going to give you the most complete picture of what the users are thinking and doing, rather than relying on one or the other (which, as you can see, led to different conclusions). The idea is to use quantitative analysis when you have larger datasets to look for large trends or patterns, and to use qualitative methods to drill in on the specific issues. It is also possible to combine various observations from qualitative methods and look for evidence/support of a general trend in the quantitative analysis.


            Something that I think is missing in the analysis (in addition to what has already been raised in the other answers) is that there is potentially other contextual information that you haven't used to split some of the responses.



            For example, I can imagine that there are different types of user groups mixed together in the analysis and that some of the users have a particular mental model or preferences that influence the more subjective elements of the SUS questionnaire.



            However, when you look at the quantitative data (which should probably also be based on some of the analytics collected while they are completing the task rather than just observations) you will probably find less bias or subjectivity as you build up the number of users/participants.
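
            As a small illustration of what that combination can look like in practice (all values and field names below are hypothetical), lining up per-participant task metrics, SUS scores and a grouping factor also shows how thin the slices get at n = 5:

                # Hypothetical per-participant records combining quantitative signals
                # (task completions, time on task) with the SUS score and a user group.
                participants = [
                    {"id": "P1", "group": "novice", "tasks_ok": 6, "minutes": 31, "sus": 55},
                    {"id": "P2", "group": "expert", "tasks_ok": 8, "minutes": 22, "sus": 90},
                    {"id": "P3", "group": "novice", "tasks_ok": 7, "minutes": 28, "sus": 68},
                    {"id": "P4", "group": "expert", "tasks_ok": 8, "minutes": 25, "sus": 82},
                    {"id": "P5", "group": "novice", "tasks_ok": 5, "minutes": 35, "sus": 60},
                ]

                by_group = {}
                for p in participants:
                    by_group.setdefault(p["group"], []).append(p["sus"])

                for group, scores in by_group.items():
                    print(group, "n =", len(scores), "mean SUS =", round(sum(scores) / len(scores), 1))
                # With five participants, each subgroup holds only two or three people,
                # so these means describe this sample, not the wider user population.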






            answered May 12 at 22:07 – Michael Lai

              Combining methods is fine if they are used appropriately. The problem is OP is trying to use a screening methodology for confirmation. For subgroup analysis to be worthwhile, more participants are needed. With n=5, you're going to be looking at subgroups of n=1-2.

              – xiota
              2 days ago












            • @xiota I agree that the methodology chosen needs to be appropriate for the task. In general the assessment of usability can be impacted by some well-known factors (e.g. computer literacy, familiarity with similar technology, etc.) and it would have been interesting to see if things were taken into account, and whether it might help to explain some of the discrepancies or not.

              – Michael Lai
              2 days ago











            • It's "interesting" to speculate, but the underlying issue is n is far too small to "explain" anything. There might be a "reason" or it could be entirely random. That is what it means when the confidence intervals are so large. To support any explanations for the "discrepancies" would require running experiments with far more participants than the original "studies". It's just not worthwhile.

              – xiota
              2 days ago






























            Three scores are higher and two are lower at first. One outlier (voting 37 instead of 90) would be enough for that. Not to mention that the very close totals mean there is not a lot of difference. It could simply be that the mood, the weather, or the participants were different.
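
            To make the outlier point concrete with made-up numbers (reusing the 37-versus-90 figure above):

                # With only five scores, one response swings the mean by a lot.
                from statistics import mean

                scores_if_90 = [70, 75, 80, 68, 90]     # hypothetical round, one enthusiast
                scores_if_37 = [70, 75, 80, 68, 37]     # same round, but that person rates 37

                print(mean(scores_if_90), mean(scores_if_37))   # 76.6 vs 66.0
                # A single participant moves the average by more than ten SUS points.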



            Though there might actually be another issue. If we take the results at face value (even if the sample size is too small), the newer version seems to lack the ability to make people love it.



            For instance, while tool tips can be helpful and actually make someone find an option faster, I personally find them very annoying, usually. They just distract unnecessarily, especially if the option would be quite easy to find without it. And they make one feel pushed in a certain direction - similar to bad advertisement.



            Also, if you omitted information which would be expected, or dumbed down things too much, people would not always be happy about that - especially if they feel that usefulness is reduced. Some people might also dislike the lack of jargon if it belongs there. Or the new colors. And so on.



            As I don't know your UI in detail, I can't say more. But if you look at both versions side by side, you might see some of the annoying parts yourself, and can try to find alternatives - small hover texts instead of tool tips one has to acknowledge, for instance.






            answered May 12 at 22:58 – Carl Dombrowski

              You're trying to explain a statistically insignificant difference. Even if you're correct, it's basically unverifiable guessing. The goal of iterative design is to make incremental improvements, not prove one iteration better than another.

              – xiota
              2 days ago












            • I said all that already in my answer...

              – Carl Dombrowski
              2 days ago














