Interpretation of non-significant results as “trends”
Recently, two different co-workers have used a kind of argument about differences between conditions that seems incorrect to me. Both of these co-workers use statistics, but they are not statisticians. I am a novice in statistics.
In both cases, I argued that, because there was no significant difference between two conditions in an experiment, it was incorrect to make a general claim about these groups with regard to the manipulation. Note that "making a general claim" means something like writing: "Group A used X more often than group B".
My co-workers retorted with: "even though there is no significant difference, the trend is still there" and "even though there is no significant difference, there is still a difference". To me, both of these sound like an equivocation, i.e., they changed the meaning of "difference" from: "a difference that is likely to be the result of something other than chance" (i.e., statistical significance), to "any non-zero difference in measurement between groups".
Was the response of my co-workers correct? I did not take it up with them because they outrank me.
statistical-significance
asked Jul 5 at 6:59 – amdex
I found these articles helpful: Still Not Significant and Marginally Significant.
– user20637, Jul 6 at 19:03
5 Answers
This is a great question; the answer depends a lot on context.
In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like
in our experiment group A used X more often than group B, but we're very uncertain how this will play out in the general population
or
although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B
or
group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p=0.75)
On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you can't say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.
If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, in the absence of any other information or differences in costs etc., then choosing A would be your best bet. But if you wanted to be comfortable that you were probably making the right choice, you would need more information.
Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) nor of the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.
– Ben Bolker, answered Jul 5 at 8:11 (edited Jul 5 at 23:00)
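To make the last reporting style concrete, here is a minimal sketch (with made-up counts, not data from the question) of estimating a difference in usage rates together with a 95% confidence interval and a p-value, so that the reader sees the whole plausible range rather than a bare significant/non-significant verdict:

```python
import numpy as np
from scipy import stats

# Illustrative counts (assumptions, not data from the question):
# group A used X in 45 of 100 trials, group B in 32 of 100.
used_a, n_a = 45, 100
used_b, n_b = 32, 100

p_a, p_b = used_a / n_a, used_b / n_b
diff = p_a - p_b                                  # observed difference in usage rates

# Wald standard error of the difference of two independent proportions
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = stats.norm.ppf(0.975)                         # ~1.96 for a 95% interval
ci_low, ci_high = diff - z * se, diff + z * se

# Two-sided p-value for H0: equal population proportions (pooled z-test)
p_pool = (used_a + used_b) / (n_a + n_b)
se0 = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
p_value = 2 * stats.norm.sf(abs(diff) / se0)

print(f"difference = {diff:.1%}, 95% CI [{ci_low:.1%}, {ci_high:.1%}], p = {p_value:.3f}")
```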
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
– Isabella Ghement, Jul 5 at 15:34
Thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...).
– Ben Bolker, Jul 5 at 23:00
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what's going on with the population!
– Dave, Jul 5 at 23:51
@Dave: even in the presence of "statistical evidence" (a statistically significant p-value?), "the observed differences may very well be the opposite of what's going on with the population".
– boscovich, Jul 7 at 17:57
@boscovich Sure, I was talking in absolutes when we're doing statistics, but I think of an insignificant p-value as meaning that you really haven't a clue what's happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But it is definitely possible to get a significant p-value that misidentifies the direction; that error should happen from time to time.
– Dave, Jul 7 at 18:05
That's a tough question!
First things first, any threshold you choose to determine statistical significance is arbitrary. The fact that most people use a $5\%$ cutoff does not make it more correct than any other. So, in some sense, you should think of statistical significance as a spectrum rather than a black-or-white matter.
Let's assume we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean for variable $Y$ is below 5). You can think of the null hypothesis as the "no trend" hypothesis. We gather some data to check whether we can disprove $H_0$ (the null hypothesis is never "proved true"). From our sample we compute some statistics and eventually get a $p$-value. Put shortly, the $p$-value is the probability that pure chance would produce results at least as extreme as those we got, assuming of course that $H_0$ is true (i.e., no trend).
If we get a "low" $p$-value, we say that chance rarely produces results like those, and therefore we reject $H_0$ (there is statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are compatible with chance alone rather than with an actual trend. We don't say $H_0$ is true, but rather that further study should take place before we can reject it.
WARNING: A $p$-value of $23\%$ does not mean that there is a $23\%$ chance of there being no trend, but rather that chance generates results like those $23\%$ of the time, which sounds similar but is a completely different thing. For example, if I claim something ridiculous, like "I can predict the results of rolling dice an hour before they take place," we run an experiment to check the null hypothesis $H_0:=$ "I cannot do such a thing," and we get a $0.5\%$ $p$-value, you would still have good reason not to believe me, despite the statistical significance.
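That reading of the $p$-value can be checked directly by simulation. The sketch below uses entirely made-up two-group data and a permutation test: the estimated $p$-value is simply the fraction of label-shuffled datasets whose group difference is at least as extreme as the observed one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up data: two groups of 30 measurements with a small apparent difference.
group_a = rng.normal(0.3, 1.0, size=30)
group_b = rng.normal(0.0, 1.0, size=30)
observed = abs(group_a.mean() - group_b.mean())

# Under H0 (no trend) the group labels are exchangeable, so reshuffle the labels
# many times and count how often chance alone produces a difference at least as
# extreme as the one we observed: that fraction estimates the p-value.
pooled = np.concatenate([group_a, group_b])
n_sims = 10_000
count = 0
for _ in range(n_sims):
    rng.shuffle(pooled)
    count += abs(pooled[:30].mean() - pooled[30:].mean()) >= observed

print(f"estimated p-value: {count / n_sims:.3f}")
```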
So, with these ideas in mind, let's go back to your main question. Let's say we want to check whether increasing the dose of drug $X$ has an effect on the likelihood that patients survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance of the coefficient associated with the "dose" variable (calling that coefficient $\beta$, we'd test the null hypothesis $H_0: \beta=0$ or maybe $H_0: \beta \leq 0$; in English, "the drug has no effect" or "the drug has either no effect or a negative effect").
The experiment yields a positive estimate of $\beta$, but the $p$-value for the test of $\beta=0$ is 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all the experiments we run would show "trends," even when testing for the most ridiculous things.
So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is done. Indeed, I would say that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4\%$ probability of generating those results? This is why research replication and peer review are critical.
I hope this too-wordy explanation helps you sort out your ideas. The summary is that you are absolutely right! We shouldn't fill our reports, whether for research, business, or anything else, with wild claims supported by little evidence. If you really think there is a trend but you didn't reach statistical significance, then repeat the experiment with more data!
– David, answered Jul 5 at 7:58 (edited Jul 5 at 18:55 by Mihai Chelaru)
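As a concrete (and entirely hypothetical) version of the drug-dose example above, here is a sketch using statsmodels' logistic regression with simulated data in which the dose truly has no effect; the coefficient estimate and its p-value play the roles of $\beta$ and the 0.79 in the answer.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: dose has NO real effect on survival (true beta = 0),
# so any fitted slope is pure noise.
n = 200
dose = rng.uniform(0, 10, size=n)
survived = rng.binomial(1, 0.6, size=n)          # survival unrelated to dose

X = sm.add_constant(dose)                        # intercept + dose column
fit = sm.Logit(survived, X).fit(disp=0)

print(f"beta_hat = {fit.params[1]:.3f}, p-value = {fit.pvalues[1]:.2f}")
# Whatever sign beta_hat happens to take, a large p-value means the data cannot
# distinguish it from zero; calling such a slope a "trend" overstates the evidence.
```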
+1 for pointing out that any significance threshold is arbitrary (and by implication it is not possible to infer absolute claims about the general population from the results in a sample -- all you get are better probabilities).
– Peter A. Schneider, Jul 7 at 10:19
A significant effect just means that you measured an unlikely anomaly (unlikely if the null hypothesis, the absence of an effect, were true). As a consequence the null hypothesis must be doubted with high probability (although this probability is not equal to the p-value and also depends on prior beliefs).
Depending on the quality of the experiment, you could measure the same effect size, yet it might not be an anomaly (not an unlikely result if the null hypothesis were true).
When you observe an effect but it is not significant, then indeed the effect can still be there; it is only not significant (the measurements do not indicate that the null hypothesis should be doubted/rejected with high probability). It means that you should improve your experiment and gather more data to be more sure.
So instead of the dichotomy effect versus no effect, you should consider the following four categories:
[Image from https://en.wikipedia.org/wiki/Equivalence_test explaining the two one-sided t-tests (TOST) procedure]
You seem to be in category D: the test is inconclusive. Your coworkers might be wrong to say that there is an effect. However, it is equally wrong to say that there is no effect!
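Here is a minimal sketch of the TOST idea, assuming scipy 1.6+ for the `alternative` argument and using made-up data and equivalence bounds; it implements the two one-sided tests by shifting one group by each bound.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(0.1, 1.0, size=40)          # made-up data, tiny true difference
group_b = rng.normal(0.0, 1.0, size=40)

# Equivalence bounds: differences inside (-0.5, 0.5) are treated as negligible.
low, high = -0.5, 0.5

# Two one-sided tests (TOST):
#   test 1: H0: mean(A) - mean(B) <= low   (reject if the difference is above the lower bound)
#   test 2: H0: mean(A) - mean(B) >= high  (reject if the difference is below the upper bound)
p_lower = stats.ttest_ind(group_a - low, group_b, alternative="greater").pvalue
p_upper = stats.ttest_ind(group_a - high, group_b, alternative="less").pvalue
p_tost = max(p_lower, p_upper)

print(f"difference = {group_a.mean() - group_b.mean():.2f}, TOST p = {p_tost:.3f}")
# Small TOST p: evidence that the true difference lies inside the bounds.
# Large TOST p together with a non-significant ordinary t-test: inconclusive,
# i.e. category D in the figure referenced above.
```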
"Significant effect just means that you measured the null hypothesis (absence of effect) must be doubted with high probability." I strongly disagree with this statement. What if I told you I can predict the result of any coin flip, we make an experiment, and out of pure luck we get a 1% $p$-value? Would you say there is a high probability of the null hypothesis being false?
– David, Jul 5 at 8:28
@David, I completely agree with you that the p-value is more precisely a measure of 'the probability that we make an error, conditional on the null hypothesis being true' (or the probability of seeing such extreme results), and it does not directly express 'the probability that the null hypothesis is wrong'. However, I feel that the p-value is not meant to be used in this 'official' sense. The p-value is used to express doubt in the null hypothesis, to express that the results indicate an anomaly, and anomalies should make us doubt the null....
– Martijn Weterings, Jul 5 at 16:09
....in your case, when you challenge the null effect (challenge the idea that one cannot predict the coins) by providing a rare case (just like the tea-tasting lady), then we should indeed have doubt in the null hypothesis. In practice we would need to set an appropriate p-value threshold for this (since indeed one might challenge the null by mere chance), and I would not use the 1% level. The high probability to doubt the null should not be equated, one-to-one, with the p-value (since that probability is more of a Bayesian concept).
– Martijn Weterings, Jul 5 at 16:16
I have adapted the text to take away this misinterpretation.
– Martijn Weterings, Jul 5 at 16:19
It sounds like they're arguing the p-value versus the everyday meaning of a "trend".
If you plot the data on a run chart, you may see a trend: a run of points moving up or down over time.
But when you do the statistics on it, the p-value suggests it's not significant.
For the p-value to show little significance while they still see a trend or run in the series of data, it would have to be a very slight trend.
If that is the case, I would fall back on the p-value, i.e.: OK, yes, there's a trend or run in the data, but it's so slight and insignificant that the statistics suggest it's not worth pursuing with further analysis.
An insignificant trend may be attributable to some kind of bias in the research, maybe something very minor, something that may just be a one-time occurrence in the experiment that happened to create a slight trend.
If I were the manager of the group, I would tell them to stop wasting time and money digging into insignificant trends and to look for more significant ones.
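If "trend" is taken to mean a slope over time on the run chart, whether that slope is distinguishable from noise can be checked directly; a minimal sketch with made-up data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Made-up run chart: 20 time points with a very slight upward drift buried in noise.
t = np.arange(20)
y = 0.02 * t + rng.normal(0, 1.0, size=20)

fit = stats.linregress(t, y)
print(f"slope = {fit.slope:.3f}, p-value = {fit.pvalue:.2f}")
# A slope you can 'see' on the chart but with a large p-value is exactly the
# slight, possibly chance-driven run described above.
```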
It sounds like in this case they have little justification for their claim and are just abusing statistics to reach the conclusion they already had. But there are times when it's OK not to be so strict with p-value cutoffs. This question (how to use statistical significance and p-value cutoffs) is a debate that has been raging since Fisher, Neyman, and Pearson first laid the foundations of statistical testing.
Let's say you are building a model and you are deciding which variables to include. You gather a little bit of data to do some preliminary investigation into potential variables. Now there's one variable that the business team is really interested in, but your preliminary investigation shows that the variable is not statistically significant. However, the 'direction' of the variable comports with what the business team expected, and although it didn't meet the threshold for significance, it was close. Perhaps it was suspected to have a positive correlation with the outcome, and you got a beta coefficient that was positive but a p-value just a bit above the .05 cutoff.
In that case, you might go ahead and include it. It's a sort of informal Bayesian reasoning -- there was a strong prior belief that it is a useful variable, and the initial investigation showed some evidence in that direction (but not statistically significant evidence!), so you give it the benefit of the doubt and keep it in the model. Perhaps with more data it will become more evident what relationship it has with the outcome of interest.
Another example might be where you are building a new model and you look at the variables that were used in the previous model -- you might continue to include a marginal variable (one that is on the cusp of significance) to maintain some continuity from model to model.
Basically, depending on what you are doing, there are reasons to be more or less strict about these sorts of things.
On the other hand, it's also important to keep in mind that statistical significance does not have to imply practical significance! Remember that at the heart of all this is sample size. Collect enough data and the standard error of the estimate will shrink toward 0. This will make any difference, no matter how small, 'statistically significant', even if that difference might not amount to anything in the real world. For example, suppose the probability of a particular coin landing on heads were .500000000000001. This means that theoretically you could design an experiment which concludes that the coin is not fair, but for all intents and purposes the coin could be treated as fair.
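To illustrate the sample-size point with a less extreme (still hypothetical) bias, the sketch below shows how the expected z-statistic of a one-proportion test grows with the number of flips for a coin whose heads probability is 0.501:

```python
import math

p_true = 0.501        # hypothetical, barely-unfair coin (a less extreme stand-in
p_null = 0.5          # for the .500000000000001 in the answer)

for n in (1_000, 100_000, 1_000_000, 100_000_000):
    # Expected z-statistic of a one-proportion z-test if we observed exactly the
    # true rate: z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    z = (p_true - p_null) / math.sqrt(p_null * (1 - p_null) / n)
    print(f"n = {n:>11,}: expected z = {z:6.2f}")
# Around a million flips the expected z crosses 1.96, so the test will typically
# call the coin 'unfair' even though the bias is practically irrelevant.
```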
$begingroup$
This is a great question; the answer depends a lot on context.
In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like
in our experiment group A used X more often than group B, but we're very uncertain how this will play out in the general population
or
although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B
or
group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p=0.75)
On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you can't say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.
If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, in the absence of any other information or differences in costs etc., then choosing A would be your best bet. But if you wanted be comfortable that you were probably making the right choice, you would need more information.
Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) or in the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.
$endgroup$
4
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
3
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
7
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
|
show 1 more comment
$begingroup$
This is a great question; the answer depends a lot on context.
In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like
in our experiment group A used X more often than group B, but we're very uncertain how this will play out in the general population
or
although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B
or
group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p=0.75)
On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you can't say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.
If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, in the absence of any other information or differences in costs etc., then choosing A would be your best bet. But if you wanted be comfortable that you were probably making the right choice, you would need more information.
Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) or in the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.
$endgroup$
4
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
3
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
7
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
|
show 1 more comment
$begingroup$
This is a great question; the answer depends a lot on context.
In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like
in our experiment group A used X more often than group B, but we're very uncertain how this will play out in the general population
or
although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B
or
group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p=0.75)
On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you can't say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.
If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, in the absence of any other information or differences in costs etc., then choosing A would be your best bet. But if you wanted be comfortable that you were probably making the right choice, you would need more information.
Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) or in the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.
$endgroup$
This is a great question; the answer depends a lot on context.
In general I would say you are right: making an unqualified general claim like "group A used X more often than group B" is misleading. It would be better to say something like
in our experiment group A used X more often than group B, but we're very uncertain how this will play out in the general population
or
although group A used X 13% more often than group B in our experiment, our estimate of the difference in the general population is not clear: the plausible values range from A using X 5% less often than group B to A using X 21% more often than group B
or
group A used X 13% more often than group B, but the difference was not statistically significant (95% CI -5% to 21%; p=0.75)
On the other hand: your co-workers are right that in this particular experiment, group A used X more often than group B. However, people rarely care about the participants in a particular experiment; they want to know how your results will generalize to a larger population, and in this case the general answer is that you can't say with confidence whether a randomly selected group A will use X more or less often than a randomly selected group B.
If you needed to make a choice today about whether to use treatment A or treatment B to increase the usage of X, in the absence of any other information or differences in costs etc., then choosing A would be your best bet. But if you wanted be comfortable that you were probably making the right choice, you would need more information.
Note that you should not say "there is no difference between group A and group B in their usage of X", or "group A and group B use X the same amount". This is true neither of the participants in your experiment (where A used X 13% more) or in the general population; in most real-world contexts, you know that there must really be some effect (no matter how slight) of A vs. B; you just don't know which direction it goes.
edited Jul 5 at 23:00
answered Jul 5 at 8:11
Ben BolkerBen Bolker
25k2 gold badges67 silver badges96 bronze badges
25k2 gold badges67 silver badges96 bronze badges
4
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
3
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
7
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
|
show 1 more comment
4
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
3
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
7
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
4
4
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
$begingroup$
Beautiful response, Ben! I wonder if your second example statement could be modified for clarity to reflect the gist of the first example statement: "although group A used X 13% more often than group B IN OUR EXPERIMENT, the difference IN USAGE OF X BETWEEN GROUPS IN THE GENERAL POPULATION was not clear: the plausible range OF THAT DIFFERENCE went from A using X 5% less often than group B to A using X 21% more often than group B."
$endgroup$
– Isabella Ghement
Jul 5 at 15:34
3
3
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
$begingroup$
thanks, partially incorporated (trying to balance brevity/clarity and accuracy ...)
$endgroup$
– Ben Bolker
Jul 5 at 23:00
7
7
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
+1 I think many people fail to realize that in the absence of statistical evidence, the observed differences may very well be the opposite of what’s going on with the population!
$endgroup$
– Dave
Jul 5 at 23:51
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@Dave: even if the presence of "statistical evidence" (statistically significant p-value?), "the observed differences may very well be the opposite of what’s going on with the population"
$endgroup$
– boscovich
Jul 7 at 17:57
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
$begingroup$
@boscovich Sure, I was talking in absolutes when we’re doing statistics, but I think of it as an insignificant p-value meaning that you really haven’t a clue what’s happening with the population. At least with a significant p-value you have reached some established threshold of evidence to suggest that you know something. But definitely it’s possible to get a significant p-value when it’s misidentified the direction. That error should happen from time to time.
$endgroup$
– Dave
Jul 7 at 18:05
|
show 1 more comment
$begingroup$
That's a tough question!
First things first, any threshold you may choose to determine statistical significance is arbitrary. The fact that most people use a $5%$ $p$-value does not make it more correct than any other. So, in some sense, you should think of statistical significance as a "spectrum" rather than a black-or-white subject.
Let's assume we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean for variable $Y$ is below 5). You can think of the null hypothesis as the "no trend" hypothesis. We gather some data to check whether we can disprove $H_0$ (the null hypothesis is never "proved true"). With our sample, we make some statistics and eventually get a $p$-value. Put shortly, the $p$-value is the probability that pure chance would produce results equally (or more) extreme than those we got, assuming of course $H_0$ to be true (i.e., no trend).
If we get a "low" $p$-value, we say that chance rarely produces results as those, therefore we reject $H_0$ (there's statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are more likely to be a result of luck, rather than actual trend. We don't say $H_0$ is true, but rather, that further studying should take place in order to reject it.
WARNING: A $p$-value of $23%$ does not mean that there is a $23%$ chance of there not being any trend, but rather, that chance generates results as those $23%$ of the time, which sounds similar, but is a completely different thing. For example, if I claim something ridiculous, like "I can predict results of rolling dice an hour before they take place," we make an experiment to check the null hypothesis $H_0:=$"I cannot do such thing" and get a $0.5%$ $p-$value, you would still have good reason not to believe me, despite the statistical significance.
So, with these ideas in mind, let's go back to your main question. Let's say we want to check if increasing the dose of drug $X$ has an effect on the likelihood of patients that survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance on the coefficient associated with the "dose" variable (calling that coefficient $beta$, we'd test a null hypothesis $H_0:$ $beta=0$ or maybe, $beta leq 0$. In English, "the drug has no effect" or "the drug has either no or negative effect."
The results of the experiment throw a positive beta, but the test $beta=0$ stays at 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all experiments we make would show "trends," even when testing for the most ridiculous things.
So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is made. Indeed, my say would be that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4%$ of generating those results? This is why research replication and peer-reviewing is critical.
I hope this too-wordy explanation helps you sort your ideas. The summary is that you are absolutely right! We shouldn't fill our reports, whether it's for research, business, or whatever, with wild claims supported by little evidence. If you really think there is a trend, but you didn't reach statistical significance, then repeat the experiment with more data!
$endgroup$
1
$begingroup$
+1 for pointing out that any significance threshold is arbitrary (and by implication it is not possible to infer absolute claims about the general population from the results in a sample -- all you get are better probabilities).
$endgroup$
– Peter A. Schneider
Jul 7 at 10:19
add a comment |
$begingroup$
That's a tough question!
First things first, any threshold you may choose to determine statistical significance is arbitrary. The fact that most people use a $5%$ $p$-value does not make it more correct than any other. So, in some sense, you should think of statistical significance as a "spectrum" rather than a black-or-white subject.
Let's assume we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean for variable $Y$ is below 5). You can think of the null hypothesis as the "no trend" hypothesis. We gather some data to check whether we can disprove $H_0$ (the null hypothesis is never "proved true"). With our sample, we make some statistics and eventually get a $p$-value. Put shortly, the $p$-value is the probability that pure chance would produce results equally (or more) extreme than those we got, assuming of course $H_0$ to be true (i.e., no trend).
If we get a "low" $p$-value, we say that chance rarely produces results as those, therefore we reject $H_0$ (there's statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are more likely to be a result of luck, rather than actual trend. We don't say $H_0$ is true, but rather, that further studying should take place in order to reject it.
WARNING: A $p$-value of $23%$ does not mean that there is a $23%$ chance of there not being any trend, but rather, that chance generates results as those $23%$ of the time, which sounds similar, but is a completely different thing. For example, if I claim something ridiculous, like "I can predict results of rolling dice an hour before they take place," we make an experiment to check the null hypothesis $H_0:=$"I cannot do such thing" and get a $0.5%$ $p-$value, you would still have good reason not to believe me, despite the statistical significance.
So, with these ideas in mind, let's go back to your main question. Let's say we want to check if increasing the dose of drug $X$ has an effect on the likelihood of patients that survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance on the coefficient associated with the "dose" variable (calling that coefficient $beta$, we'd test a null hypothesis $H_0:$ $beta=0$ or maybe, $beta leq 0$. In English, "the drug has no effect" or "the drug has either no or negative effect."
The results of the experiment throw a positive beta, but the test $beta=0$ stays at 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all experiments we make would show "trends," even when testing for the most ridiculous things.
So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is made. Indeed, my say would be that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4%$ of generating those results? This is why research replication and peer-reviewing is critical.
I hope this too-wordy explanation helps you sort your ideas. The summary is that you are absolutely right! We shouldn't fill our reports, whether it's for research, business, or whatever, with wild claims supported by little evidence. If you really think there is a trend, but you didn't reach statistical significance, then repeat the experiment with more data!
$endgroup$
1
$begingroup$
+1 for pointing out that any significance threshold is arbitrary (and by implication it is not possible to infer absolute claims about the general population from the results in a sample -- all you get are better probabilities).
$endgroup$
– Peter A. Schneider
Jul 7 at 10:19
add a comment |
$begingroup$
That's a tough question!
First things first, any threshold you may choose to determine statistical significance is arbitrary. The fact that most people use a $5%$ $p$-value does not make it more correct than any other. So, in some sense, you should think of statistical significance as a "spectrum" rather than a black-or-white subject.
Let's assume we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean for variable $Y$ is below 5). You can think of the null hypothesis as the "no trend" hypothesis. We gather some data to check whether we can disprove $H_0$ (the null hypothesis is never "proved true"). With our sample, we make some statistics and eventually get a $p$-value. Put shortly, the $p$-value is the probability that pure chance would produce results equally (or more) extreme than those we got, assuming of course $H_0$ to be true (i.e., no trend).
If we get a "low" $p$-value, we say that chance rarely produces results as those, therefore we reject $H_0$ (there's statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are more likely to be a result of luck, rather than actual trend. We don't say $H_0$ is true, but rather, that further studying should take place in order to reject it.
WARNING: A $p$-value of $23%$ does not mean that there is a $23%$ chance of there not being any trend, but rather, that chance generates results as those $23%$ of the time, which sounds similar, but is a completely different thing. For example, if I claim something ridiculous, like "I can predict results of rolling dice an hour before they take place," we make an experiment to check the null hypothesis $H_0:=$"I cannot do such thing" and get a $0.5%$ $p-$value, you would still have good reason not to believe me, despite the statistical significance.
So, with these ideas in mind, let's go back to your main question. Let's say we want to check if increasing the dose of drug $X$ has an effect on the likelihood of patients that survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance on the coefficient associated with the "dose" variable (calling that coefficient $beta$, we'd test a null hypothesis $H_0:$ $beta=0$ or maybe, $beta leq 0$. In English, "the drug has no effect" or "the drug has either no or negative effect."
The results of the experiment throw a positive beta, but the test $beta=0$ stays at 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all experiments we make would show "trends," even when testing for the most ridiculous things.
So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is made. Indeed, my say would be that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4%$ of generating those results? This is why research replication and peer-reviewing is critical.
I hope this too-wordy explanation helps you sort your ideas. The summary is that you are absolutely right! We shouldn't fill our reports, whether it's for research, business, or whatever, with wild claims supported by little evidence. If you really think there is a trend, but you didn't reach statistical significance, then repeat the experiment with more data!
$endgroup$
That's a tough question!
First things first, any threshold you may choose to determine statistical significance is arbitrary. The fact that most people use a $5%$ $p$-value does not make it more correct than any other. So, in some sense, you should think of statistical significance as a "spectrum" rather than a black-or-white subject.
Let's assume we have a null hypothesis $H_0$ (for example, groups $A$ and $B$ show the same mean for variable $X$, or the population mean for variable $Y$ is below 5). You can think of the null hypothesis as the "no trend" hypothesis. We gather some data to check whether we can disprove $H_0$ (the null hypothesis is never "proved true"). With our sample, we make some statistics and eventually get a $p$-value. Put shortly, the $p$-value is the probability that pure chance would produce results equally (or more) extreme than those we got, assuming of course $H_0$ to be true (i.e., no trend).
If we get a "low" $p$-value, we say that chance rarely produces results as those, therefore we reject $H_0$ (there's statistically significant evidence that $H_0$ could be false). If we get a "high" $p$-value, then the results are more likely to be a result of luck, rather than actual trend. We don't say $H_0$ is true, but rather, that further studying should take place in order to reject it.
WARNING: A $p$-value of $23%$ does not mean that there is a $23%$ chance of there not being any trend, but rather, that chance generates results as those $23%$ of the time, which sounds similar, but is a completely different thing. For example, if I claim something ridiculous, like "I can predict results of rolling dice an hour before they take place," we make an experiment to check the null hypothesis $H_0:=$"I cannot do such thing" and get a $0.5%$ $p-$value, you would still have good reason not to believe me, despite the statistical significance.
So, with these ideas in mind, let's go back to your main question. Let's say we want to check if increasing the dose of drug $X$ has an effect on the likelihood of patients that survive a certain disease. We perform an experiment, fit a logistic regression model (taking into account many other variables) and check for significance on the coefficient associated with the "dose" variable (calling that coefficient $beta$, we'd test a null hypothesis $H_0:$ $beta=0$ or maybe, $beta leq 0$. In English, "the drug has no effect" or "the drug has either no or negative effect."
The results of the experiment throw a positive beta, but the test $beta=0$ stays at 0.79. Can we say there is a trend? Well, that would really diminish the meaning of "trend". If we accept that kind of thing, basically half of all experiments we make would show "trends," even when testing for the most ridiculous things.
So, in conclusion, I think it is dishonest to claim that our drug makes any difference. What we should say, instead, is that our drug should not be put into production unless further testing is made. Indeed, my say would be that we should still be careful about the claims we make even when statistical significance is reached. Would you take that drug if chance had a $4%$ of generating those results? This is why research replication and peer-reviewing is critical.
I hope this somewhat wordy explanation helps you sort out your ideas. The summary is that you are absolutely right: we shouldn't fill our reports, whether for research, business, or anything else, with wild claims supported by little evidence. If you really think there is a trend but you didn't reach statistical significance, then repeat the experiment with more data!
edited Jul 5 at 18:55
Mihai Chelaru
3012 silver badges10 bronze badges
answered Jul 5 at 7:58
David
$begingroup$
+1 for pointing out that any significance threshold is arbitrary (and by implication it is not possible to infer absolute claims about the general population from the results in a sample -- all you get are better probabilities).
$endgroup$
– Peter A. Schneider
Jul 7 at 10:19
$begingroup$
A significant effect just means that you measured an unlikely anomaly (unlikely if the null hypothesis, the absence of an effect, were true). As a consequence, the null hypothesis should be doubted with high probability (although this probability is not equal to the p-value and also depends on prior beliefs).
Depending on the quality of the experiment, you could measure the same effect size and yet it might not be an anomaly (not an unlikely result if the null hypothesis were true).
When you observe an effect but it is not significant, the effect can indeed still be there; it is simply not significant (the measurements do not indicate that the null hypothesis should be doubted or rejected with high probability). It means you should improve your experiment and gather more data to be more sure.
So instead of the dichotomy of effect versus no effect, you should consider the following four categories:
[Image from https://en.wikipedia.org/wiki/Equivalence_test explaining the two one-sided t-tests (TOST) procedure]
You seem to be in category D: the test is inconclusive. Your coworkers might be wrong to say that there is an effect, but it is equally wrong to say that there is no effect!
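To actually support a claim of "no relevant effect" rather than merely "no significant effect", one would run an equivalence test such as TOST. Here is a minimal sketch with invented data and an equivalence margin chosen purely for illustration:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    group_a = rng.normal(10.0, 2.0, 50)   # hypothetical measurements
    group_b = rng.normal(10.1, 2.0, 50)

    delta = 0.5  # equivalence margin: differences smaller than this count as irrelevant

    # Two one-sided t-tests (TOST): is the true difference above -delta AND below +delta?
    p_lower = stats.ttest_ind(group_a + delta, group_b, alternative="greater").pvalue
    p_upper = stats.ttest_ind(group_a - delta, group_b, alternative="less").pvalue
    p_tost = max(p_lower, p_upper)
    print(p_lower, p_upper, p_tost)
    # Small p_tost: evidence the true difference lies within +/- delta (equivalence).
    # Large p_tost together with a non-significant ordinary t-test: the inconclusive case (category D).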
$endgroup$
edited Jul 5 at 16:23
answered Jul 5 at 8:21
Martijn Weterings
15.1k 22 silver badges 65 bronze badges
$begingroup$
"Significant effect just means that you measured the null hypothesis (absence of effect) must be doubted with high probability." I strongly disagree with this statement. What if I told you I can predict the result of any coin flip, we make an experiment, and out of pure luck we get a 1% $p$-value? Would you say there is a high probability of the null hypothesis being false?
$endgroup$
– David
Jul 5 at 8:28
$begingroup$
@David, I completely agree with you that the p-value is more precisely a measure of 'the probability that we make an error, conditional on the null hypothesis being true' (or the probability of seeing such extreme results), and it does not directly express 'the probability that the null hypothesis is wrong'. However, I feel that the p-value is not meant to be used in this 'official' sense. The p-value is used to express doubt in the null hypothesis, to express that the results indicate an anomaly, and anomalies should make us doubt the null....
$endgroup$
– Martijn Weterings
Jul 5 at 16:09
$begingroup$
....in your case, when you challenge the null effect (the idea that one cannot predict the coins) by providing a rare result (just like the tea-tasting lady), then we should indeed doubt the null hypothesis. In practice we would need to set an appropriate p-value threshold for this (since indeed one might challenge the null by mere chance), and I would not use the 1% level. The high probability with which we doubt the null should not be equated, one-to-one, with the p-value (since that probability is more of a Bayesian concept).
$endgroup$
– Martijn Weterings
Jul 5 at 16:16
$begingroup$
I have adapted the text to take away this misinterpretation.
$endgroup$
– Martijn Weterings
Jul 5 at 16:19
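To make the distinction in these comments concrete, here is a small, entirely hypothetical Bayes calculation (the prior and likelihoods are invented) showing why a 1% p-value against "no psychic ability" need not mean the claim is probably true:

    # Prior: we give the coin-prediction claim one chance in a million (hypothetical)
    prior_claim = 1e-6
    prior_null = 1 - prior_claim

    # Assume the observed run of correct guesses has probability 1% under the null
    # (roughly the p-value) and, say, 50% if the claim were true
    p_data_given_null = 0.01
    p_data_given_claim = 0.5

    posterior_claim = (p_data_given_claim * prior_claim) / (
        p_data_given_claim * prior_claim + p_data_given_null * prior_null
    )
    print(posterior_claim)  # about 5e-5: the null hypothesis remains overwhelmingly plausible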
$begingroup$
It sounds like they're arguing the p-value versus the definition of "trend."
If you plot the data on a run chart, you may see a trend: a run of points going up or down over time.
But when you do the statistics on it, the p-value suggests it's not significant.
For the p-value to show little significance while they still see a trend or run in the data, it would have to be a very slight trend.
So, if that were the case, I would fall back on the p-value. That is: OK, yes, there's a trend or run in the data, but it's so slight and insignificant that the statistics suggest it's not worth further analysis.
An insignificant trend may be attributable to some kind of bias in the research, maybe something very minor, something that was a one-time occurrence in the experiment and happened to create a slight trend.
If I were the manager of the group, I would tell them to stop wasting time and money digging into insignificant trends and to look for more significant ones.
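As a rough sketch of that run-chart-versus-p-value comparison (with invented process data), one could plot the series and test the slope:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    t = np.arange(30)
    # Hypothetical process data with a very slight upward drift buried in noise
    y = 50 + 0.02 * t + rng.normal(0, 1.0, t.size)

    # A run chart is simply y plotted against t; the eye may see a run,
    # but the regression slope and its p-value quantify how slight the trend is
    fit = stats.linregress(t, y)
    print(f"slope = {fit.slope:.3f}, p-value = {fit.pvalue:.3f}")
    # A large p-value says the apparent run is well within what noise alone produces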
$endgroup$
answered Jul 5 at 17:56
blahblah
$begingroup$
It sounds like in this case they have little justification for their claim and are just abusing statistics to reach the conclusion they already held. But there are times when it's OK not to be so strict with p-value cutoffs. This question (how to use statistical significance and p-value cutoffs) is a debate that has been raging since Fisher, Neyman, and Pearson first laid the foundations of statistical testing.
Let's say you are building a model and you are deciding which variables to include. You gather a little data to do some preliminary investigation into potential variables. There is one variable the business team is really interested in, but your preliminary investigation shows that it is not statistically significant. However, the 'direction' of the variable comports with what the business team expected, and although it didn't meet the threshold for significance, it was close. Perhaps it was suspected to have a positive correlation with the outcome, and you got a positive beta coefficient, but the p-value was just a bit above the .05 cutoff.
In that case, you might go ahead and include it. It's a sort of informal Bayesian reasoning -- there was a strong prior belief that it is a useful variable, and the initial investigation showed some evidence in that direction (but not statistically significant evidence!), so you give it the benefit of the doubt and keep it in the model. Perhaps with more data its relationship with the outcome of interest will become more evident.
Another example might be when you are building a new model and you look at the variables used in the previous model -- you might continue to include a marginal variable (one on the cusp of significance) to maintain some continuity from model to model.
Basically, depending on what you are doing, there are reasons to be more or less strict about these sorts of things.
On the other hand, it's also important to keep in mind that statistical significance does not have to imply practical significance! Remember that at the heart of all this is sample size. Collect enough data and the standard error of the estimate will shrink toward 0. This will make any difference, no matter how small, 'statistically significant', even if that difference might not amount to anything in the real world. For example, suppose the probability of a particular coin landing on heads was .500000000000001. In theory you could design an experiment that concludes the coin is not fair, but for all practical purposes the coin could be treated as fair.
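A quick numerical sketch of that last point (the coin, the bias, and the sample size below are made up): with an absurdly large sample, even a bias nobody would care about becomes "statistically significant".

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    p_true = 0.501      # a bias with no practical importance
    n = 10_000_000      # an enormous number of flips
    heads = rng.binomial(n, p_true)
    p_hat = heads / n

    # Normal-approximation z-test of H0: the coin is fair (p = 0.5)
    z = (p_hat - 0.5) / np.sqrt(0.25 / n)
    p_value = 2 * stats.norm.sf(abs(z))
    print(f"p_hat = {p_hat:.4f}, p-value = {p_value:.2e}")
    # The p-value is typically minuscule, yet for all practical purposes the coin is fair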
$endgroup$
answered Jul 6 at 14:11
eps
193 bronze badges
$begingroup$
I found these articles helpful: Still Not Significant and Marginally Significant
$endgroup$
– user20637
Jul 6 at 19:03