Big Sample size, Small coefficients, significant results. What should I do?
I did some quantitative research and used rank-ordered logistic regression in Stata. The independent variables have p-values of almost 0, which suggests they have a significant effect on the dependent variable. But the sample size is large (35,000 records) and the coefficients are very small (e.g., 0.0001), which makes me worry that there is no real relationship: when the sample size is this big, almost anything can become significant.
I also tested the model on only 5,000 records and still got significant results.
What do you recommend? Should I use a smaller sample so that the reviewers of my paper won't raise the problem of the large sample size, or is there another way to report my results and show that the variables really do have a significant effect?
I would appreciate any help.
Thanks
statistical-significance regression-coefficients sample-size ordered-logit
asked Jul 25 at 14:39
PSS
One question: for the coefficients that are estimated to be around 0.0001 and are statistically significant, what is the distribution of the associated covariate? An estimate of 0.0001 is huge if the standard deviation of the associated covariate is 100,000.
– Cliff AB
Jul 25 at 23:37
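To make the scale check in the comment above concrete, one can convert each raw coefficient into the change in the linear predictor per one standard deviation of its covariate. A minimal, model-agnostic Python sketch (the coefficients, column names, and data are hypothetical, not taken from the original Stata model):

```python
import pandas as pd

# Hypothetical coefficients from a fitted model and the data they were
# estimated on; values and names are illustrative only.
coefs = {"income": 0.0001, "age": 0.04}
df = pd.DataFrame({"income": [25_000, 80_000, 150_000, 47_000],
                   "age": [23, 41, 67, 35]})

# Effect on the linear predictor of a one-standard-deviation change in x.
# A "tiny" 0.0001 coefficient is substantial if SD(income) is on the
# order of 50,000.
per_sd_effect = {name: beta * df[name].std() for name, beta in coefs.items()}
print(per_sd_effect)
```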
4 Answers
I think this has been asked before. It's useful to realize that, without a prespecified sample size and alpha level, the $p$-value is just a measure of the sample size you ultimately wind up with, which is not appealing. An approach I use is this: ask at what sample size a 0.05 level would be appropriate, and scale accordingly. For instance, I feel the 0.05 level is often suited to problems with about 100 observations; that is, I would say "wow, that is an interesting finding" if it had a 1/20 chance of being a false positive. So if you have a sample size of 5,000, that's 50 times larger than 100: divide the 0.05 level by 50 and use 0.001 as your significance level. This is in line with what Fisher advocated: don't do significance testing with p-values alone; weigh them against the power of the study. The sample size is the simplest, rawest measure of a study's power, and an overpowered study with the conventional 0.05 cutoff makes no sense.
That said, it is generally not advisable to choose a significance cutoff after viewing the data and results. One might think it is kosher to arbitrarily choose a more stringent criterion post hoc, but in fact it only deceives readers into thinking you ran a better-controlled trial than you did. Think of it this way: if you had observed p = 0.04, you wouldn't be asking this question; the analysis would be a tidy inferential package.
Another way to look at it: just report the CI and state that the result was statistically significant. For instance, you might have a 95% CI for a hazard ratio of (0.01, 0.16), where the null value is 1. It suffices to say that the p-value is really freakin' small; you don't need to clutter the page with p = 0.0000000023. Only show p to a sensible precision: if you report three decimal places, write p < 0.001, and never round to 0.000, which suggests you don't know what a p-value means.
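As a rough illustration of the scaling heuristic above, here is a minimal Python sketch; the reference size of 100 observations and the 0.05 baseline are this answer's own rules of thumb, not universal constants:

```python
def scaled_alpha(n, reference_n=100, reference_alpha=0.05):
    """Shrink the significance level in proportion to how much larger
    the sample is than the size the 0.05 convention is assumed to suit."""
    return reference_alpha * reference_n / n

def report_p(p):
    """Report p only to a sensible precision; never print 0.000."""
    return "p < 0.001" if p < 0.001 else f"p = {p:.3f}"

print(scaled_alpha(5_000))    # 0.001, as in the answer's example
print(scaled_alpha(35_000))   # roughly 0.00014 for the full data set
print(report_p(2.3e-9))       # "p < 0.001"
```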
edited Jul 26 at 15:53
answered Jul 25 at 14:50
AdamO
Thanks so much for your response :)
– PSS
Jul 25 at 15:01
You have encountered the gulf between "statistically significant" and "meaningful". As you point out, with sufficient sample size, you can assign statistical significance to arbitrarily small differences - there is no difference too small that can't be called "significant" with large enough N. You need to use domain knowledge to determine what is a "meaningful" difference. You might find, for example, that a new drug increases a person's lifespan by 10 seconds - even though you can be very confident that that increase is not due to random variation in your data, it's hardly a meaningful increase in lifespan.
Some of this will come from knowing about your problem and what people in the field consider meaningful. You could also try to think of future studies that might replicate your findings, and the typical N that they might use. If future studies will likely have a much lower N, you could calculate the effect size needed to replicate your findings in data of that size, and only report significant, meaningful, and feasibly reproducible results.
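One way to act on the last suggestion is a standard power calculation: fix the sample size a plausible follow-up study could afford and solve for the smallest effect it could detect. A sketch using the two-sample t-test power routine in statsmodels as a stand-in (the original model is an ordered logit, and the follow-up size of 200 per group is a made-up placeholder):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Smallest standardized effect a hypothetical follow-up study with
# 200 subjects per group could detect at alpha = 0.05 with 80% power.
detectable = analysis.solve_power(effect_size=None, nobs1=200,
                                  alpha=0.05, power=0.80)
print(f"Detectable standardized effect: {detectable:.2f}")  # about 0.28
```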
answered Jul 25 at 18:28
Nuclear Wang
When you have many samples and the observed effect is very small (small for the application at hand), you can safely conclude that the independent variables do not have an important effect on the dependent variable. An effect can be "statistically significant" and unimportant at the same time.
Using a small sample and ignoring the results from the large sample would be inappropriate. You owe it to the people who will read your paper and design new experiments based on your observations to report what the full data show.
edited Jul 25 at 18:32
answered Jul 25 at 18:16
Ali
I think you should decide on an "expected minimal effect size", i.e. the smallest coefficient you care to include in your model. Say, do you care about coefficients smaller than 0.0001, or 1, or 100? To clarify: the effect size is the degree to which the null hypothesis is false, i.e. how large the coefficient actually is; it is a parameter of the population. The expected minimal effect size, on the other hand, is the smallest departure from the null that you care to detect; it is a parameter of the test.
Now that you have the sample size $N = 35000$ as well as an expected minimal effect size, a power analysis should reveal the relationship between $\alpha$ and $\beta$ given these parameters. Next, decide how to balance significance level and power by choosing a pair of $\alpha$ and $\beta$. (Technically, all these parameters must be decided before looking at the data, but at this point I guess you can just pretend you didn't see them.) Then carry out your test, compare $p$ with $\alpha$, and draw a conclusion accordingly.
By the way, I believe there is no reason to exclude any records, unless you are doing cross-validation, for example. More data generally leads to more accurate inference, and discarding sample points in a selective manner may introduce bias.
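A sketch of the kind of power analysis described above, again using a two-sample comparison from statsmodels as a stand-in for the rank-ordered logit (the minimal standardized effect of 0.05 is an arbitrary placeholder, not a value from the original study):

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = 35_000 // 2      # split the 35,000 records into two groups
minimal_effect = 0.05          # placeholder minimal standardized effect

# Power (1 - beta) at several candidate significance levels; with this
# much data even a tiny effect is detected with power above 0.9.
for alpha in (0.05, 0.01, 0.001):
    power = analysis.power(effect_size=minimal_effect,
                           nobs1=n_per_group, alpha=alpha)
    print(f"alpha = {alpha:<5}  power = {power:.3f}")
```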
answered Jul 26 at 7:21
nalzok