

Are the errors in this formulation of the simple linear regression model random variables?


On page 21 of Applied Linear Regression, fourth edition, by Sanford Weisberg, the error $e_i$ for case $i$ under the simple linear regression model is defined to be $y_i - E(Y | X = x_i)$, where $E(Y | X = x_i)$ is assumed to equal $\beta_0 + \beta_1 x_i$ for some unknown $\beta_0, \beta_1 \in \mathbb{R}$. The book says that

The errors $e_i$ depend on unknown parameters in the mean function and so are not observable quantities. They are random variables and correspond to the vertical distance between the point $y_i$ and the mean function $E(Y | X = x_i)$.

It doesn't seem to me like $e_i$ is a random variable, because it's a function of $y_i$ and $x_i$, which are non-random, observed values. Why can $e_i$ be considered a random variable?










Tags: regression, random-variable, assumptions






asked Jul 15 at 15:23 by VKV
2 Answers


















I looked up your citation (4th edition, page 21) because I found it very alarming, and I was relieved to find that the equation is actually given as:



$$ \hat{e}_i = y_i - \widehat{E}(Y|X=x_i) = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i) \tag{2.3} $$



Which is still confusing, I grant you, and the difference isn't actually germane to your question, but at least it isn't patently false. I'll explain why I found it alarming before discussing your (unrelated, I think) question. The "hat" indicates "estimated", usually by MLE in the context of linear regression, and there is a crucial distinction between the "true errors", which are denoted $\epsilon_i$ and are normally distributed and i.i.d., and the "residuals", which are denoted $e_i$ and are not i.i.d. The formula without the hats would imply that the two are exactly equal, which is not the case.



On to your real question, which boils down to: "are the given data $x_i$ and $y_i$ random or not?"



If you believe the pairs $(x_i, y_i)$ are known and non-random, that is, if you believe that $\forall\, 1 \leq i \leq n,\ (x_i, y_i) \in \mathbb{R} \times \mathbb{R}$, then the residuals $e_i$ are also known and non-random, i.e. $\forall\, 1 \leq i \leq n,\ e_i \in \mathbb{R}$. This is because there is a deterministic function giving the "best" parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ from those observations, and then a deterministic function giving the residuals in terms of those parameters. This point of view is useful and allows us to derive the MLE estimators of $\beta$, for example. It is also the most intuitive view to take when you're sitting in front of a concrete, real-world dataset.
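To make the deterministic view concrete, here is a minimal numpy sketch (the numbers are made up, not from the book): given a fixed dataset, the closed-form OLS formulas return one specific $\hat{\beta}_0$, $\hat{\beta}_1$, and one specific set of residuals.

    import numpy as np

    # Hypothetical fixed dataset: five observed (x_i, y_i) pairs
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    # Closed-form OLS estimates for simple linear regression
    beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    beta0_hat = y.mean() - beta1_hat * x.mean()

    # Residuals are a deterministic function of the data and the estimates
    residuals = y - (beta0_hat + beta1_hat * x)
    print(beta0_hat, beta1_hat, residuals)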



However, it kind of puts the cart before the horse and basically shuts down certain kinds of statistical analysis. For example, we cannot talk about the "distribution" of $\hat{\beta}_1$, because it is not a random variable and therefore has no distribution! How can we then talk about something like the Wald test? Likewise, how do we talk about the "distribution" of residuals so that we can say whether one is an outlier or not?



The way this is done is by treating the dataset itself as random. When we want to do statistical inference on a known dataset, we can then treat the known values as a realization of the random dataset. The exact construction is a little pedantic and is often omitted, but it helps to go through it at least once. First, we say that $X$ and $Y$ are two random variables with some joint probability distribution $F_{X,Y}(\boldsymbol{\beta}, \sigma^2)$ with parameters $\boldsymbol{\beta} = [\beta_0, \beta_1]^T$ and $\sigma$. $F_{X,Y}$ is specified by the model $Y = \beta_0 + \beta_1 X + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Now, imagine that we have $n$ i.i.d. copies of $F_{X,Y}$ that we combine into one big joint probability function $F_{X_1,Y_1,X_2,Y_2,\ldots,X_n,Y_n}$.



Now we can imagine the dataset $(x_i, y_i)$ for $i=1,\ldots,n$ not merely as some known set of numbers, but as a realization sampled from $F_{X_1,Y_1,X_2,Y_2,\ldots,X_n,Y_n}$. Each time we sample, we don't just get one pair of numbers, we get $n$ pairs of numbers: a brand new dataset. But that means the parameters $\hat{\beta}$ get new estimates, and we then calculate new residuals $e_i$, right?
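As an illustration of this repeated-sampling picture, here is a small numpy sketch (the "true" parameter values are made up): every draw produces a fresh dataset and hence a fresh $\hat{\beta}_1$, which is exactly what it means for $\hat{\beta}_1$ to behave like a random variable.

    import numpy as np

    rng = np.random.default_rng(0)
    beta0, beta1, sigma, n = 1.0, 2.0, 0.5, 50   # assumed "true" values

    beta1_hats = []
    for _ in range(1000):
        # Draw a brand-new dataset of n (x_i, y_i) pairs from the model
        x = rng.uniform(0, 10, size=n)
        eps = rng.normal(0, sigma, size=n)       # true errors, i.i.d. normal
        y = beta0 + beta1 * x + eps
        # Re-estimate the slope on the new dataset
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
        beta1_hats.append(b1)

    # The slope estimate varies from dataset to dataset around the true value 2.0
    print(np.mean(beta1_hats), np.std(beta1_hats))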



Instead of thinking of this as repeated sampling, which is somewhat crude, we can express this entirely in the algebra of random variables. It can be expressed as two $n$-dimensional random vectors $\vec{X}$ and $\vec{Y}$ drawn from $F_{X_1,Y_1,X_2,Y_2,\ldots,X_n,Y_n}$. Now $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables because they are functions of $(\vec{X}, \vec{Y})$. Likewise, all the $e_i$ are random variables because they are functions of $(\vec{X}, \vec{Y})$.



This state of affairs is much better, because now we can make statements like "the set of residuals $e_i$ cannot be independent because they always sum exactly to zero" or "$\hat{\beta}_1$, standardized by its standard error, follows a t-distribution" without talking literal nonsense. (Both of these statements only make sense if their subjects are random variables.)
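The first of those statements is easy to check numerically; a quick sketch with simulated data (arbitrary parameter values) shows that OLS residuals from a model with an intercept sum to zero up to floating-point error:

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 10, size=30)
    y = 1.0 + 2.0 * x + rng.normal(0, 0.5, size=30)

    # Fit simple linear regression by OLS
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    residuals = y - (b0 + b1 * x)

    # With an intercept in the model, the residuals always sum to ~0,
    # so they cannot be independent random variables
    print(residuals.sum())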



In the real world we can't always go and get a brand-new, randomly sampled dataset. We can approximate this with something like the bootstrap, of course, but doing it for real isn't usually practical. But doing it conceptually allows us to think clearly about how randomness during sampling would affect our regression.



You'll note that I did not introduce new notation for $e_i$ and $\hat{\beta}$ but simply said, "now these things, which we previously thought of as concrete realizations, will be treated as random variables." As far as I can tell, you just have to be on your toes for this kind of signposting - the same kind you found in your textbook - to tell whether symbols are referring to random or non-random variables, because while there are conventions (such as using uppercase roman letters for random variables), they are not consistently applied. If the author tells you $e_i$ is a random variable, he is telling you he is also viewing $x_i$ and $y_i$ as random variables.






answered Jul 15 at 17:12 by olooney (edited Jul 15 at 17:26 by Tim)

In simple linear regression, we assume that the observations are randomly perturbed around the conditional expected value $E[Y|X=x_i]$; so, each of your observations is assumed to be generated from a model of the form $$Y=\beta_0+\beta_1 X+\epsilon, \qquad \epsilon\sim N(0,\sigma^2).$$



This makes each $\epsilon_i$ a random variable by definition. Think about a box where you put in $x_i$ and get out $y_i$, and you never know what's inside or how much error the box introduces. Even if we know that the relation has the form given above, we don't know the true $\beta_0, \beta_1$. If we knew those quantities, we could easily recover $\epsilon_i$. Instead, we estimate them, and get residuals.
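A small numpy sketch (with made-up parameter values) makes this last distinction concrete: in a simulation we do know the true $\beta_0, \beta_1$, so we can recover the true errors $\epsilon_i$ and compare them with the residuals computed from the estimated line.

    import numpy as np

    rng = np.random.default_rng(2)
    beta0, beta1, sigma, n = 3.0, -1.5, 1.0, 40   # true values, known only in simulation

    x = rng.uniform(-5, 5, size=n)
    eps = rng.normal(0, sigma, size=n)            # true errors (unobservable in practice)
    y = beta0 + beta1 * x + eps

    # Estimate the coefficients and compute residuals from the fitted line
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)

    # Residuals track the true errors closely but are not identical to them
    print(np.corrcoef(eps, resid)[0, 1])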






answered Jul 15 at 17:05 by gunes
