
Which likelihood function is used in linear regression?


When deriving the maximum likelihood estimate for a linear regression, we start from a likelihood function. Does it matter which of these two forms we use?

$P(y|x,w)$

$P(y,x|w)$

All the pages I have read on the internet use the first one.

I found that $P(y,x|w)$ is equal to $P(y|x,w) \cdot P(x)$, so maximizing $P(y,x|w)$ with respect to $w$ is the same as maximizing $P(y|x,w)$, because $x$ and $w$ are independent.

The second function looks better to me because it reads as "the probability of the parameters given the data ($x$ AND $y$)," which the first function doesn't convey.

Is my reasoning correct? Is there any difference?
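The equivalence claimed above can be checked numerically. The sketch below assumes, purely for illustration, a simple Gaussian model $y = wx + \varepsilon$ with known noise scale and a standard-normal $p(x)$; all names and values are made up:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(size=50)                       # x ~ p(x), which does not involve w
y = 2.0 * x + rng.normal(scale=0.5, size=50)  # true w = 2, Gaussian noise

def neg_log_cond(w):
    # -log P(y | x, w): Gaussian likelihood of y around w*x
    return -norm.logpdf(y, loc=w * x, scale=0.5).sum()

def neg_log_joint(w):
    # -log P(y, x | w) = -log P(y | x, w) - log P(x); the log P(x) term is constant in w
    return neg_log_cond(w) - norm.logpdf(x).sum()

w_cond = minimize_scalar(neg_log_cond).x
w_joint = minimize_scalar(neg_log_joint).x
print(w_cond, w_joint)
```

Because $\log P(x)$ shifts the joint log-likelihood by a constant that does not involve $w$, both objectives have the same maximizer.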










  • There are (many) votes to close, but it seems clear and on-topic to me.
    – kjetil b halvorsen
    Aug 15 at 8:47

  • What does $w$ stand for here?
    – Alecos Papadopoulos
    Aug 15 at 18:10

  • @AlecosPapadopoulos It stands for $\theta$, the parameters of the model.
    – floyd
    Aug 16 at 14:44

















regression mathematical-statistics maximum-likelihood






asked Aug 10 at 16:14 by floyd
edited Aug 15 at 15:09










3 Answers
As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.

Typically for regression problems we prefer to write the version conditioning on $x$, because we think of regression as modeling the conditional distribution $p(y \mid x)$ while ignoring what $p(x)$ may look like. This is because, for the model $Y = f(X) + \varepsilon$, where $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, we want the estimated function $\hat{f}$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same as pulling $\hat{f}$ close to $f$ on all its input values individually: the distribution $p(x)$ of how the inputs are scattered doesn't matter, as long as $\hat{f}$ is close to $f$ on those values. We only want to model how $y$ responds to $x$; we don't care how $x$ itself is distributed.

More mathematically, from an MLE standpoint the goal is to make the likelihood $P(x,y \mid w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize the expected prediction error over $f$,
$$\min_f \text{EPE}(f) = \min_f \mathbb{E}(Y - f(X))^2,$$
then, omitting some computation, the minimizing $f$ turns out to be
$$f(x) = \mathbb{E}(Y \mid X = x),$$
so for least-squares loss the best possible $f$ depends only on the conditional distribution $p(y \mid x)$, and estimating it should not require any additional information.

However, in classification problems (logistic regression, linear discriminant analysis, naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$, and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$ (and we also do not use least-squares loss there).

– Drew N
answered Aug 10 at 18:09
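The connection between the Gaussian conditional likelihood and least squares drawn in this answer can be checked numerically. This is a minimal sketch with made-up data and parameter values: the intercept and slope that maximize the Gaussian log-likelihood coincide with the least-squares fit.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.5 + 3.0 * x + rng.normal(scale=0.3, size=100)  # true intercept 1.5, slope 3

def nll(params):
    # Negative Gaussian log-likelihood for y | x ~ N(b0 + b1*x, s^2), with s = exp(log_s);
    # the constant (n/2) log(2*pi) term is dropped since it doesn't affect the argmax.
    b0, b1, log_s = params
    resid = y - (b0 + b1 * x)
    return 0.5 * np.sum(resid**2) * np.exp(-2.0 * log_s) + y.size * log_s

mle = minimize(nll, x0=[0.0, 0.0, 0.0]).x            # maximum likelihood fit
X = np.column_stack([np.ones_like(x), x])
ols = np.linalg.lstsq(X, y, rcond=None)[0]           # least-squares fit
print(mle[:2], ols)
```

The two intercept/slope pairs agree to numerical precision; only the noise scale is an extra parameter of the MLE formulation.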






  • Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data", with the likelihood written as $P(\text{data} \mid m, w)$ where $m$ stands for "model"; so they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data". Am I right, or should I call $x$ "the model"?
    – floyd
    Aug 10 at 18:43

  • I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$, the model is assumed implicitly.
    – Drew N
    Aug 10 at 19:27


















Linear regression is about how the outcome or response variable $y$ varies with $x$. The model equation is, say,
$$Y_i = \beta_0 + \beta_1 x_{i1} + \dotsm + \beta_p x_{ip} + \epsilon_i,$$
and how $x$ is distributed doesn't by itself give information about the $\beta$'s. That's why your second form of the likelihood is irrelevant, and so is not used.

See "Definition and delimitation of regression model" for the meaning of *regression model*, also "What are the differences between stochastic v.s. fixed regressors in linear regression model?" and especially my answer here: "What is the difference between conditioning on regressors vs. treating them as fixed?"

– kjetil b halvorsen
answered Aug 15 at 8:46
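The claim that the distribution of $x$ carries no information about the $\beta$'s can be illustrated by simulation. This sketch uses made-up coefficients and two deliberately different regressor distributions; under either design, least squares recovers the same coefficients on average:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([1.0, -2.0])  # made-up true (intercept, slope)

def fit_once(x):
    # Simulate y from the same linear model and return the OLS estimate
    y = beta[0] + beta[1] * x + rng.normal(scale=0.5, size=x.size)
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Two very different distributions p(x); neither changes what the betas are
est_normal = np.mean([fit_once(rng.normal(size=200)) for _ in range(300)], axis=0)
est_uniform = np.mean([fit_once(rng.uniform(0.0, 10.0, size=200)) for _ in range(300)], axis=0)
print(est_normal, est_uniform)
```

Only the precision of the estimates depends on the design, not their target.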






That's a good question, since the difference is a bit subtle; hopefully this helps.

The difference between calculating $p(y|x,w)$ and $p(y,x|w)$ lies in whether you model $X$ as fixed or random. In the first, you want the probability of $Y$ conditioned on $X$ and $w$ (the parameters, usually denoted $\beta$ in statistical settings), whereas in the second you're looking for the joint probability of $X$ and $Y$ conditioned on $w$.

Usually, in simple linear regression,

$$Y = \beta_0 + \beta_1 X + \epsilon,$$

you model $X$ as fixed and independent of $\epsilon$, which follows an $N(0,\sigma^2)$ distribution. That is, $Y$ is modeled as a linear function of some fixed input ($X$) plus some random noise ($\epsilon$). Therefore it makes sense to model it as $p(y|X,w)$ in this setting, since $X$ is fixed, or modeled as constant, which is usually denoted explicitly as $X=x$.

Mathematically, when $X$ is fixed, $p(X) = 1$, as it is constant. Therefore

$$p(y,X|w) = p(y|X,w)\,p(X) \implies p(y,X|w) = p(y|X,w).$$

If $X$ is not constant and given, then $p(X)$ is no longer 1 and the above doesn't hold, so it all comes down to whether you model $X$ as random or fixed.






    share|cite|improve this answer









    $endgroup$

















      Your Answer








      StackExchange.ready(function()
      var channelOptions =
      tags: "".split(" "),
      id: "65"
      ;
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function()
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled)
      StackExchange.using("snippets", function()
      createEditor();
      );

      else
      createEditor();

      );

      function createEditor()
      StackExchange.prepareEditor(
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader:
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      ,
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      );



      );













      draft saved

      draft discarded


















      StackExchange.ready(
      function ()
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f421598%2fwhich-likelihood-function-is-used-in-linear-regression%23new-answer', 'question_page');

      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      3













      $begingroup$

      As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.



      Typically for regression problems we prefer to write the version with the conditioning on $x$ because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. This is because for the model $Y = f(X) + varepsilon$, where $varepsilon sim mathcalN(0, sigma^2)$, we want the estimate function $hatf$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same thing as pulling $hatf$ close to $f$ on all its input values, individually---it doesn't matter the distribution $p(x)$ of how the inputs are scattered, as long as $hatf$ is close to $f$ on those values. We only want to model how $y$ responds to $x$, and we don't care about how $x$ itself.



      More mathematically, from an MLE standpoint the goal is to get our likelihoods $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize expected prediction error over $f$,
      $$min_f textEPE(f) = min_f mathbbE(Y - f(X))^2$$
      omitting some computation we obtain the minimizing $f$ to be
      $$f(x) = mathbbE(Y | X = x)$$
      so for least squares loss, the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.



      However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$---we also do not use least squares loss in those.






      share|cite|improve this answer









      $endgroup$










      • 1




        $begingroup$
        Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
        $endgroup$
        – floyd
        Aug 10 at 18:43






      • 1




        $begingroup$
        I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
        $endgroup$
        – Drew N
        Aug 10 at 19:27















      3













      $begingroup$

      As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.



      Typically for regression problems we prefer to write the version with the conditioning on $x$ because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. This is because for the model $Y = f(X) + varepsilon$, where $varepsilon sim mathcalN(0, sigma^2)$, we want the estimate function $hatf$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same thing as pulling $hatf$ close to $f$ on all its input values, individually---it doesn't matter the distribution $p(x)$ of how the inputs are scattered, as long as $hatf$ is close to $f$ on those values. We only want to model how $y$ responds to $x$, and we don't care about how $x$ itself.



      More mathematically, from an MLE standpoint the goal is to get our likelihoods $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize expected prediction error over $f$,
      $$min_f textEPE(f) = min_f mathbbE(Y - f(X))^2$$
      omitting some computation we obtain the minimizing $f$ to be
      $$f(x) = mathbbE(Y | X = x)$$
      so for least squares loss, the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.



      However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$---we also do not use least squares loss in those.






      share|cite|improve this answer









      $endgroup$










      • 1




        $begingroup$
        Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
        $endgroup$
        – floyd
        Aug 10 at 18:43






      • 1




        $begingroup$
        I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
        $endgroup$
        – Drew N
        Aug 10 at 19:27













      3














      3










      3







      $begingroup$

      As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.



      Typically for regression problems we prefer to write the version with the conditioning on $x$ because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. This is because for the model $Y = f(X) + varepsilon$, where $varepsilon sim mathcalN(0, sigma^2)$, we want the estimate function $hatf$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same thing as pulling $hatf$ close to $f$ on all its input values, individually---it doesn't matter the distribution $p(x)$ of how the inputs are scattered, as long as $hatf$ is close to $f$ on those values. We only want to model how $y$ responds to $x$, and we don't care about how $x$ itself.



      More mathematically, from an MLE standpoint the goal is to get our likelihoods $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize expected prediction error over $f$,
      $$min_f textEPE(f) = min_f mathbbE(Y - f(X))^2$$
      omitting some computation we obtain the minimizing $f$ to be
      $$f(x) = mathbbE(Y | X = x)$$
      so for least squares loss, the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.



      However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$---we also do not use least squares loss in those.






      share|cite|improve this answer









      $endgroup$



      As you say, in this case the choice doesn't matter; the maximization over $w$ is the same. But you touch on a good point.



      Typically for regression problems we prefer to write the version with the conditioning on $x$ because we think of regression as modeling the conditional distribution $p(y | x)$ while ignoring the details of what $p(x)$ may look like. This is because for the model $Y = f(X) + varepsilon$, where $varepsilon sim mathcalN(0, sigma^2)$, we want the estimate function $hatf$ we produce to be as close to $f$ as possible on average, measured by, say, least squares. This is the same thing as pulling $hatf$ close to $f$ on all its input values, individually---it doesn't matter the distribution $p(x)$ of how the inputs are scattered, as long as $hatf$ is close to $f$ on those values. We only want to model how $y$ responds to $x$, and we don't care about how $x$ itself.



      More mathematically, from an MLE standpoint the goal is to get our likelihoods $P(x,y|w)$ as large as possible, but as you say, we are not modeling $P(x)$ through $w$, so $P(x)$ factors out and doesn't matter. From a fitting standpoint, if we want to minimize expected prediction error over $f$,
      $$min_f textEPE(f) = min_f mathbbE(Y - f(X))^2$$
      omitting some computation we obtain the minimizing $f$ to be
      $$f(x) = mathbbE(Y | X = x)$$
      so for least squares loss, the best possible $f$ depends only on the conditional distribution $p(y | x)$, and estimating it should not require any additional information.



      However, in classification problems (logistic regression, linear discriminant analysis, Naive Bayes) there is a difference between a "generative" and a "discriminative" model; generative models do not condition on $x$ and discriminative models do. For instance, in linear discriminant analysis we do model $P(x)$ through $w$---we also do not use least squares loss in those.







      share|cite|improve this answer












      share|cite|improve this answer



      share|cite|improve this answer










      answered Aug 10 at 18:09









      Drew N Drew N

      3952 silver badges9 bronze badges




      3952 silver badges9 bronze badges










      • 1




        $begingroup$
        Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
        $endgroup$
        – floyd
        Aug 10 at 18:43






      • 1




        $begingroup$
        I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
        $endgroup$
        – Drew N
        Aug 10 at 19:27












      • 1




        $begingroup$
        Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
        $endgroup$
        – floyd
        Aug 10 at 18:43






      • 1




        $begingroup$
        I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
        $endgroup$
        – Drew N
        Aug 10 at 19:27







      1




      1




      $begingroup$
      Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
      $endgroup$
      – floyd
      Aug 10 at 18:43




      $begingroup$
      Thank you for the explanation. I saw some pages on the internet call $X$ "the model" and $y$ "the sample/data" and the likelihood is $P(data|m, w)$ where $m$ stands for "model". So they write the likelihood as $P(y|x,w)$. In my question I called both $x$ AND $y$ "data", Am I right? or should I call $x$ "the model"?
      $endgroup$
      – floyd
      Aug 10 at 18:43




      1




      1




      $begingroup$
      I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
      $endgroup$
      – Drew N
      Aug 10 at 19:27




      $begingroup$
      I wouldn't call $X$ the model. When they write $m$ that way, they're conditioning on a probabilistic model for the data in addition to the data $x$; when they write just $x$ instead of $m$ the model is assumed implicitly.
      $endgroup$
      – Drew N
      Aug 10 at 19:27













      2













      $begingroup$

      Linear regression is about how $x$ influences/varies with $y$, the outcome or response variable. The model equation is
      $$ Y_i =beta_0 + beta_1 x_i1 + dotsm + beta_p x_ip+epsilon_i
      $$
      say, and how $x$ is distributed doesn't by itself give information about the $beta$'s. That's why your second form of likelihood is irrelevant, so is not used.



      See Definition and delimitation of regression model for the meaning of *regression model**, also What are the differences between stochastic v.s. fixed regressors in linear regression model? and especially my answer here: What is the difference between conditioning on regressors vs. treating them as fixed?






      share|cite|improve this answer









      $endgroup$



















        2













        $begingroup$

        Linear regression is about how $x$ influences/varies with $y$, the outcome or response variable. The model equation is
        $$ Y_i =beta_0 + beta_1 x_i1 + dotsm + beta_p x_ip+epsilon_i
        $$
        say, and how $x$ is distributed doesn't by itself give information about the $beta$'s. That's why your second form of likelihood is irrelevant, so is not used.



        See Definition and delimitation of regression model for the meaning of *regression model**, also What are the differences between stochastic v.s. fixed regressors in linear regression model? and especially my answer here: What is the difference between conditioning on regressors vs. treating them as fixed?






        share|cite|improve this answer









        $endgroup$

















          2














          2










          2







          answered Aug 15 at 8:46









          kjetil b halvorsen

























              That's a good question, since the difference is a bit subtle - hopefully this helps.

              The difference between calculating $p(y|x,w)$ and $p(y,x|w)$ lies in whether you model $X$ as fixed or random. In the first, you want the probability of $Y$ conditioned on $X$ and $w$ (usually denoted $\beta$ in statistical settings), whereas in the second, you're looking at the joint probability of $X$ and $Y$ conditioned on $w$.

              Usually, in simple linear regression,

              $$Y = \beta_0 + \beta_1 X + \epsilon$$

              you model $X$ as fixed and independent of $\epsilon$, which follows a $N(0,\sigma^2)$ distribution. That is, $Y$ is modeled as a linear function of some fixed input ($X$) plus random noise ($\epsilon$). Therefore, it makes sense to use $p(y|X,w)$ in this setting, as $X$ is fixed or modeled as constant, which is usually written explicitly as $X=x$.

              Mathematically, when $X$ is fixed, the factor $p(X)$ is a constant that does not depend on $w$. Therefore

              $$p(y,X|w) = p(y|X,w)\,p(X) \propto p(y|X,w),$$

              so both are maximized by the same $w$. If $X$ is not constant and given, then $p(X)$ is no longer a constant factor and the above doesn't hold, so it all comes down to whether you model $X$ as random or fixed.
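A small sketch of why the $p(X)$ factor is irrelevant for estimating $w$ (assuming NumPy; the distributions and the grid are illustrative): the joint log-likelihood differs from the conditional one by $\log p(x)$, a constant in $w$, so both are maximized at exactly the same $w$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(2.0, 1.0, n)              # X drawn from its own distribution
y = 3.0 * x + rng.normal(0.0, 1.0, n)    # Y = w*X + noise, true w = 3

def log_norm(z, mu, s):
    """Elementwise Gaussian log-density."""
    return -0.5 * np.log(2 * np.pi * s**2) - (z - mu) ** 2 / (2 * s**2)

w_grid = np.linspace(2.0, 4.0, 401)
# conditional log-likelihood: log p(y | x, w)
ll_cond = np.array([np.sum(log_norm(y, w * x, 1.0)) for w in w_grid])
# joint log-likelihood: log p(y, x | w) = log p(y | x, w) + log p(x),
# where the log p(x) term does not involve w at all
log_px = np.sum(log_norm(x, 2.0, 1.0))
ll_joint = ll_cond + log_px

w_cond = w_grid[np.argmax(ll_cond)]
w_joint = w_grid[np.argmax(ll_joint)]
```

Since `ll_joint` is just `ll_cond` shifted by a constant, the two curves peak at the same grid point, which is the formal content of "it all comes down to whether you model $X$ as random or fixed" when only $w$ is of interest.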






                  answered Aug 10 at 18:01









                  Samir Rachid Zaim





























