


Can I include the product of two random variables? Or do I risk collinearity?
















5












$begingroup$


I have a model in which I want to predict Y.

My regressors, X, are x1 and x2.

For some reason I believe that it would also be useful to include in the model:

  • x3 = x1 * x2
  • x4 = x1 / x2

Can I use the regressors x1, x2, x3 and x4 all together, or do I risk a perfect collinearity problem?
I know, for instance, that using x5 = x1 + x2 would yield perfect collinearity and hence a completely useless regressor.



















$endgroup$







  • 4




    $begingroup$
    x3 is commonly called an interaction term.
    $endgroup$
    – COOLSerdash
    yesterday










  • $begingroup$
    @scugn1zz0 accept the other answer :)
    $endgroup$
    – gunes
    yesterday






  • 1




    $begingroup$
    To would-be respondents: If you believe the OP's model is always problematic, you are asserting that all quartic polynomial regressions are problematic. To see why that is, suppose that in reality the response $Y$ has a quartic polynomial relationship with a single variable $x:$ $$E[Y]=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4x^4.$$ Suppose one has measured not only $x_2=x$ but also--unwittingly--$x_1=x^3.$ In terms of these variables, $x^2=x_1/x_2$ and $x^4=x_1x_2.$ Thus, this model is identical to $$E[Y]=\beta_0+\beta_1x_2+\beta_2x_1/x_2+\beta_3x_1+\beta_4x_1x_2.$$
    $endgroup$
    – whuber
    yesterday











  • $begingroup$
    OP has not specified the type of regression, nor that x1 = x2.
    $endgroup$
    – GuillaumeL
    9 hours ago

























regression linear-model multicollinearity
















asked yesterday









scugn1zz0

264


















4 Answers


















5












$begingroup$

You won't have perfect collinearity (as per your question), but you do risk multicollinearity issues with your two additional regressors.



While they're not algebraically linear combinations of the two predictors, these variables (x1-x4) might, in a particular sample, lie close to a linear subspace, with the typical consequences of near-multicollinearity.



For example, if the two original variates both have very small coefficients of variation then their product can be quite closely related to their sum (or some other linear combination if they're dissimilar in size). This can happen even if the original variables are not highly correlated.



Similarly, high correlation can happen with the ratio and the difference.
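
A quick simulation can make this concrete. The following is only an illustrative sketch in R: the sample size, means and standard deviations are assumptions chosen so that both variables have small coefficients of variation, and the variable names are not taken from the question.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n, mean = 100, sd = 1)   # coefficient of variation about 1%
x2 <- rnorm(n, mean = 50,  sd = 1)   # coefficient of variation about 2%

x3 <- x1 * x2                        # product
x4 <- x1 / x2                        # ratio

# How well is each derived variable explained by a *linear* function of x1 and x2?
r2_prod  <- summary(lm(x3 ~ x1 + x2))$r.squared
r2_ratio <- summary(lm(x4 ~ x1 + x2))$r.squared

cor(x1, x2)                              # the originals are generated independently
c(product = r2_prod, ratio = r2_ratio)   # auxiliary R-squared values
1 / (1 - c(product = r2_prod, ratio = r2_ratio))  # corresponding variance inflation
```

With settings like these, both auxiliary R-squared values should come out very close to 1 even though x1 and x2 are essentially uncorrelated. Increasing the standard deviations relative to the means weakens the effect.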

















$endgroup$








  • 1




    $begingroup$
    +1. As I understand things, you want to center your variables (subtract the mean), which will help to reduce correlation between x1, x2, and x1 * x2.
    $endgroup$
    – Wayne
    yesterday






  • 1




    $begingroup$
    It would certainly help; but then we can come back to other examples; in the case of certain kinds of relationships between x1 and x2, the collection of x1,x2,x1*x2 and x1/x2 can still result in near-multicollinearity
    $endgroup$
    – Glen_b
    yesterday






  • 1




    $begingroup$
    +1 (So far,) this is the only answer with a correct analysis and good advice. @Wayne Centering the variables, however, changes the very meanings of x1*x2 and x1/x2. Moreover, the problem can get worse with centered variables: consider the (very common) case where both x1 and x2 are binary: after centering them, you will obtain exact collinearity among x1*x2 and x1/x2.
    $endgroup$
    – whuber
    yesterday







  • 1




    $begingroup$
    @Wayne This works only because the constant is implicitly included, whence a model in terms of $1,x_1,x_2,$ and $x_1x_2$ is equivalent to a model in terms of $1,x_1-\bar{x}_1,x_2-\bar{x}_2,$ and $(x_1-\bar{x}_1)(x_2-\bar{x}_2).$ This analysis doesn't work when applied to ratios like $x_1/x_2.$ An analysis that does work is to construct an orthogonal basis for the data columns $1,x_1,x_2,x_1x_2,$ and $x_1/x_2.$ This generalizes the idea, if not the algebra, and eliminates all collinearity.
    $endgroup$
    – whuber
    yesterday







  • 1




    $begingroup$
    @Wayne Right: that would be one way to orthogonalize them. It's a generalization of orthogonal polynomials. Nothing's free, though: if there is near-multicollinearity of the original variables, it will show up in the form of one or more principal components that are essentially "noise" and likely have estimated regression coefficients with huge standard errors. But getting even this far can be informative.
    $endgroup$
    – whuber
    yesterday


















1












$begingroup$

Operating under your stated assumption that $x_3=x_1x_2$ and $x_4=x_1/x_2$ need to be entertained as possible explanatory variables in a model of a response $Y$ (and therefore not summarily dropped because they might be a little inconvenient), it can be helpful to consider alternative ways of expressing this model.



As stated, the model is of the form



$$Y \sim F(x_1, x_2, x_3, x_4; \theta) = F(x_1, x_2, x_1x_2, x_1/x_2; \theta)$$

for a given distribution family $F$ involving unknown parameters $\theta$ to be determined. For instance, a linear regression model would involve a five-dimensional parameter $\theta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)$ in the form

$$E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + \beta_4 x_1/x_2.$$



For simplicity of exposition, let's analyse the linear regression model: it will be clear how the analysis extends to other models.



One way is to restate the model in terms of $x_4$ and $x_2,$ which algebraically imply $x_1=x_2x_4$ and $x_3=x_2^2x_4:$



$$E[Y] = \beta_0 + \beta_1 x_4x_2 + \beta_2 x_2 + \beta_3 x_4x_2^2 + \beta_4 x_4 = \beta_0 + \beta_2 x_2 + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$

The last term would ordinarily be characterized as an interaction between $x_4$ and a quadratic function of $x_2.$ Since, except in very special circumstances, interactions should be included only when their component terms are included, this suggests you ought to extend the model to include an $x_2^2$ term. It would have the form

$$E[Y] = \beta_0 + \left(\beta_2 x_2 + \beta_5 x_2^2\right) + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$



That is a model involving (a) $x_4$ and (b) the simplest possible quadratic spline of $x_2.$ Such models are common: the quadratic terms allow for some amount of nonlinear response in $x_2$ and the interaction allows for the response to change with different values of $x_4$ in a controlled way.



These simple algebraic manipulations demonstrate that the proposed model is not at all unusual. They reframe it in terms of standard, well-understood concepts.
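
As a rough numerical check of the reparameterization (before the extra $x_2^2$ term is added), the two ways of writing the model can be fitted to the same data and compared. This is only a sketch: the simulated data, the coefficient values and the object names fit_a and fit_b are all assumptions made for illustration, and x2 is kept away from zero so that the ratio is well defined.

```r
set.seed(42)
n  <- 100
x1 <- runif(n, 1, 5)
x2 <- runif(n, 1, 5)                 # bounded away from 0 so x1 / x2 is safe
y  <- 1 + x1 - x2 + 0.5 * x1 * x2 + 2 * x1 / x2 + rnorm(n)

x4 <- x1 / x2

# Original form: x1, x2, product, ratio
fit_a <- lm(y ~ x1 + x2 + I(x1 * x2) + I(x1 / x2))

# Restated form: x4*x2 (= x1), x2, x4*x2^2 (= x1*x2), x4
fit_b <- lm(y ~ I(x4 * x2) + x2 + I(x4 * x2^2) + x4)

# The two designs span the same column space, so the fitted values agree
all.equal(unname(fitted(fit_a)), unname(fitted(fit_b)))
```

Because the columns are the same up to rounding, the two fits are the same model; only the labels on the terms differ.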




There remains the question of collinearity. That collinearity could be a problem is demonstrated by the case where both $x_1$ and $x_2$ are binary variables coded as $\pm 1.$ In this case, $x_1/x_2$ and $x_1x_2$ are always equal (not just collinear).
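
The binary case is easy to reproduce. In the sketch below the particular $\pm 1$ vectors are made up for illustration; any pair of $\pm 1$ vectors behaves the same way.

```r
# Two binary variables coded as -1 / +1 (hypothetical data)
x1 <- c(-1, -1,  1, 1, -1, 1,  1, -1)
x2 <- c(-1,  1, -1, 1,  1, 1, -1, -1)

prod_x  <- x1 * x2
ratio_x <- x1 / x2

all(prod_x == ratio_x)        # product and ratio coincide element by element

# Design matrix with intercept, x1, x2, product and ratio
X <- cbind(1, x1, x2, prod_x, ratio_x)
design_rank <- qr(X)$rank     # one less than the number of columns
design_rank
```

Five columns but rank four: one of the two derived regressors is redundant, and a least-squares routine would have to drop it.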



On the other hand, that collinearity might not be much of a problem can be demonstrated by exhibiting some sample data with relatively little collinearity. We would want $x_2$ to be orthogonal to $x_2^2,$ of course, and then everything will be ok provided the interactions don't introduce collinearity. Unfortunately, $x_4$ and $x_4x_2^2$ are likely to be positively correlated. But by how much?



Consider the data $x_2 = (-1,0,1,\, -1,0,1,\, -1,0,1)$ and $x_4 = (-1,\sqrt{3},-1,\,0,0,0,\,1,-\sqrt{3},1).$ The covariance matrix of the columns $(x_2, x_2^2, x_4x_2, x_4, x_4x_2^2)$ is

$$\pmatrix{3&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&2&0&0 \\ 0&0&0&5&2 \\ 0&0&0&2&2}/4.$$



It is nearly orthogonal, with correlation only between the last two variables (as expected). (Notice that introducing $x_2^2$ has not changed anything, because this variable is orthogonal to all the others.) The ratio of the largest to the smallest eigenvalue (its condition number) is $6.$ This is not beautiful, but it's not bad, either. One could easily obtain reliable coefficient estimates with such explanatory variables.
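
These figures can be checked directly. The sketch below rebuilds the five columns from the data given above and computes their sample covariance matrix (R's `cov` uses the $n-1$ denominator) and its eigenvalues; the object names X, S and ev are illustrative.

```r
x2 <- rep(c(-1, 0, 1), times = 3)
x4 <- c(-1, sqrt(3), -1,  0, 0, 0,  1, -sqrt(3), 1)

# Columns in the same order as in the text
X <- cbind(x2     = x2,
           x2sq   = x2^2,
           x4x2   = x4 * x2,
           x4     = x4,
           x4x2sq = x4 * x2^2)

S <- cov(X)
round(4 * S, 10)              # compare with the matrix above (which is divided by 4)

ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
cond <- max(ev) / min(ev)     # ratio of largest to smallest eigenvalue
cond
```

The only non-zero off-diagonal entry is the one linking $x_4$ and $x_4x_2^2$, and the eigenvalue ratio agrees with the value quoted above.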



If you don't have the luxury of choosing the values of $x_2$ and $x_4$ to arrange such near-orthogonality, then you will simply have to proceed as anyone would always do in such cases: investigate the data you have and deal with any collinearity in the usual ways (which would include ignoring it; dropping variables based on scientific considerations; selecting some principal components; using a Lasso; and so on).















$endgroup$




















    -1












    $begingroup$

    No, you don’t risk collinearity because $x_i$ are not linearly dependent in general, i.e. the below equation has just one solution holding for all possible $x_i$:
    $$a_1x_1+a_2x_2+a_3x_3+a_4x_4=0$$
    And that is $a_i=0$.



    In $x_5=x_1+x_2$ case, the following equation has non-zero solutions such that $a_1=a_2=-a_5$: $$a_1x_1+a_2x_2+a_5x_5=0$$
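
The difference between the two cases can be seen from the rank of the design matrix. This is a sketch on simulated data (the distributions and sample size are arbitrary assumptions); it only addresses exact linear dependence, not how close to dependent the columns are.

```r
set.seed(7)
n  <- 50
x1 <- rnorm(n)
x2 <- runif(n, 1, 3)          # kept away from 0 so the ratio is well defined

x5 <- x1 + x2                 # exact linear combination of x1 and x2

# Intercept, x1, x2 and the sum: 4 columns
rank_sum <- qr(cbind(1, x1, x2, x5))$rank

# Intercept, x1, x2, product and ratio: 5 columns
rank_prod_ratio <- qr(cbind(1, x1, x2, x1 * x2, x1 / x2))$rank

c(sum = rank_sum, product_and_ratio = rank_prod_ratio)
```

Adding the sum leaves the rank at 3 (a non-trivial solution $a_1=a_2=-a_5$ exists), whereas for generic continuous data the product and ratio add two new dimensions.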

















    $endgroup$








    • 1




      $begingroup$
      -1 Although usually $x_1, x_2, x_1x_2$ are not collinear, they can be extremely close to being so, which is the usual understanding of "collinear" in a regression setting. Moreover, there are plenty of special cases--which actually occur--where there are multiple solutions to your first equation. Consider binary variables $x_i$ coded as $\pm 1,$ for instance.
      $endgroup$
      – whuber
      yesterday










    • $begingroup$
      @whuber, right, I wrote "in general" precisely to allow for such cases (e.g. binary variables, domain-limited variables, etc.). But I wouldn’t call the situation of being extremely close collinear, which is why I didn't comment on it: the determinant of $X^TX$ is not $0$, even if it is close to $0$. If we had infinite precision, we would again have a unique solution, although numerically we obtain very bad solutions. Anyway, I asked the OP to change the accepted answer. Thanks for your comment.
      $endgroup$
      – gunes
      yesterday



















    -1












    $begingroup$

    Don't do



    y = x1 + x2 + x3 + x4


    ...which is equivalent in your case to



     y = x1 + x2 + (x1 * x2) + (x1 / x2)


    Just create your model with this formula:



    y = x1 * x2. 


    The * sign indicates that you are also using the interaction effect, the equivalent of your x3. Alternate notation for this:



    y = x1 + x2 + x1:x2
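
For readers less familiar with R's formula operators, the sketch below lists the design-matrix columns that each formula produces. The two numeric vectors are made up for illustration. Inside a formula, `*`, `:` and `/` are formula operators rather than arithmetic, so a literal product or ratio has to be wrapped in `I()`.

```r
# Hypothetical numeric predictors
x1 <- c(1.2, 2.5, 3.1, 4.8, 5.0)
x2 <- c(2.0, 1.5, 4.2, 3.3, 0.9)

# x1 * x2 expands to main effects plus the interaction
cols_star  <- colnames(model.matrix(~ x1 * x2))
cols_colon <- colnames(model.matrix(~ x1 + x2 + x1:x2))

# x1 / x2 is the *nesting* operator (x1 + x1:x2), not a ratio
cols_slash <- colnames(model.matrix(~ x1 / x2))

# Arithmetic product and ratio as separate regressors need I()
cols_arith <- colnames(model.matrix(~ x1 + x2 + I(x1 * x2) + I(x1 / x2)))

cols_star
cols_colon
cols_slash
cols_arith
```

The first two formulas give the same columns, and neither contains a ratio term. Only the last formula adds the ratio as a fifth column.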


    Your x3 is already taken into account when you include the interaction term, and your x4 will have huge multicolinearity problem with x1 and x2 so I would just ditch it.



    Here's an example. We create a model with two predictors and the interaction effect. Note that I could also use the formula Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length with the same result:



    > model <- lm(data = iris, Sepal.Length ~ Sepal.Width * Petal.Length)
    > summary(model)

    Call:
    lm(formula = Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.99594 -0.21165 -0.01652  0.21244  0.77249

    Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
    (Intercept)               1.40438    0.53253   2.637  0.00926 **
    Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
    Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
    Sepal.Width:Petal.Length -0.07701    0.04305  -1.789  0.07571 .
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3308 on 146 degrees of freedom
    Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
    F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


    Then I add a variable for the product and include it in the model instead of the interaction:



    > iris$prod <- iris$Sepal.Width * iris$Petal.Length
    > model.prod <-
    +   lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + prod)
    > summary(model.prod)

    Call:
    lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + prod,
        data = iris)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.99594 -0.21165 -0.01652  0.21244  0.77249

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.40438    0.53253   2.637  0.00926 **
    Sepal.Width   0.84996    0.15800   5.379 2.91e-07 ***
    Petal.Length  0.71846    0.13886   5.174 7.45e-07 ***
    prod         -0.07701    0.04305  -1.789  0.07571 .
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3308 on 146 degrees of freedom
    Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
    F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


    So, it's really the same thing. I can emphasize this by explicitly asking R to include both the product variable and the interaction term:



    > model.inter <-
    +   lm(data = iris,
    +      Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length + prod)
    > summary(model.inter)

    Call:
    lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length +
        prod, data = iris)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.99594 -0.21165 -0.01652  0.21244  0.77249

    Coefficients: (1 not defined because of singularities)
                             Estimate Std. Error t value Pr(>|t|)
    (Intercept)               1.40438    0.53253   2.637  0.00926 **
    Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
    Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
    prod                     -0.07701    0.04305  -1.789  0.07571 .
    Sepal.Width:Petal.Length       NA         NA      NA       NA
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3308 on 146 degrees of freedom
    Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
    F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


    Adding the ratio (x1 / x2, stored as divid) alongside the product changed the fit only slightly: the adjusted R-squared rose from 0.8404 to 0.8439 and the residual standard error fell from 0.3308 to 0.3271, at the cost of one more predictor.



    > model.divid <-
    +   lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + divid + prod)
    > summary(model.divid)

    Call:
    lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + divid +
        prod, data = iris)

    Residuals:
         Min       1Q   Median       3Q      Max
    -0.98673 -0.21684  0.00684  0.21559  0.71138

    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
    (Intercept)   1.58216    0.53349   2.966  0.00353 **
    Sepal.Width   0.55849    0.20998   2.660  0.00870 **
    Petal.Length  0.72299    0.13733   5.265 4.97e-07 ***
    divid         0.28102    0.13526   2.078  0.03951 *
    prod         -0.04455    0.04535  -0.982  0.32750
    ---
    Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.3271 on 145 degrees of freedom
    Multiple R-squared: 0.8481, Adjusted R-squared: 0.8439
    F-statistic: 202.4 on 4 and 145 DF, p-value: < 2.2e-16


    So why not use the ratio of x1 and x2? The problem is that you'd get multicolinearity between the dividend variable and x1 + x2. I won't include the code and output for brevity, but in my example, the adjusted R-squared for divid ~ Sepal.Width + Petal.Length was 94%, with a p value < 2.2e-16.
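
The auxiliary regression mentioned here is not shown, and neither is the construction of divid. A sketch of how such a check might look is given below, under the assumption that divid is Sepal.Width / Petal.Length (by analogy with how prod was built); the object names are illustrative.

```r
d <- iris

# Assumption: the original does not show how divid was computed
d$divid <- d$Sepal.Width / d$Petal.Length

# Regress the ratio on the two original predictors
aux <- lm(divid ~ Sepal.Width + Petal.Length, data = d)

r2        <- summary(aux)$r.squared
adj_r2    <- summary(aux)$adj.r.squared
vif_divid <- 1 / (1 - r2)     # variance inflation relative to these two predictors only

c(r.squared = r2, adj.r.squared = adj_r2, vif = vif_divid)
```

A high auxiliary R-squared indicates that the ratio is close to a linear function of the other predictors in this particular sample. Whether that justifies dropping it is a separate question, which the comments below take up.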

















    $endgroup$












    • $begingroup$
      Can I get some explanation for the downvote so that I can fix my answer? I gave a full explanation with demo.
      $endgroup$
      – GuillaumeL
      9 hours ago






    • 2




      $begingroup$
      There are several incorrect statements in this answer, such as "your x4 will have huge multicolinearity problem with x1 and x2 so I would just ditch it" and "The problem is that you'd get multicolinearity between the dividend variable and x1 + x2". Even if the assumption of the first statement were true in some specific case, the recommendation to drop x4 without further consideration is ill-advised. The initial discussion of how to fit this model in R is wrong: x1*x2 will not include the x1/x2 term.
      $endgroup$
      – whuber
      7 hours ago












































    • 1




      $begingroup$
      @Wayne Right: that would be one way to orthogonalize them. It's a generalization of orthogonal polynomials. Nothing's free, though: if there is near-multicollinearity of the original variables, it will show up in the form of one or more principal components that are essentially "noise" and likely have estimated regression coefficients with huge standard errors. But getting even this far can be informative.
      $endgroup$
      – whuber
      yesterday













    5












    5








    5





    $begingroup$

    You won't have perfect collinearity (as per your question), but you do risk multicollinearity issues with your two additional regressors.



    While they're not algebraically linear combinations of the two predictors, it can be the case that these variables (x1 to x4) in a particular sample lie close to a linear subspace, with the typical consequences of near-multicollinearity.



    For example, if the two original variates both have very small coefficients of variation then their product can be quite closely related to their sum (or some other linear combination if they're dissimilar in size). This can happen even if the original variables are not highly correlated.



    Similarly, high correlation can happen with the ratio and the difference.
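The small-coefficient-of-variation case can be checked with a quick simulation. This is only a sketch under assumed settings: the means (100), standard deviations (1), sample size and seed are all made up for illustration and are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, fixed for reproducibility
n = 1000

# Hypothetical independent predictors with small coefficients of
# variation (sd / mean = 0.01).
x1 = rng.normal(loc=100.0, scale=1.0, size=n)
x2 = rng.normal(loc=100.0, scale=1.0, size=n)


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


r_x1_x2 = corr(x1, x2)                 # inputs are independent: near 0
r_prod_sum = corr(x1 * x2, x1 + x2)    # product vs. sum: very close to 1
r_ratio_diff = corr(x1 / x2, x1 - x2)  # ratio vs. difference: very close to 1

print(f"corr(x1, x2)          = {r_x1_x2:+.4f}")
print(f"corr(x1*x2, x1 + x2)  = {r_prod_sum:+.6f}")
print(f"corr(x1/x2, x1 - x2)  = {r_ratio_diff:+.6f}")
```

The reason is a first-order expansion: writing x1 = 100 + d1 and x2 = 100 + d2 with small d1, d2, the product is roughly 10000 + 100(d1 + d2) and the ratio is roughly 1 + (d1 - d2)/100, so both are nearly linear in the original variables.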

















    $endgroup$











    edited 20 hours ago

























    answered yesterday









    Glen_b







    • 1




      $begingroup$
      +1. As I understand things, you want to center your variables (subtract the mean), which will help to reduce correlation between x1, x2, and x1 * x2.
      $endgroup$
      – Wayne
      yesterday






    • 1




      $begingroup$
      It would certainly help; but then we can come back to other examples; in the case of certain kinds of relationships between x1 and x2, the collection of x1,x2,x1*x2 and x1/x2 can still result in near-multicollinearity
      $endgroup$
      – Glen_b
      yesterday






    • 1




      $begingroup$
      +1 (So far,) this is the only answer with a correct analysis and good advice. @Wayne Centering the variables, however, changes the very meanings of x1*x2 and x1/x2. Moreover, the problem can get worse with centered variables: consider the (very common) case where both x1 and x2 are binary: after centering them, you will obtain exact collinearity among x1*x2 and x1/x2.
      $endgroup$
      – whuber
      yesterday







    • 1




      $begingroup$
      @Wayne This works only because the constant is implicitly included, whence a model in terms of $1,x_1,x_2,$ and $x_1x_2$ is equivalent to a model in terms of $1,x_1-\bar{x}_1,x_2-\bar{x}_2,$ and $(x_1-\bar{x}_1)(x_2-\bar{x}_2).$ This analysis doesn't work when applied to ratios like $x_1/x_2.$ An analysis that does work is to construct an orthogonal basis for the data columns $1,x_1,x_2,x_1x_2,$ and $x_1/x_2.$ This generalizes the idea, if not the algebra, and eliminates all collinearity.
      $endgroup$
      – whuber
      yesterday







    • 1




      $begingroup$
      @Wayne Right: that would be one way to orthogonalize them. It's a generalization of orthogonal polynomials. Nothing's free, though: if there is near-multicollinearity of the original variables, it will show up in the form of one or more principal components that are essentially "noise" and likely have estimated regression coefficients with huge standard errors. But getting even this far can be informative.
      $endgroup$
      – whuber
      yesterday

























    1












    $begingroup$

    Operating under your stated assumption that $x_3=x_1x_2$ and $x_4=x_1/x_2$ need to be entertained as possible explanatory variables in a model of a response $Y$ (and therefore not summarily dropped because they might be a little inconvenient), it can be helpful to consider alternative ways of expressing this model.



    As stated, the model is of the form



    $$Y \sim F(x_1, x_2, x_3, x_4; \theta) = F(x_1, x_2, x_1x_2, x_1/x_2; \theta)$$

    for a given distribution family $F$ involving unknown parameters $\theta$ to be determined. For instance, a linear regression model would involve a five-dimensional parameter $\theta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)$ in the form

    $$E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + \beta_4 x_1/x_2.$$



    For simplicity of exposition, let's analyse the linear regression model: it will be clear how the analysis extends to other models.



    One way is to restate the model in terms of $x_4$ and $x_2,$ which algebraically imply $x_1=x_2x_4$ and $x_3=x_2^2x_4:$



    $$E[Y] = \beta_0 + \beta_1 x_4x_2 + \beta_2 x_2 + \beta_3 x_4x_2^2 + \beta_4 x_4 = \beta_0 + \beta_2 x_2 + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$

    The last term would ordinarily be characterized as an interaction between $x_4$ and a quadratic function of $x_2.$ Since, except in very special circumstances, interactions should be included only when their component terms are included, this suggests you ought to extend the model to include an $x_2^2$ term. It would have the form

    $$E[Y] = \beta_0 + \left(\beta_2 x_2 + \beta_5 x_2^2\right) + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$



    That is a model involving (a) $x_4$ and (b) the simplest possible quadratic spline of $x_2.$ Such models are common: the quadratic terms allow for some amount of nonlinear response in $x_2$ and the interaction allows for the response to change with different values of $x_4$ in a controlled way.
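As a rough illustration of the extended model, here is a minimal sketch that builds its design matrix and fits it by ordinary least squares. Everything specific is an assumption for the example: the simulated ranges of $x_2$ and $x_4$, the coefficient values, and the noise-free response.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 200

# Hypothetical data: x2 is kept away from 0 so that x1/x2 is well defined,
# and x4 plays the role of the ratio x1/x2.
x2 = rng.uniform(1.0, 3.0, size=n)
x4 = rng.uniform(-1.0, 1.0, size=n)

# Columns of the extended model: 1, x2, x2^2, x4, x4*x2, x4*x2^2.
# In the answer's notation these carry beta_0, beta_2, beta_5, beta_4,
# beta_1 and beta_3 respectively.
X = np.column_stack([np.ones(n), x2, x2**2, x4, x4 * x2, x4 * x2**2])

# Made-up coefficients and a noise-free response, so that the fit can be
# compared with the truth directly.
beta_true = np.array([1.0, 0.5, -0.2, 2.0, 0.3, -0.1])
y = X @ beta_true

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 6))
```

Note that $x_4 x_2$ is just $x_1$ and $x_4 x_2^2$ is just $x_1 x_2$, so apart from the added $x_2^2$ column this is the originally proposed set of regressors.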



    These simple algebraic manipulations demonstrate that the proposed model is not at all unusual. They reframe it in terms of standard, well-understood concepts.




    There remains the question of collinearity. That collinearity could be a problem is demonstrated by the case where both $x_1$ and $x_2$ are binary variables coded as $\pm 1.$ In this case, $x_1/x_2$ and $x_1x_2$ are always equal (not just collinear).
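The $\pm 1$ case is easy to confirm numerically. The sketch below enumerates all four combinations (repeated three times, an arbitrary choice so that there are more rows than columns) and checks the rank of the design matrix.

```python
from itertools import product

import numpy as np

# All four combinations of two binary predictors coded as -1/+1,
# repeated three times.
combos = np.array(list(product([-1.0, 1.0], repeat=2)))
data = np.tile(combos, (3, 1))
x1, x2 = data[:, 0], data[:, 1]

x3 = x1 * x2  # product
x4 = x1 / x2  # ratio

# Design matrix with intercept: 1, x1, x2, x1*x2, x1/x2.
X = np.column_stack([np.ones(len(x1)), x1, x2, x3, x4])

identical = bool(np.array_equal(x3, x4))
rank = int(np.linalg.matrix_rank(X))

print("product equals ratio:", identical)  # True, since 1/x2 == x2 when x2 is +/-1
print("columns:", X.shape[1], "rank:", rank)  # 5 columns but rank 4
```

With a rank-deficient design matrix the five coefficients are not separately identifiable: only the sum of the coefficients on $x_1x_2$ and $x_1/x_2$ can be estimated.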



    On the other hand, that collinearity might not be much of a problem can be demonstrated by exhibiting some sample data with relatively little collinearity. We would want $x_2$ to be orthogonal to $x_2^2,$ of course, and then everything will be ok provided the interactions don't introduce collinearity. Unfortunately, $x_4$ and $x_4x_2^2$ are likely to be positively correlated. But by how much?



    Consider the data $x_2 = (-1,0,1,\, -1,0,1,\, -1,0,1)$ and $x_4 = (-1,\sqrt{3},-1,\,0,0,0,\,1,-\sqrt{3},1).$ The covariance matrix of the columns $(x_2, x_2^2, x_4x_2, x_4, x_4x_2^2)$ is

    $$\pmatrix{3&0&0&0&0 \\ 0 & 1 & 0&0&0 \\ 0&0&2&0&0 \\ 0&0&0&5&2 \\ 0&0&0&2&2}/4.$$



    It is nearly orthogonal, with correlation only between the last two variables (as expected). (Notice that introducing $x_2^2$ has not changed anything, because this variable is orthogonal to all the others.) The ratio of the largest to the smallest eigenvalue (its condition number) is $6.$ This is not beautiful, but it's not bad, either. One could easily obtain reliable coefficient estimates with such explanatory variables.
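For anyone who wants to reproduce these numbers, here is a short check using the data above. It assumes the usual sample covariance with divisor $n-1 = 8$, which is what `np.cov` computes by default.

```python
import numpy as np

s = np.sqrt(3.0)
x2 = np.array([-1, 0, 1, -1, 0, 1, -1, 0, 1], dtype=float)
x4 = np.array([-1, s, -1, 0, 0, 0, 1, -s, 1], dtype=float)

# Columns in the order used above: x2, x2^2, x4*x2, x4, x4*x2^2.
cols = np.column_stack([x2, x2**2, x4 * x2, x4, x4 * x2**2])

cov = np.cov(cols, rowvar=False)   # sample covariance, divisor n - 1
eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
ratio = eigvals[-1] / eigvals[0]   # largest / smallest

print(np.round(4 * cov, 10))       # four times the covariance matrix
print("eigenvalues * 4:", np.round(4 * eigvals, 10))
print("ratio:", round(float(ratio), 10))
```

The only nontrivial block is the $2\times 2$ one for $(x_4, x_4x_2^2)$, namely $\pmatrix{5&2\\2&2}/4,$ whose eigenvalues are $6/4$ and $1/4.$ Together with the diagonal entries $3/4, 1/4, 2/4$ this gives a largest-to-smallest ratio of $6.$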



    If you don't have the luxury of choosing the values of $x_2$ and $x_4$ to arrange such near-orthogonality, then you will simply have to proceed as anyone would always do in such cases: investigate the data you have and deal with any collinearity in the usual ways (which would include ignoring it; dropping variables based on scientific considerations; selecting some principal components; using a Lasso; and so on).
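One common way to "investigate the data you have" is to compute variance inflation factors (VIFs), which are the diagonal entries of the inverse of the predictors' correlation matrix. The sketch below is not part of the analysis above; the helper name `variance_inflation_factors` and the simulated data (independent predictors with small coefficients of variation) are assumptions made for illustration.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X: diagonal of the inverse correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))


rng = np.random.default_rng(3)  # arbitrary seed
n = 500

# Hypothetical predictors with means far from zero relative to their spread.
x1 = rng.normal(loc=100.0, scale=1.0, size=n)
x2 = rng.normal(loc=100.0, scale=1.0, size=n)

# Candidate regressors (the intercept is excluded from the VIF calculation).
X = np.column_stack([x1, x2, x1 * x2, x1 / x2])
vif = variance_inflation_factors(X)

for name, v in zip(["x1", "x2", "x1*x2", "x1/x2"], vif):
    print(f"VIF({name}) = {v:.1f}")
```

In a setting like this one the VIFs come out very large, which flags exactly the near-multicollinearity discussed in this thread. What to do about it (ignore it, drop variables, use principal components, a Lasso, and so on) remains a judgment call.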















    $endgroup$



























        answered 5 hours ago









        whuber





















            -1












            $begingroup$

            No, you don’t risk collinearity because $x_i$ are not linearly dependent in general, i.e. the below equation has just one solution holding for all possible $x_i$:
            $$a_1x_1+a_2x_2+a_3x_3+a_4x_4=0$$
            And that is $a_i=0$.



            In the $x_5=x_1+x_2$ case, by contrast, the following equation has non-zero solutions, namely any with $a_1=a_2=-a_5$: $$a_1x_1+a_2x_2+a_5x_5=0$$
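The difference between the two situations can be seen by comparing matrix ranks. This is a sketch on made-up data (uniform draws on $[1,2]$, arbitrary seed); it only addresses exact linear dependence, not near-collinearity.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 50

# Hypothetical predictors, kept away from 0 so the ratio is well defined.
x1 = rng.uniform(1.0, 2.0, size=n)
x2 = rng.uniform(1.0, 2.0, size=n)

# Sum case: x5 = x1 + x2 is an exact linear combination of x1 and x2.
X_sum = np.column_stack([x1, x2, x1 + x2])
# Product/ratio case: x3 = x1*x2 and x4 = x1/x2 are nonlinear in x1, x2.
X_prod_ratio = np.column_stack([x1, x2, x1 * x2, x1 / x2])

rank_sum = int(np.linalg.matrix_rank(X_sum))
rank_prod_ratio = int(np.linalg.matrix_rank(X_prod_ratio))

# The non-zero solution a1 = a2 = -a5 from the equation above.
a = np.array([1.0, 1.0, -1.0])
residual = X_sum @ a

print("rank with x1 + x2:", rank_sum, "of", X_sum.shape[1], "columns")
print("rank with x1*x2, x1/x2:", rank_prod_ratio, "of", X_prod_ratio.shape[1], "columns")
print("max |a1*x1 + a2*x2 + a5*x5|:", float(np.max(np.abs(residual))))
```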

















            $endgroup$








            • 1




              $begingroup$
              -1 Although usually $x_1, x_2, x_1x_2$ are not collinear, they can be extremely close to being so, which is the usual understanding of "collinear" in a regression setting. Moreover, there are plenty of special cases--which actually occur--where there are multiple solutions to your first equation. Consider binary variables $x_i$ coded as $pm 1,$ for instance.
              $endgroup$
              – whuber
              yesterday










            • $begingroup$
              @whuber, right, I tried to note down the situation as in general to actually refer to such cases (e.g. binary vars, domain limited variables etc.). But, I wouldn’t consider the situation of being extremely close as collinear, which is why I never commented on the situation. Because, the determinant of $X^TX$ is not $0$. Even if it is close to $0$. when we had infinite precision, we would again have a unique solution, although numerically we obtain very bad solutions. Anyway, I asked the OP to change the accepted answer. Thanks for your comment.
              $endgroup$
              – gunes
              yesterday
















            -1












            $begingroup$

            No, you don’t risk collinearity because $x_i$ are not linearly dependent in general, i.e. the below equation has just one solution holding for all possible $x_i$:
            $$a_1x_1+a_2x_2+a_3x_3+a_4x_4=0$$
            And that is $a_i=0$.



            In $x_5=x_1+x_2$ case, the following equation has non-zero solutions such that $a_1=a_2=-a_5$: $$a_1x_1+a_2x_2+a_5x_5=0$$






            share|cite|improve this answer











            $endgroup$








            • 1




              $begingroup$
              -1 Although usually $x_1, x_2, x_1x_2$ are not collinear, they can be extremely close to being so, which is the usual understanding of "collinear" in a regression setting. Moreover, there are plenty of special cases--which actually occur--where there are multiple solutions to your first equation. Consider binary variables $x_i$ coded as $pm 1,$ for instance.
              $endgroup$
              – whuber
              yesterday










            • $begingroup$
              @whuber, right, I tried to note down the situation as in general to actually refer to such cases (e.g. binary vars, domain limited variables etc.). But, I wouldn’t consider the situation of being extremely close as collinear, which is why I never commented on the situation. Because, the determinant of $X^TX$ is not $0$. Even if it is close to $0$. when we had infinite precision, we would again have a unique solution, although numerically we obtain very bad solutions. Anyway, I asked the OP to change the accepted answer. Thanks for your comment.
              $endgroup$
              – gunes
              yesterday






















            edited yesterday

























            answered yesterday









            gunes



















            -1












            $begingroup$

            Don't do



            y = x1 + x2 + x3 + x4


            ...which is equivalent in your case to



             y = x1 + x2 + (x1 * x2) + (x1 / x2)


            Just create your model with this formula:



            y = x1 * x2. 


            In R's formula syntax, the * operator includes both main effects and their interaction, and the interaction is the equivalent of your x3. An alternate notation for this is:



            y = x1 + x2 + x1:x2


            Your x3 is already taken into account when you include the interaction term, and your x4 will have a huge multicollinearity problem with x1 and x2, so I would just ditch it.
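
            To see how R actually expands these formulas, one can inspect the term labels directly. This is a minimal sketch using the placeholder names y, x1 and x2 from above; no data are needed. It also shows a caveat: inside a formula, `/` is a nesting operator rather than division, so a quotient predictor has to be wrapped in `I()` or precomputed as its own column.

```r
# terms() parses a formula symbolically, so y, x1, x2 need not exist as data.
labels_star  <- attr(terms(y ~ x1 * x2),            "term.labels")
labels_colon <- attr(terms(y ~ x1 + x2 + x1:x2),    "term.labels")

# In a formula, x1 / x2 means "x2 nested within x1", not arithmetic division.
labels_slash <- attr(terms(y ~ x1 / x2),            "term.labels")

# I() protects the arithmetic, giving a genuine quotient predictor.
labels_ratio <- attr(terms(y ~ x1 * x2 + I(x1 / x2)), "term.labels")

print(labels_star)
print(labels_colon)
print(labels_slash)
print(labels_ratio)
```

            The first two formulas expand to the same terms, and neither contains a quotient term.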



            Here's an example. We create a model with two predictors and the interaction effect. Note that I could also use the formula Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length with the same result:



            > model <- lm(data = iris, Sepal.Length ~ Sepal.Width * Petal.Length)
            > summary(model)

            Call:
            lm(formula = Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.99594 -0.21165 -0.01652  0.21244  0.77249

            Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)
            (Intercept)               1.40438    0.53253   2.637  0.00926 **
            Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
            Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
            Sepal.Width:Petal.Length -0.07701    0.04305  -1.789  0.07571 .
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.3308 on 146 degrees of freedom
            Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
            F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


            Then I add a variable for the product and include it in the model instead of the interaction:



            > 
            > iris$prod <- iris$Sepal.Width * iris$Petal.Length
            > model.prod <-
            + lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + prod)
            > summary(model.prod)

            Call:
            lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + prod,
            data = iris)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.99594 -0.21165 -0.01652  0.21244  0.77249

            Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
            (Intercept)   1.40438    0.53253   2.637  0.00926 **
            Sepal.Width   0.84996    0.15800   5.379 2.91e-07 ***
            Petal.Length  0.71846    0.13886   5.174 7.45e-07 ***
            prod         -0.07701    0.04305  -1.789  0.07571 .
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.3308 on 146 degrees of freedom
            Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
            F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


            So, it's really the same thing. I can emphasize this by explicitly asking R to include both the product variable and the interaction term:



            > model.inter <-
            + lm(data = iris,
            + Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length + prod)
            > summary(model.inter)

            Call:
            lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length +
            prod, data = iris)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.99594 -0.21165 -0.01652  0.21244  0.77249

            Coefficients: (1 not defined because of singularities)
                                     Estimate Std. Error t value Pr(>|t|)
            (Intercept)               1.40438    0.53253   2.637  0.00926 **
            Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
            Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
            prod                     -0.07701    0.04305  -1.789  0.07571 .
            Sepal.Width:Petal.Length       NA         NA      NA       NA
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.3308 on 146 degrees of freedom
            Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404
            F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16


            Using the quotient (x1 / x2) as well as the interaction raised the adjusted R-squared only marginally (from 0.8404 to 0.8439), since it does not contribute much and costs another predictor. The residual standard error also improved, but only by a tiny bit (from 0.3308 to 0.3271).



            > model.divid <-
            + lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + divid + prod)
            > summary(model.divid)

            Call:
            lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + divid +
            prod, data = iris)

            Residuals:
                 Min       1Q   Median       3Q      Max
            -0.98673 -0.21684  0.00684  0.21559  0.71138

            Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
            (Intercept)   1.58216    0.53349   2.966  0.00353 **
            Sepal.Width   0.55849    0.20998   2.660  0.00870 **
            Petal.Length  0.72299    0.13733   5.265 4.97e-07 ***
            divid         0.28102    0.13526   2.078  0.03951 *
            prod         -0.04455    0.04535  -0.982  0.32750
            ---
            Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            Residual standard error: 0.3271 on 145 degrees of freedom
            Multiple R-squared: 0.8481, Adjusted R-squared: 0.8439
            F-statistic: 202.4 on 4 and 145 DF, p-value: < 2.2e-16


            So why not use the quotient of x1 and x2? The problem is that you'd get multicollinearity between the quotient variable and x1 + x2. I won't include the code and output for brevity, but in my example, the adjusted R-squared for divid ~ Sepal.Width + Petal.Length was 94%, with a p-value < 2.2e-16.
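
            The auxiliary regression described above could be reproduced along the following lines. This is only a sketch: the definition of `divid` is never shown in the answer, so taking it as Sepal.Width / Petal.Length (by analogy with `prod`) is an assumption, and the numbers it produces may differ from those quoted. The variance inflation factor is computed by hand as 1 / (1 - R^2) from the auxiliary fit.

```r
# Work on a copy so the built-in iris data frame is left untouched.
d <- iris

# ASSUMPTION: divid is the quotient of the same two predictors used for prod.
d$divid <- d$Sepal.Width / d$Petal.Length

# Auxiliary regression: how well do the other predictors explain divid?
aux    <- lm(divid ~ Sepal.Width + Petal.Length, data = d)
r2     <- summary(aux)$r.squared
adj_r2 <- summary(aux)$adj.r.squared

# Variance inflation factor of divid relative to these two predictors.
vif_divid <- 1 / (1 - r2)

cat("R-squared:", r2, " adjusted:", adj_r2, " VIF:", vif_divid, "\n")
```

            A high auxiliary R-squared inflates the standard error of the quotient's coefficient, but whether that justifies dropping the variable depends on the purpose of the model.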

















            $endgroup$












            • $begingroup$
              Can I get some explanation for the downvote so that I can fix my answer? I gave a full explanation with demo.
              $endgroup$
              – GuillaumeL
              9 hours ago






            • 2




              $begingroup$
              There are several incorrect statements in this answer, such as "your x4 will have huge multicolinearity problem with x1 and x2 so I would just ditch it" and "The problem is that you'd get multicolinearity between the dividend variable and x1 + x2". Even if the assumption of the first statement were true in some specific case, the recommendation to drop x4 without further consideration is ill-advised. The initial discussion of how to fit this model in R is wrong: x1*x2 will not include the x1/x2 term.
              $endgroup$
              – whuber
              7 hours ago
























            edited yesterday

























            answered yesterday









            GuillaumeL











            • Can I get some explanation for the downvote so that I can fix my answer? I gave a full explanation with demo.
              – GuillaumeL
              9 hours ago

            • There are several incorrect statements in this answer, such as "your x4 will have huge multicolinearity problem with x1 and x2 so I would just ditch it" and "The problem is that you'd get multicolinearity between the dividend variable and x1 + x2". Even if the assumption of the first statement were true in some specific case, the recommendation to drop x4 without further consideration is ill-advised. The initial discussion of how to fit this model in R is wrong: x1*x2 will not include the x1/x2 term.
              – whuber
              7 hours ago


































