Question

I have a model in which I want to predict Y.

My regressors, X, are x1 and x2.

For some reason I believe that it would also be useful to include in the model:

x3 = x1 * x2

x4 = x1 / x2

Can I use the regressors x1, x2, x3 and x4 all together, or do I risk a perfect collinearity problem?
I know, for instance, that using x5 = x1 + x2 would yield perfect collinearity and hence a completely useless regressor.

To would-be respondents: If you believe the OP's model is always problematic, you are asserting that all quartic polynomial regressions are problematic. To see why that is, suppose that in reality the response $Y$ has a quartic polynomial relationship with a single variable $x:$ $$E[Y]=\beta_0+\beta_1x+\beta_2x^2+\beta_3x^3+\beta_4x^4.$$ Suppose one has measured not only $x_2=x$ but also--unwittingly--$x_1=x^3.$ In terms of these variables, $x^2=x_1/x_2$ and $x^4=x_1x_2.$ Thus, this model is identical to $$E[Y]=\beta_0+\beta_1x_2+\beta_2x_1/x_2+\beta_3x_1+\beta_4x_1x_2.$$ – yesterday
OP has not specified the type of regression, nor that x1 = x2. – 9 hours ago

score 5 · Accepted Answer · 2019-04-07 01:33:57Z

You won't have perfect collinearity (as per your question), but you do risk multicollinearity issues with your two additional regressors.

While they're not algebraically linear combinations of the two predictors, these variables (x1-x4) might, in a particular sample, lie close to a linear subspace, with the typical consequences of near-multicollinearity.

For example, if the two original variates both have very small coefficients of variation then their product can be quite closely related to their sum (or some other linear combination if they're dissimilar in size). This can happen even if the original variables are not highly correlated.

Similarly, the ratio can be highly correlated with the difference.
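
A quick way to see this is a first-order expansion: when $x_1$ and $x_2$ vary little around their means $\mu_1$ and $\mu_2,$ we have $x_1x_2 \approx \mu_1\mu_2 + \mu_2(x_1-\mu_1) + \mu_1(x_2-\mu_2),$ which is linear in $x_1$ and $x_2.$ The sketch below illustrates this by simulation. The sample size, the normal distribution, the mean of 100 and the standard deviation of 1 are all assumptions chosen for illustration, not values from the question.

```python
import random


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def simulate(n=1000, mean=100.0, sd=1.0, seed=0):
    """Draw independent x1, x2 with coefficient of variation sd / mean
    and report three sample correlations."""
    rng = random.Random(seed)
    x1 = [rng.gauss(mean, sd) for _ in range(n)]
    x2 = [rng.gauss(mean, sd) for _ in range(n)]
    return {
        "x1 vs x2": pearson(x1, x2),
        "product vs sum": pearson([a * b for a, b in zip(x1, x2)],
                                  [a + b for a, b in zip(x1, x2)]),
        "ratio vs difference": pearson([a / b for a, b in zip(x1, x2)],
                                       [a - b for a, b in zip(x1, x2)]),
    }


if __name__ == "__main__":
    for label, r in simulate().items():
        print(f"{label}: {r:.4f}")
```

With these settings the product should track the sum, and the ratio the difference, almost perfectly, even though x1 and x2 themselves are drawn independently. When the two means differ in size, the relevant linear combination is the weighted one given by the expansion rather than the plain sum.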

edited 20 hours ago

answered yesterday

Glen_b♦

+1. As I understand things, you want to center your variables (subtract the mean), which will help to reduce correlation between x1, x2, and x1 * x2.
– Wayne
yesterday

It would certainly help; but then we can come back to other examples: in the case of certain kinds of relationships between x1 and x2, the collection of x1, x2, x1*x2 and x1/x2 can still result in near-multicollinearity.
– Glen_b♦
yesterday

+1 (So far,) this is the only answer with a correct analysis and good advice. @Wayne Centering the variables, however, changes the very meanings of x1*x2 and x1/x2. Moreover, the problem can get worse with centered variables: consider the (very common) case where both x1 and x2 are binary: after centering them, you will obtain exact collinearity among x1*x2 and x1/x2.
– whuber♦
yesterday

@Wayne This works only because the constant is implicitly included, whence a model in terms of $1,x_1,x_2,$ and $x_1x_2$ is equivalent to a model in terms of $1,x_1-\bar{x}_1,x_2-\bar{x}_2,$ and $(x_1-\bar{x}_1)(x_2-\bar{x}_2).$ This analysis doesn't work when applied to ratios like $x_1/x_2.$ An analysis that does work is to construct an orthogonal basis for the data columns $1,x_1,x_2,x_1x_2,$ and $x_1/x_2.$ This generalizes the idea, if not the algebra, and eliminates all collinearity.
– whuber♦
yesterday

@Wayne Right: that would be one way to orthogonalize them. It's a generalization of orthogonal polynomials. Nothing's free, though: if there is near-multicollinearity of the original variables, it will show up in the form of one or more principal components that are essentially "noise" and likely have estimated regression coefficients with huge standard errors. But getting even this far can be informative.
– whuber♦
yesterday


whuber♦ · Accepted Answer · 2019-04-07 16:49:52Z

Operating under your stated assumption that $x_3=x_1x_2$ and $x_4=x_1/x_2$ need to be entertained as possible explanatory variables in a model of a response $Y$ (and therefore not summarily dropped because they might be a little inconvenient), it can be helpful to consider alternative ways of expressing this model.

As stated, the model is of the form

$$Y \sim F(x_1, x_2, x_3, x_4; \theta) = F(x_1, x_2, x_1x_2, x_1/x_2; \theta)$$

for a given distribution family $F$ involving unknown parameters $\theta$ to be determined. For instance, a linear regression model would involve a five-dimensional parameter $\theta = (\beta_0, \beta_1, \beta_2, \beta_3, \beta_4)$ in the form

$$E[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2 + \beta_4 x_1/x_2.$$

For simplicity of exposition, let's analyse the linear regression model: it will be clear how the analysis extends to other models.

One way is to restate the model in terms of $x_4$ and $x_2,$ which algebraically imply $x_1=x_2x_4$ and $x_3=x_2^2x_4:$

$$E[Y] = \beta_0 + \beta_1 x_4x_2 + \beta_2 x_2 + \beta_3 x_4x_2^2 + \beta_4 x_4 = \beta_0 + \beta_2 x_2 + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$

The last term would ordinarily be characterized as an interaction between $x_4$ and a quadratic function of $x_2.$ Since, except in very special circumstances, interactions should be included only when their component terms are included, this suggests you ought to extend the model to include an $x_2^2$ term. It would have the form

$$E[Y] = \beta_0 + \left(\beta_2 x_2 + \beta_5 x_2^2\right) + \beta_4 x_4 + x_4\left(\beta_1 x_2 + \beta_3 x_2^2\right).$$

That is a model involving (a) $x_4$ and (b) the simplest possible quadratic spline of $x_2.$ Such models are common: the quadratic terms allow for some amount of nonlinear response in $x_2$ and the interaction allows for the response to change with different values of $x_4$ in a controlled way.

These simple algebraic manipulations demonstrate that the proposed model is not at all unusual. They reframe it in terms of standard, well-understood concepts.

There remains the question of collinearity. That collinearity could be a problem is demonstrated by the case where both $x_1$ and $x_2$ are binary variables coded as $\pm 1.$ In this case $x_2^2=1,$ so $1/x_2=x_2$ and therefore $x_1/x_2$ and $x_1x_2$ are always equal (not just collinear).

On the other hand, that collinearity might not be much of a problem can be demonstrated by exhibiting some sample data with relatively little collinearity. We would want $x_2$ to be orthogonal to $x_2^2,$ of course, and then everything will be ok provided the interactions don't introduce collinearity. Unfortunately, $x_4$ and $x_4x_2^2$ are likely to be positively correlated. But by how much?

Consider the data $x_2 = (-1,0,1,\, -1,0,1,\, -1,0,1)$ and $x_4 = (-1,\sqrt{3},-1,\,0,0,0,\,1,-\sqrt{3},1).$ The covariance matrix of the columns $(x_2, x_2^2, x_4x_2, x_4, x_4x_2^2)$ is

$$\pmatrix{3&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&2&0&0 \\ 0&0&0&5&2 \\ 0&0&0&2&2}/4.$$

It is nearly orthogonal, with correlation only between the last two variables (as expected). (Notice that introducing $x_2^2$ has not changed anything, because this variable is orthogonal to all the others.) The ratio of the largest to the smallest eigenvalue (its condition number) is $6.$ This is not beautiful, but it's not bad, either. One could easily obtain reliable coefficient estimates with such explanatory variables.
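
The covariance matrix and condition number quoted above can be checked numerically. The following is a minimal sketch in plain Python, not part of the original answer: the helper names `covariance_matrix` and `symmetric_eigenvalues` are made up for this illustration, and the eigenvalues are found with a basic Jacobi rotation scheme rather than a library routine.

```python
import math


def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1) of equal-length columns."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    return [[sum((a - ma) * (b - mb) for a, b in zip(ca, cb)) / (n - 1)
             for cb, mb in zip(columns, means)]
            for ca, ma in zip(columns, means)]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    k = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(k) for j in range(k) if i != j)
        if off < tol:
            break
        for p in range(k - 1):
            for q in range(p + 1, k):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle (at most 45 degrees) that zeroes a[p][q].
                diff = a[q][q] - a[p][p]
                theta = (math.pi / 4 if diff == 0
                         else 0.5 * math.atan(2 * a[p][q] / diff))
                c, s = math.cos(theta), math.sin(theta)
                for i in range(k):  # A <- A J (update columns p and q)
                    aip, aiq = a[i][p], a[i][q]
                    a[i][p] = c * aip - s * aiq
                    a[i][q] = s * aip + c * aiq
                for j in range(k):  # A <- J^T A (update rows p and q)
                    apj, aqj = a[p][j], a[q][j]
                    a[p][j] = c * apj - s * aqj
                    a[q][j] = s * apj + c * aqj
    return sorted(a[i][i] for i in range(k))


r3 = math.sqrt(3)
x2 = [-1, 0, 1, -1, 0, 1, -1, 0, 1]
x4 = [-1, r3, -1, 0, 0, 0, 1, -r3, 1]
columns = [
    x2,                                   # x2
    [v * v for v in x2],                  # x2^2
    [a * b for a, b in zip(x4, x2)],      # x4 * x2
    x4,                                   # x4
    [a * b * b for a, b in zip(x4, x2)],  # x4 * x2^2
]

cov = covariance_matrix(columns)
eigenvalues = symmetric_eigenvalues(cov)
condition_number = eigenvalues[-1] / eigenvalues[0]

if __name__ == "__main__":
    for row in cov:
        print([round(4 * v, 6) for v in row])  # entries in units of 1/4
    print("condition number:", round(condition_number, 6))
```

The rows are printed after multiplying by 4 so that they can be compared directly with the matrix displayed in the text.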

If you don't have the luxury of choosing the values of $x_2$ and $x_4$ to arrange such near-orthogonality, then you will simply have to proceed as anyone would always do in such cases: investigate the data you have and deal with any collinearity in the usual ways (which would include ignoring it; dropping variables based on scientific considerations; selecting some principal components; using a Lasso; and so on).

score -1 · Accepted Answer · 2019-04-06 10:02:44Z

No, you don’t risk collinearity, because the $x_i$ are not linearly dependent in general; i.e., the equation below has just one solution that holds for all possible $x_i$:
$$a_1x_1+a_2x_2+a_3x_3+a_4x_4=0$$
And that is $a_i=0$ for every $i$.

In the $x_5=x_1+x_2$ case, the following equation has non-zero solutions, namely those with $a_1=a_2=-a_5$: $$a_1x_1+a_2x_2+a_5x_5=0$$
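
Exact linear dependence in a given sample can be checked by computing the rank of the design matrix. The sketch below does this in exact rational arithmetic. The helper names `rank` and `design` and the small data sets are hypothetical, chosen only for illustration. Note that a full-rank result rules out perfect collinearity only; it says nothing about how close the columns are to being dependent.

```python
from fractions import Fraction


def rank(columns):
    """Rank of the matrix with the given columns, by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in zip(*columns)]
    r = 0
    for col in range(len(columns)):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue  # no pivot: this column depends on the earlier ones
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


def design(x1, x2):
    """Columns 1, x1, x2, x3 = x1 * x2 and x4 = x1 / x2."""
    ones = [1] * len(x1)
    x3 = [a * b for a, b in zip(x1, x2)]
    x4 = [Fraction(a, b) for a, b in zip(x1, x2)]
    return [ones, x1, x2, x3, x4]


# Illustrative integer data (an assumption, not from the question).
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 1, 4, 3, 6, 5]
general = design(x1, x2)
with_sum = general + [[a + b for a, b in zip(x1, x2)]]  # adds x5 = x1 + x2

# Binary predictors coded as +1 / -1, all four combinations.
binary = design([-1, -1, 1, 1], [-1, 1, -1, 1])

if __name__ == "__main__":
    print("columns:", len(general), "rank:", rank(general))
    print("columns:", len(with_sum), "rank:", rank(with_sum))
    print("columns:", len(binary), "rank:", rank(binary))
```

For this data the first design has rank 5, equal to its number of columns; adding $x_5$ gives six columns but leaves the rank at 5; and the $\pm 1$ design has rank 4 because there $x_1x_2$ and $x_1/x_2$ coincide.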

edited yesterday

answered yesterday

gunes

-1 Although usually $x_1, x_2, x_1x_2$ are not collinear, they can be extremely close to being so, which is the usual understanding of "collinear" in a regression setting. Moreover, there are plenty of special cases--which actually occur--where there are multiple solutions to your first equation. Consider binary variables $x_i$ coded as $\pm 1,$ for instance.
– whuber♦
yesterday

@whuber, right, I wrote "in general" precisely to set aside such cases (e.g. binary variables, domain-limited variables, etc.). But I wouldn’t call being extremely close to collinear "collinear", which is why I never commented on that situation: the determinant of $X^TX$ is not $0$, even if it is close to $0$. With infinite precision we would still have a unique solution, although numerically we obtain very bad solutions. Anyway, I asked the OP to change the accepted answer. Thanks for your comment.
– gunes
yesterday

score -1 · Accepted Answer · 2019-04-06 16:53:28Z

Don't do

y = x1 + x2 + x3 + x4

...which is equivalent in your case to

y = x1 + x2 + (x1 * x2) + (x1 / x2)

Just create your model with this formula:

y = x1 * x2.

The * sign indicates that you are also using the interaction effect, the equivalent of your x3. Alternate notation for this:

y = x1 + x2 + x1:x2

Your x3 is already taken into account when you include the interaction term, and your x4 will have a huge multicollinearity problem with x1 and x2, so I would just ditch it.

Here's an example. We create a model with two predictors and the interaction effect. Note that I could also use the formula Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length with the same result:

> model <- lm(data = iris, Sepal.Length ~ Sepal.Width * Petal.Length)
> summary(model)

Call:
lm(formula = Sepal.Length ~ Sepal.Width * Petal.Length, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99594 -0.21165 -0.01652  0.21244  0.77249

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               1.40438    0.53253   2.637  0.00926 **
Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
Sepal.Width:Petal.Length -0.07701    0.04305  -1.789  0.07571 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3308 on 146 degrees of freedom
Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404 
F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16

Then I add a variable for the product and include it in the model instead of the interaction:

> 
> iris$prod <- iris$Sepal.Width * iris$Petal.Length
> model.prod <-
+ lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + prod)
> summary(model.prod)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + prod, 
 data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99594 -0.21165 -0.01652  0.21244  0.77249

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.40438    0.53253   2.637  0.00926 **
Sepal.Width   0.84996    0.15800   5.379 2.91e-07 ***
Petal.Length  0.71846    0.13886   5.174 7.45e-07 ***
prod         -0.07701    0.04305  -1.789  0.07571 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3308 on 146 degrees of freedom
Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404 
F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16

So, it's really the same thing. I can emphasize this by explicitly asking R to include both the product variable and the interaction term:

> model.inter <-
+ lm(data = iris,
+ Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length + prod)
> summary(model.inter)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Sepal.Width:Petal.Length + 
 prod, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.99594 -0.21165 -0.01652  0.21244  0.77249

Coefficients: (1 not defined because of singularities)
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)               1.40438    0.53253   2.637  0.00926 **
Sepal.Width               0.84996    0.15800   5.379 2.91e-07 ***
Petal.Length              0.71846    0.13886   5.174 7.45e-07 ***
prod                     -0.07701    0.04305  -1.789  0.07571 .
Sepal.Width:Petal.Length       NA         NA      NA       NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3308 on 146 degrees of freedom
Multiple R-squared: 0.8436, Adjusted R-squared: 0.8404 
F-statistic: 262.5 on 3 and 146 DF, p-value: < 2.2e-16

Adding the ratio (x1 / x2) alongside the product barely changes the fit. Judging by the output below, the adjusted R-squared rises slightly (from 0.8404 to 0.8439) and the residual standard error drops from 0.3308 to 0.3271, at the cost of one extra predictor.

> # divid (the ratio column) must already exist in iris; its definition is not shown here
> model.divid <-
+ lm(data = iris, Sepal.Length ~ Sepal.Width + Petal.Length + divid + prod)
> summary(model.divid)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + divid + 
 prod, data = iris)

Residuals:
     Min       1Q   Median       3Q      Max
-0.98673 -0.21684  0.00684  0.21559  0.71138

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)   1.58216    0.53349   2.966  0.00353 **
Sepal.Width   0.55849    0.20998   2.660  0.00870 **
Petal.Length  0.72299    0.13733   5.265 4.97e-07 ***
divid         0.28102    0.13526   2.078  0.03951 *
prod         -0.04455    0.04535  -0.982  0.32750
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3271 on 145 degrees of freedom
Multiple R-squared: 0.8481, Adjusted R-squared: 0.8439 
F-statistic: 202.4 on 4 and 145 DF, p-value: < 2.2e-16

So why not use the ratio of x1 and x2? The problem is that you'd get multicollinearity between the dividend variable and x1 + x2. I won't include the code and output for brevity, but in my example, the adjusted R-squared for divid ~ Sepal.Width + Petal.Length was 94%, with a p-value < 2.2e-16.

Can I get some explanation for the downvote so that I can fix my answer? I gave a full explanation with a demo. – 9 hours ago
There are several incorrect statements in this answer, such as "your x4 will have a huge multicollinearity problem with x1 and x2, so I would just ditch it" and "The problem is that you'd get multicollinearity between the dividend variable and x1 + x2". Even if the assumption of the first statement were true in some specific case, the recommendation to drop x4 without further consideration is ill-advised. The initial discussion of how to fit this model in R is wrong: x1*x2 will not include the x1/x2 term. – 7 hours ago
