Why do more variables mean deeper trees in a random forest?
I'm looking at the depth of trees in a random forest model, using the randomForest and randomForestExplainer packages in R.
The data are simulated from a basic linear regression model in which 3 predictor variables are important (p) and the rest are noise (q).
In the first test I set p = 3 and q = 10, and found that the mean minimal depth of the variables was never over 7.
In the second test I set p = 3 and q = 100, and the mean minimal depth of some variables was 17.
This can be seen in the plots below for both tests, where the colour-coded bar on the right displays the minimal depth of each variable.
So, my question is: why does adding more noise variables to my model lead to deeper trees?
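For reference, the setup looks roughly like this (the coefficients, sample size, and number of trees below are placeholders, not my exact values):

```r
library(randomForest)
library(randomForestExplainer)

set.seed(1)
n <- 1000          # sample size (placeholder)
p <- 3             # number of signal variables
q <- 100           # number of noise variables (10 in the first test)

X_signal <- matrix(rnorm(n * p), n, p)
X_noise  <- matrix(rnorm(n * q), n, q)
colnames(X_signal) <- paste0("s", 1:p)
colnames(X_noise)  <- paste0("noise", 1:q)

# y depends only on the signal variables (coefficients are placeholders)
y <- X_signal %*% c(2, -1, 0.5) + rnorm(n)

dat <- data.frame(y = as.vector(y), X_signal, X_noise)
rf  <- randomForest(y ~ ., data = dat, ntree = 500)

# Minimal depth of each variable in each tree, then the plot with the
# colour-coded mean minimal depth bar
md <- min_depth_distribution(rf)
plot_min_depth_distribution(md)
```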
Tags: r, random-forest
asked Jul 25 at 15:37 – Electrino
2 Answers
Splitting on noise features sometimes yields information gain (a reduction in entropy or in Gini impurity) purely by chance. When more noise features are included, it is more likely that the subsample of features considered at a given split is composed entirely of noise features. When that happens, no signal feature is available; by the same token, the split quality is probably lower than it would have been had a signal feature been used, so more splits (i.e. a deeper tree) are required to reach the termination criterion (usually leaf purity).
Typically, a random forest is set up to split until leaf purity. This can be a source of overfitting; indeed, the fact that your random forest is pulling in noise features for splits indicates that some amount of overfitting is taking place. To regularize the model, you can impose limits on tree depth, the minimum information gain required to split, the minimum number of samples required to split, or the leaf size, to forestall spurious splits. The improvement in model quality from doing so is usually modest. These claims draw on the discussion in Hastie et al., The Elements of Statistical Learning, p. 596.
More information about how spurious features behave in random forests can be found at https://explained.ai/rf-importance/
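In the randomForest package, for example, the knobs closest to these suggestions are nodesize (minimum terminal node size) and maxnodes (maximum number of terminal nodes per tree). A minimal sketch with illustrative values, assuming the simulated data frame dat from the question:

```r
library(randomForest)

# `dat` is assumed to be the simulated data frame from the question,
# with response y, signal variables s1..s3 and the noise columns.
rf_default <- randomForest(y ~ ., data = dat, ntree = 500)

# A more regularized forest: larger terminal nodes and a cap on tree size
# (the specific values are illustrative, not recommendations).
rf_shallow <- randomForest(y ~ ., data = dat, ntree = 500,
                           nodesize = 20,   # do not grow leaves smaller than 20 cases
                           maxnodes = 64)   # at most 64 terminal nodes per tree

# Compare out-of-bag mean squared error after all 500 trees
rf_default$mse[500]
rf_shallow$mse[500]
```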
answered Jul 25 at 15:42, edited Jul 25 at 17:34 – Sycorax
Comment: This makes sense... Thanks for your answer – Electrino, Jul 25 at 16:36
Random forests sample variables at each split. The default is to sample $\sqrt{p}$ of them each time (for classification; the randomForest regression default is $p/3$).
If you add more noise variables, the chance of the good variables being in the sample decreases, so on average they first appear at a deeper level than before.
With 4+10 variables there is about a 30% chance of each good variable being among the candidates; with 4+100 the chance is only about 10.6%. The chance of at least one good variable being available at the first split is $1-(1-p)^4$, so roughly 83% and 36% respectively; when no good variable is available, the first split has to use a noise variable, and so on.
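That "at least one good variable among the candidates" probability can also be computed exactly from the hypergeometric distribution. A small sketch under the question's setup of 3 signal variables (the percentages above use slightly different assumptions, so the exact figures differ, but the trend is the same):

```r
# Exact probability that at least one signal variable is among the
# mtry candidate variables considered at a given split.
p_any_signal <- function(n_signal, n_noise, mtry) {
  1 - choose(n_noise, mtry) / choose(n_signal + n_noise, mtry)
}

# Question's setup: 3 signal variables, 10 vs. 100 noise variables,
# assuming mtry = floor(sqrt(total number of variables))
p_any_signal(3, 10,  mtry = floor(sqrt(13)))   # ~0.58
p_any_signal(3, 100, mtry = floor(sqrt(103)))  # ~0.27
```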
answered Jul 25 at 17:14, edited Jul 25 at 17:21 – Anony-Mousse