Why do RNNs usually have fewer hidden layers than CNNs?
CNNs can have hundreds of hidden layers, and since they are often used with image data, having many layers lets them capture more complexity.
However, as far as I have seen, RNNs usually have few layers, e.g. 2-4. For example, for electrocardiogram (ECG) classification, I've seen papers use LSTMs with 4 layers and CNNs with 10-15 layers and report similar results.
Is this because RNNs/LSTMs are harder to train when they are deeper (due to vanishing gradients), or because RNNs/LSTMs tend to overfit sequential data quickly?
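For concreteness, here is a minimal sketch of the two kinds of models I mean, assuming TensorFlow/Keras, a hypothetical single-lead ECG segment of 1000 samples, and 5 classes (these numbers are placeholders, not taken from the papers):

```python
import tensorflow as tf
from tensorflow.keras import layers

TIMESTEPS, CHANNELS, NUM_CLASSES = 1000, 1, 5   # placeholder shapes, for illustration only

def lstm_model(num_layers=4):
    """A comparatively shallow recurrent classifier: a few stacked LSTM layers."""
    model = tf.keras.Sequential([layers.Input((TIMESTEPS, CHANNELS))])
    for i in range(num_layers):
        # every LSTM except the last returns the full sequence so the next one can be stacked
        model.add(layers.LSTM(64, return_sequences=(i < num_layers - 1)))
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

def cnn_model(num_conv_layers=10):
    """A deeper 1-D convolutional classifier: many stacked conv layers."""
    model = tf.keras.Sequential([layers.Input((TIMESTEPS, CHANNELS))])
    for _ in range(num_conv_layers):
        model.add(layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"))
    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

print(lstm_model().count_params(), cnn_model().count_params())
```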
deep-learning cnn lstm rnn feature-extraction
asked Jun 9 at 2:18 by KRL (edited 2 days ago by thanatoz)
3 Answers
CNN and RNN feature-extraction methods:
CNNs tend to extract spatial features. Suppose we have 10 convolution layers stacked on top of each other. The kernels of the 1st layer extract features from the input; the resulting feature map is then used as the input to the next convolution layer, which in turn produces a new feature map from it.
In this way, features are extracted level by level from the input image. If the input is a small image of 32 x 32 pixels, we will need fewer convolution layers; a bigger 256 x 256 image contains comparatively more complex features and therefore needs a deeper stack.
RNNs are temporal feature extractors, since they hold a memory of past activations. They extract features like a feed-forward NN, but they carry the extracted features across timesteps (they could also carry features extracted by convolution layers). Because they hold this kind of memory, they preserve temporal features.
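To make the "level by level" point concrete, here is a small illustration of my own (not from the papers or this answer): with stride-1 3x3 convolutions and no pooling, each extra layer widens the receptive field by only 2 pixels, which is why larger inputs push you toward deeper stacks.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions with no pooling."""
    return 1 + num_layers * (kernel_size - 1)

for n in (1, 5, 16, 128):
    print(f"{n:3d} layers -> sees {receptive_field(n):3d} pixels")
# 16 such layers already cover a 32-pixel side, but roughly 128 would be needed
# for 256 pixels; in practice pooling/striding keeps real CNNs much shallower.
```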
In the case of electrocardiogram classification:
On the basis of the papers you read, it seems that ECG data can be classified fairly easily from temporal features, which RNNs capture directly. Those temporal features help the model classify the ECGs correctly, so the RNN can stay comparatively shallow.
The CNNs end up deeper because the features extracted by individual convolution layers are not powerful enough on their own to uniquely recognize ECGs; a larger number of convolution layers is needed to extract those finer features for better classification.
In short: a strong feature keeps the model simple, whereas weaker features have to be extracted by a more complex stack of layers.
Is this because RNNs/LSTMs are harder to train if they are deeper (due to gradient vanishing problems) or because RNNs/LSTMs tend to overfit sequential data fast?
That is one possible way to think about it. LSTMs/RNNs are prone to overfitting, but, as @Ismael EL ATIFI points out in the comments, the vanishing-gradient problem is not the reason they overfit.
I thank @Ismael EL ATIFI for the corrections.
answered Jun 9 at 7:32 by Shubham Panchal (edited Jun 10 at 1:06)
"LSTM/RNNs are prone to overfitting because of vanishing gradient problem." I disagree. Overvitting can not be caused by vanishing gradient problem just because vanishing gradient prevents parameters of the early layers to be properly updated and thus to overfit. "Convolution layers don't generally overfit, they are feature extractors." Convolution layers CAN overfit just as any other trainable layer and any CNN will definetely overfit if it has too many parameters compared to the quantity and variety of data it is trained on.
– Ismael EL ATIFI
Jun 9 at 17:57
About the number of layers
The reason can be understood by looking at the architectures of a CNN and an LSTM and how they might operate on time-series data. That said, the number of layers depends heavily on the problem you are trying to solve: you might be able to solve ECG classification with a few LSTM layers, but for activity recognition from videos you will need more.
Putting that aside, here is how a CNN and an LSTM might process a time-series signal. Take a very simple signal in which every three positive cycles are followed by a negative cycle.
For a CNN to see even this simple pattern, it takes 4 layers in this example, because when a CNN processes a time-series input, each convolution output knows nothing about the previous outputs (they are not connected). An LSTM, however, can capture it with a single layer, since it can remember temporal patterns over hundreds of time steps: each output depends on the current input as well as the previous inputs the model has seen.
I am not saying this is the only reason, but it is probably one of the main factors why CNNs require more layers than LSTMs do for time-series data.
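As a rough sketch of the setup described above (my own construction, not the answerer's, assuming TensorFlow/Keras and simplifying each "cycle" to a single +1 or -1 sample), a single recurrent layer can be asked to predict the next value of such a signal:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Repeating toy pattern: three positive samples followed by one negative sample.
pattern = np.array([1.0, 1.0, 1.0, -1.0])
signal = np.tile(pattern, 250)                      # 1000 time steps

WINDOW = 8                                          # assumed context length
X = np.stack([signal[i:i + WINDOW] for i in range(len(signal) - WINDOW)])
y = signal[WINDOW:]
X = X[..., None]                                    # shape (samples, timesteps, 1)

model = tf.keras.Sequential([
    layers.Input((WINDOW, 1)),
    layers.LSTM(8),                                 # a single recurrent layer
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)                # quick demo fit
print(model.predict(X[:4], verbose=0).round(2))
```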
About vanishing gradients and overfitting
In RNNs, the vanishing gradient is more likely to become a problem within a single layer (across time steps) than across layers: when processing many sequential steps, knowledge about the first few steps tends to disappear. And I don't think sequential models are particularly likely to overfit time-series data if you regularize them correctly. So this design choice is probably influenced more by the architecture and capabilities of the models than by vanishing gradients or overfitting.
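A tiny numerical illustration of that within-layer effect (mine, not the answerer's): backpropagating through T time steps multiplies the gradient by roughly the same recurrent Jacobian T times, so its norm shrinks geometrically and the earliest inputs barely influence the updates.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16
W = rng.normal(scale=0.3 / np.sqrt(H), size=(H, H))  # recurrent weight matrix, small scale

for T in (1, 10, 50, 100):
    g = np.ones(H)                                    # gradient arriving at the final step
    for _ in range(T):
        g = W.T @ g                                   # one backward step (nonlinearity ignored)
    print(f"backprop through {T:3d} steps: |grad| = {np.linalg.norm(g):.2e}")
```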
answered Jun 12 at 4:29 by thushv89 (new contributor)
I see two possible reasons why an RNN could need fewer layers than a CNN to reach the same performance:
- RNN layers are generally fully connected layers, which have more parameters than a convolutional layer.
- An RNN has components that combine inputs by multiplication (e.g. the gates of an LSTM), unlike a CNN, which can only combine inputs by weighted addition. This multiplicative capacity gives the RNN more "power" to compute features; a CNN would need many layers to "imitate" it (see the sketch below).
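To make the multiplicative point concrete, here is a bare-bones numpy sketch of a single LSTM step (illustrative only; the weight shapes and sizes are arbitrary). The pre-activations are plain weighted sums, exactly what a convolution computes, but the cell then combines them by elementwise multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; W, U, b hold the four gates stacked as 4*H rows."""
    H = h.shape[0]
    z = W @ x + U @ h + b                    # weighted sums, as a conv layer would compute
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g                    # <- multiplicative interactions
    h_new = o * np.tanh(c_new)               # <- another multiplication
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4                                  # arbitrary input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h)
```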
answered Jun 9 at 18:31 by Ismael EL ATIFI