Why do RNNs usually have fewer hidden layers than CNNs?
CNNs can have hundreds of hidden layers, and since they are often used with image data, having many layers lets them capture more complexity.
However, as far as I have seen, RNNs usually have few layers, e.g. 2-4. For example, for electrocardiogram (ECG) classification, I've seen papers use LSTMs with 4 layers and CNNs with 10-15 layers and report similar results.
Is this because RNNs/LSTMs are harder to train when they are deeper (due to vanishing gradients), or because RNNs/LSTMs tend to overfit sequential data quickly?
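For concreteness, here is a minimal sketch of the two kinds of models I mean, assuming TensorFlow/Keras, a hypothetical single-lead ECG segment of 1000 samples, and 5 classes (these numbers are placeholders, not taken from the papers):

```python
import tensorflow as tf
from tensorflow.keras import layers

TIMESTEPS, CHANNELS, NUM_CLASSES = 1000, 1, 5   # placeholder shapes, for illustration only

def lstm_model(num_layers=4):
    """A comparatively shallow recurrent classifier: a few stacked LSTM layers."""
    model = tf.keras.Sequential([layers.Input((TIMESTEPS, CHANNELS))])
    for i in range(num_layers):
        # every LSTM except the last returns the full sequence so the next one can be stacked
        model.add(layers.LSTM(64, return_sequences=(i < num_layers - 1)))
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

def cnn_model(num_conv_layers=10):
    """A deeper 1-D convolutional classifier: many stacked conv layers."""
    model = tf.keras.Sequential([layers.Input((TIMESTEPS, CHANNELS))])
    for _ in range(num_conv_layers):
        model.add(layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"))
    model.add(layers.GlobalAveragePooling1D())
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    return model

print(lstm_model().count_params(), cnn_model().count_params())
```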
deep-learning cnn lstm rnn feature-extraction
asked Jun 9 at 2:18 by KRL (edited 2 days ago by thanatoz)
3 Answers
CNN and RNN feature-extraction methods:
CNNs tend to extract spatial features. Suppose we have 10 convolution layers stacked on top of each other. The kernels of the 1st layer extract features from the input; the resulting feature map is then used as the input to the next convolution layer, which in turn produces a new feature map from it.
In this way, features are extracted level by level from the input image. If the input is a small image of 32 x 32 pixels, we will need fewer convolution layers; a bigger 256 x 256 image contains comparatively more complex features and therefore needs a deeper stack.
RNNs are temporal feature extractors, since they hold a memory of past activations. They extract features like a feed-forward NN, but they carry the extracted features across timesteps (they could also carry features extracted by convolution layers). Because they hold this kind of memory, they preserve temporal features.
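To make the "level by level" point concrete, here is a small illustration of my own (not from the papers or this answer): with stride-1 3x3 convolutions and no pooling, each extra layer widens the receptive field by only 2 pixels, which is why larger inputs push you toward deeper stacks.

```python
def receptive_field(num_layers, kernel_size=3):
    """Receptive field of a stack of stride-1 convolutions with no pooling."""
    return 1 + num_layers * (kernel_size - 1)

for n in (1, 5, 16, 128):
    print(f"{n:3d} layers -> sees {receptive_field(n):3d} pixels")
# 16 such layers already cover a 32-pixel side, but roughly 128 would be needed
# for 256 pixels; in practice pooling/striding keeps real CNNs much shallower.
```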
In the case of electrocardiogram classification:
On the basis of the papers you read, it seems that ECG data can be classified fairly easily from temporal features, which RNNs capture directly. Those temporal features help the model classify the ECGs correctly, so the RNN can stay comparatively shallow.
The CNNs end up deeper because the features extracted by individual convolution layers are not powerful enough on their own to uniquely recognize ECGs; a larger number of convolution layers is needed to extract those finer features for better classification.
In short: a strong feature keeps the model simple, whereas weaker features have to be extracted by a more complex stack of layers.
Is this because RNNs/LSTMs are harder to train if they are deeper (due to gradient vanishing problems) or because RNNs/LSTMs tend to overfit sequential data fast?
That is one possible way to think about it. LSTMs/RNNs are prone to overfitting, but, as @Ismael EL ATIFI points out in the comments, the vanishing-gradient problem is not the reason they overfit.
I thank @Ismael EL ATIFI for the corrections.
answered Jun 9 at 7:32 by Shubham Panchal (edited Jun 10 at 1:06)
"LSTM/RNNs are prone to overfitting because of vanishing gradient problem." I disagree. Overvitting can not be caused by vanishing gradient problem just because vanishing gradient prevents parameters of the early layers to be properly updated and thus to overfit. "Convolution layers don't generally overfit, they are feature extractors." Convolution layers CAN overfit just as any other trainable layer and any CNN will definetely overfit if it has too many parameters compared to the quantity and variety of data it is trained on.
– Ismael EL ATIFI
Jun 9 at 17:57
About the number of layers
The reason can be understood by looking at the architectures of a CNN and an LSTM and how they might operate on time-series data. That said, the number of layers depends heavily on the problem you are trying to solve: you might be able to solve ECG classification with a few LSTM layers, but for activity recognition from videos you will need more.
Putting that aside, here is how a CNN and an LSTM might process a time-series signal. Take a very simple signal in which every three positive cycles are followed by a negative cycle.
For a CNN to see even this simple pattern, it takes 4 layers in this example, because when a CNN processes a time-series input, each convolution output knows nothing about the previous outputs (they are not connected). An LSTM, however, can capture it with a single layer, since it can remember temporal patterns over hundreds of time steps: each output depends on the current input as well as the previous inputs the model has seen.
I am not saying this is the only reason, but it is probably one of the main factors why CNNs require more layers than LSTMs do for time-series data.
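As a rough sketch of the setup described above (my own construction, not the answerer's, assuming TensorFlow/Keras and simplifying each "cycle" to a single +1 or -1 sample), a single recurrent layer can be asked to predict the next value of such a signal:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Repeating toy pattern: three positive samples followed by one negative sample.
pattern = np.array([1.0, 1.0, 1.0, -1.0])
signal = np.tile(pattern, 250)                      # 1000 time steps

WINDOW = 8                                          # assumed context length
X = np.stack([signal[i:i + WINDOW] for i in range(len(signal) - WINDOW)])
y = signal[WINDOW:]
X = X[..., None]                                    # shape (samples, timesteps, 1)

model = tf.keras.Sequential([
    layers.Input((WINDOW, 1)),
    layers.LSTM(8),                                 # a single recurrent layer
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)                # quick demo fit
print(model.predict(X[:4], verbose=0).round(2))
```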
About vanishing gradients and overfitting
In RNNs, the vanishing gradient is more likely to become a problem within a single layer (across time steps) than across layers: when processing many sequential steps, knowledge about the first few steps tends to disappear. And I don't think sequential models are particularly likely to overfit time-series data if you regularize them correctly. So this design choice is probably influenced more by the architecture and capabilities of the models than by vanishing gradients or overfitting.
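A tiny numerical illustration of that within-layer effect (mine, not the answerer's): backpropagating through T time steps multiplies the gradient by roughly the same recurrent Jacobian T times, so its norm shrinks geometrically and the earliest inputs barely influence the updates.

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16
W = rng.normal(scale=0.3 / np.sqrt(H), size=(H, H))  # recurrent weight matrix, small scale

for T in (1, 10, 50, 100):
    g = np.ones(H)                                    # gradient arriving at the final step
    for _ in range(T):
        g = W.T @ g                                   # one backward step (nonlinearity ignored)
    print(f"backprop through {T:3d} steps: |grad| = {np.linalg.norm(g):.2e}")
```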
answered Jun 12 at 4:29 by thushv89 (new contributor)
I see two possible reasons why an RNN could need fewer layers than a CNN to reach the same performance:
- RNN layers are generally fully connected layers, which have more parameters than a convolutional layer.
- An RNN has components that combine inputs by multiplication (e.g. the gates of an LSTM), unlike a CNN, which can only combine inputs by weighted addition. This multiplicative capacity gives the RNN more "power" to compute features; a CNN would need many layers to "imitate" it (see the sketch below).
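To make the multiplicative point concrete, here is a bare-bones numpy sketch of a single LSTM step (illustrative only; the weight shapes and sizes are arbitrary). The pre-activations are plain weighted sums, exactly what a convolution computes, but the cell then combines them by elementwise multiplication:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; W, U, b hold the four gates stacked as 4*H rows."""
    H = h.shape[0]
    z = W @ x + U @ h + b                    # weighted sums, as a conv layer would compute
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g                    # <- multiplicative interactions
    h_new = o * np.tanh(c_new)               # <- another multiplication
    return h_new, c_new

rng = np.random.default_rng(0)
D, H = 3, 4                                  # arbitrary input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c, W, U, b)
print(h)
```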
answered Jun 9 at 18:31 by Ismael EL ATIFI