Why do RNNs usually have fewer hidden layers than CNNs?


CNNs can have hundreds of hidden layers, and since they are often used with image data, having many layers lets them capture more of the data's complexity.

However, as far as I have seen, RNNs usually have few layers, e.g. 2-4. For example, for electrocardiogram (ECG) classification, I've seen papers that use 4-layer LSTMs and 10-15-layer CNNs with similar results.

Is this because RNNs/LSTMs are harder to train when they are deeper (due to vanishing-gradient problems), or because RNNs/LSTMs tend to overfit sequential data quickly?
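To make the comparison concrete, here are placeholder skeletons of a 4-layer stacked LSTM next to a deeper 1-D CNN, written with tf.keras. These are not the papers' actual architectures; the sequence length, widths and number of classes are invented.

    # Placeholder skeletons only -- not the architectures from the papers.
    import tensorflow as tf
    from tensorflow.keras import layers

    def stacked_lstm(n_layers=4, n_classes=5, seq_len=1000, n_features=1):
        model = tf.keras.Sequential([tf.keras.Input(shape=(seq_len, n_features))])
        for i in range(n_layers):
            # every LSTM except the last returns the full sequence so the layers can stack
            model.add(layers.LSTM(64, return_sequences=(i < n_layers - 1)))
        model.add(layers.Dense(n_classes, activation="softmax"))
        return model

    def deep_cnn(n_layers=12, n_classes=5, seq_len=1000, n_features=1):
        model = tf.keras.Sequential([tf.keras.Input(shape=(seq_len, n_features))])
        for _ in range(n_layers):
            model.add(layers.Conv1D(32, kernel_size=3, padding="same", activation="relu"))
        model.add(layers.GlobalAveragePooling1D())
        model.add(layers.Dense(n_classes, activation="softmax"))
        return model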










Tags: deep-learning, cnn, lstm, rnn, feature-extraction






asked Jun 9 at 2:18 by KRL (new contributor); edited 2 days ago by thanatoz

3 Answers
CNN and RNN feature-extraction methods:



CNNs tend to extract spatial features. Suppose we have 10 convolution layers stacked on top of each other. The kernels of the first layer extract features from the input and produce a feature map; that feature map is then used as the input to the next convolution layer, which in turn produces its own feature map, and so on.



In this way, features are extracted level by level from the input image. A small 32 × 32 pixel input needs fewer convolution layers, while a bigger 256 × 256 image contains comparatively more complex features and therefore calls for more layers.
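One rough way to see the size effect concretely (a sketch I am adding, under the assumption that each convolution stage halves the spatial resolution via striding or pooling): count how many 2x downsampling stages are needed before the feature map shrinks to 1 x 1.

    import math

    # Assumption for illustration: each convolution "stage" halves the spatial
    # resolution (stride-2 convolution or 2x2 pooling).
    def downsampling_stages(image_size: int) -> int:
        """Number of 2x downsampling stages needed to reach a 1x1 feature map."""
        return math.ceil(math.log2(image_size))

    for size in (32, 256):
        print(f"{size}x{size} image -> {downsampling_stages(size)} stages")
    # 32x32 image -> 5 stages
    # 256x256 image -> 8 stages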



RNNs are temporal feature extractors, as they hold a memory of past activations. They extract features much like an ordinary NN, but they also remember those features across timesteps; they can even remember features that were first extracted by convolution layers. Because they carry this memory, they capture temporal features.
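As an illustration of that last point (remembering features that were first extracted by convolution layers), here is a minimal hybrid sketch in tf.keras. It is not taken from any of the papers; the input length, layer widths and class count are invented placeholders.

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(1000, 1)),          # e.g. a 1000-sample single-lead ECG (assumed)
        layers.Conv1D(16, kernel_size=5, activation="relu"),  # local feature maps
        layers.Conv1D(32, kernel_size=5, activation="relu"),  # consumes the previous feature map
        layers.LSTM(64),                           # temporal memory over the extracted features
        layers.Dense(5, activation="softmax"),     # e.g. 5 classes (assumed)
    ])
    model.summary()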




In the case of electrocardiogram classification:



On the basis of the papers you read, it seems that:

ECG data can be classified fairly easily from its temporal features, which RNNs capture directly; those temporal features are what let the model classify the ECGs correctly. Hence, the RNN model can stay relatively simple.




The CNNs need to be more complex because the features that convolution layers extract are not, by themselves, discriminative enough to uniquely recognize ECGs; a larger number of convolution layers is therefore required to pick up those finer features for better classification.




In short: a strong feature keeps the model simple, whereas weaker features have to be extracted with additional, more complex layers.





Regarding "Is this because RNNs/LSTMs are harder to train if they are deeper (due to gradient vanishing problems) or because RNNs/LSTMs tend to overfit sequential data fast?":

This is one way to think about it. LSTMs/RNNs are prone to overfitting; see @Ismael EL ATIFI's comment below on why the vanishing-gradient problem should not be counted as a cause of that overfitting.

I thank @Ismael EL ATIFI for the corrections.






answered Jun 9 at 7:32 by Shubham Panchal; edited Jun 10 at 1:06

"LSTM/RNNs are prone to overfitting because of the vanishing gradient problem." I disagree. Overfitting cannot be caused by the vanishing-gradient problem, precisely because vanishing gradients prevent the parameters of the early layers from being properly updated, and thus from overfitting. "Convolution layers don't generally overfit, they are feature extractors." Convolution layers CAN overfit just like any other trainable layer, and any CNN will definitely overfit if it has too many parameters compared to the quantity and variety of the data it is trained on.
– Ismael EL ATIFI, Jun 9 at 17:57



















About the number of layers

The reason can be understood by looking at the architectures of a CNN and an LSTM and how they might operate on time-series data. But I should say that the number of layers depends heavily on the problem you are trying to solve: you might be able to solve an ECG classification problem with a few LSTM layers, but for activity recognition from video you will need more.



Putting that aside, here's how a CNN and an LSTM might process a time-series signal: a very simple signal in which, after three positive cycles, you get a negative cycle.



[figure: CNN vs. LSTM processing the example signal]



In this example it takes the CNN 4 layers to see this simple pattern. When a CNN processes a time-series input, one convolution output does not know about the previous outputs (they are not connected). An LSTM, however, can do it with a single layer, because it can remember temporal patterns over hundreds of time steps: each output is based on the current input as well as the previous inputs the model has seen.



I am not saying this is the only reason, but it is probably one of the main factors why CNNs require more layers than LSTMs do for time-series data.
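The original figure is not reproduced here, but the depth-versus-context trade-off can be made concrete with a quick receptive-field count (a sketch; the 9-timestep pattern span and the kernel size of 3 are assumptions, not values from the answer).

    # For stride-1 1-D convolutions with kernel size k, stacking n layers gives a
    # receptive field of n * (k - 1) + 1 timesteps. To "see" a pattern spanning,
    # say, 9 timesteps with kernel-3 convolutions, 4 layers are needed; a single
    # LSTM layer, by contrast, carries its state across arbitrarily many steps.
    def conv1d_receptive_field(num_layers: int, kernel_size: int = 3) -> int:
        return num_layers * (kernel_size - 1) + 1

    for n in range(1, 5):
        print(f"{n} conv layer(s) -> receptive field of {conv1d_receptive_field(n)} timesteps")
    # 1 -> 3, 2 -> 5, 3 -> 7, 4 -> 9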



About vanishing gradients and overfitting

In an RNN, the vanishing gradient is more likely to become a problem within a single layer (across time steps) than across layers: when processing many sequential steps, the knowledge about the first few steps tends to disappear. And I don't think sequential models are particularly likely to overfit time-series data if you regularize them correctly. So this choice is probably influenced more by the architecture and capabilities of the models than by vanishing gradients or overfitting.
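As a small illustration of "regularize them correctly" (an added sketch, not part of the answer): Keras LSTM layers expose dropout on the inputs and on the recurrent state; the rates below are arbitrary examples.

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 1)),                       # variable-length univariate series
        layers.LSTM(64, dropout=0.2, recurrent_dropout=0.2),   # input and recurrent dropout
        layers.Dense(1, activation="sigmoid"),
    ])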






answered Jun 12 at 4:29 by thushv89 (new contributor)


I see two possible reasons why an RNN could need fewer layers than a CNN to reach the same performance:

- RNN layers are generally fully connected layers, which have more parameters than a convolutional layer.

- An RNN has components (e.g. LSTM gates) that combine inputs by multiplication, unlike a CNN, which can only combine inputs by weighted addition. This multiplicative capacity gives the RNN more "power" to compute features; the CNN would need many layers to "imitate" it.
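To make the multiplicative point concrete, here is a minimal NumPy sketch of a single LSTM step (an added illustration, not part of the answer): the pre-activations are weighted sums, much like a convolution, but the gates are then combined by elementwise multiplication.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        """One LSTM step; W, U, b stack the weights for the input (i),
        forget (f) and output (o) gates and the candidate (g)."""
        z = W @ x + U @ h_prev + b               # weighted sums, as in a convolution
        i, f, o, g = np.split(z, 4)
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c_prev + i * np.tanh(g)           # elementwise multiplications (gating)
        h = o * np.tanh(c)                        # another multiplication
        return h, c

    # toy dimensions, random weights
    n_in, n_hid = 3, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4 * n_hid, n_in))
    U = rng.normal(size=(4 * n_hid, n_hid))
    b = np.zeros(4 * n_hid)
    h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, U, b)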






answered Jun 9 at 18:31 by Ismael EL ATIFI












