Why is the derivative of the activation functions in neural networks important?


I'm new to neural networks, and I am trying to understand some of their foundations. One question I have: why is the derivative of an activation function important (rather than the function itself), and why is it the derivative that is tied to how the network learns?



For instance, when we say a constant derivative isn't good for learning, what is the intuition behind that? Is the activation function somehow like a hash function that needs to distinguish small variations in its inputs well?










neural-networks machine-learning deep-learning math activation-function

asked Aug 14 at 22:30 by Tina J · edited Aug 15 at 0:06 by nbro













  • Can you cite some sources so that we can get a much more detailed picture? – DuttaA, Aug 14 at 23:15

  • towardsdatascience.com/… – Tina J, Aug 14 at 23:27










  • The "constant derivative" statement is not really correct in my opinion; at least the conclusion that a constant derivative means the model is not learning is incorrect. The author doesn't delve into details or provide a proper explanation, though, so they may have a different interpretation in mind. It is also somewhat sketchy to talk about learning when the learning objective, commonly known as the loss function, is not specified. – DuttaA, Aug 14 at 23:42


















3 Answers
6 votes

If what you are asking for is the intuition behind using the derivative in backpropagation learning, rather than an in-depth mathematical explanation:



Recall that the derivative tells you a function's sensitivity to change with respect to a change in its input. A high (absolute) value for the derivative at a certain point means that the function is very steep, and a small change in input may result in a drastic change in its output; conversely, a low absolute value means little change, so not steep at all, with the extreme case that the function is constant when the derivative is zero.
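As a quick numerical illustration, here is a minimal Python sketch (the sigmoid is just one common choice of activation) showing that the derivative is large where the function is steep and nearly zero where it saturates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative of the sigmoid

# Near z = 0 the sigmoid is steep, so small input changes move the output a lot;
# far from 0 it saturates and the derivative (the learning signal) vanishes.
for z in [-6.0, -2.0, 0.0, 2.0, 6.0]:
    print(f"z={z:+.1f}  sigma(z)={sigmoid(z):.4f}  sigma'(z)={sigmoid_prime(z):.4f}")
```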



Training a neural network essentially amounts to an optimization problem where one wants to minimize a certain value, in this case the error produced by the network on the given training examples. Backpropagation learning can be viewed as a case of gradient descent (the inverse of hill climbing).



If for a moment we assume that your input is only 2-dimensional (just for illustration, the mathematics of course also work for higher dimensions), you could imagine the error function as a landscape with hills, mountains, valleys, ridges etc. You are standing at a high point and want to get down as far as possible. Gradient descent means that, in discrete steps, you always walk down in the direction that has the steepest slope downwards from where you are currently standing, until you eventually reach a (local) minimum.



In order to determine where that steepest slope is, you need the derivative of the activation function. Basically, you want to sort out how much each unit in your network contributes to an error, and adjust in the direction that contributes the most.



Edit: Regarding constant values for the derivative: in the landscape metaphor, it would mean that the gradient is the same no matter where you are, so you'll always go in the same direction and never reach an optimum. Moreover, multi-layer networks with a linear activation function are kind of beside the point anyhow: each cell computes a linear combination of its inputs, which is then again a linear function, so the output of the last layer will ultimately be a linear function of the inputs at the first layer. That is to say, anything you can do with a multi-layer net with linear activation functions, you could also achieve with just a single layer.
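To illustrate the last point, here is a minimal sketch (layer sizes and weights are arbitrary) showing that two stacked linear layers collapse into a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the identity ("linear") activation.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def two_layer_linear(x):
    return W2 @ (W1 @ x + b1) + b2

# The equivalent single linear layer.
W = W2 @ W1
b = W2 @ b1 + b2

x = rng.normal(size=3)
print(np.allclose(two_layer_linear(x), W @ x + b))  # True: depth adds no expressiveness
```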






answered Aug 15 at 0:39 by Jens Classen · edited Aug 15 at 1:01














  • Thanks. It was a good starter explanation. I understand we want to minimize the whole loss function, but why do we need a local minimum at each function?! – Tina J, Aug 15 at 1:43

  • @Tina J: I am not sure what you are asking. You are correct that we try to find a single minimum for the error of the entire network. What backpropagation does is split the observed error up into the parts contributed by each single unit and connection. So we don't minimize at each single unit; rather, for each training example, we (potentially) adjust every edge's weight, depending on how much it affected the outcome being wrong. Each weight is a dimension of the "landscape", and one traversal of the net is a single step in gradient descent, which is repeated until reaching convergence. – Jens Classen, Aug 15 at 2:09


















4 votes

The basic (and usual) algorithm used to update the weights of an artificial neural network (ANN) is an iterative numerical optimization algorithm called gradient descent, which is based on, and requires the computation of, the derivative of the function you want to minimize. If that function is multivariable, gradient descent requires the gradient rather than the derivative: a vector whose $i$th element is the partial derivative of the function with respect to the $i$th variable. Hence the name gradient descent (the derivative of a function of one variable can be regarded as its gradient).
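As a concrete illustration of the update rule (a toy one-variable sketch, not a neural network), gradient descent repeatedly steps against the derivative:

```python
# Minimize f(w) = (w - 3)^2, whose derivative is f'(w) = 2 * (w - 3).
def f_prime(w):
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary initial value
eta = 0.1  # learning rate
for _ in range(100):
    w -= eta * f_prime(w)  # step downhill along the negative derivative

print(w)  # ~3.0, the minimizer of f
```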



In the case of ANNs, we usually have a loss function that we want to minimize, for example the mean squared error (MSE). Therefore, to apply gradient descent to find the minimum of the MSE, we need its derivative or, more precisely, its gradient. To compute it, back-propagation (an algorithm based on the chain rule) is typically used, given that the MSE is a function of the ANN, which is itself a composition of multiple non-linear functions: the activation functions, whose main purpose is to introduce non-linearity (this is what makes the ANN powerful). Since the MSE is a function of the parameters of the ANN, we need the partial derivative of the MSE with respect to every parameter. In this process, we also need the derivatives of the activation functions that each neuron applies to its linear combination of weights; to see this fully, you will need to learn the details of back-propagation. Hence the importance of the derivatives of the activation functions.
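To make the chain-rule step concrete, here is a minimal sketch of the gradient of a squared error through a single sigmoid neuron (all values are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y = 1.5, 1.0  # one training example
w, b = 0.2, 0.0  # current parameters

z = w * x + b
p = sigmoid(z)             # prediction
loss = 0.5 * (p - y) ** 2  # squared error

# Chain rule: dL/dw = dL/dp * dp/dz * dz/dw.
dL_dp = p - y
dp_dz = p * (1.0 - p)  # derivative of the activation -- this is where it enters
dz_dw = x
dL_dw = dL_dp * dp_dz * dz_dw
print(dL_dw)
```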



A constant derivative would always give the same learning signal, independently of the error, but this is not desirable.



To fully understand all these statements, I recommend you learn about back-propagation and gradient descent in detail, which requires a little bit of effort!






answered Aug 15 at 0:43 by nbro · edited Aug 15 at 1:29










  • Comments are not for extended discussion; this conversation has been moved to chat. – Ben N, Aug 15 at 19:54


















3 votes

Consider a dataset $\mathcal{D}=\{(x^{(i)},y^{(i)}) : i=1,2,\ldots,N\}$, where $x^{(i)}\in\mathbb{R}^3$ and $y^{(i)}\in\mathbb{R}$ $\forall i$.



The goal is to fit a function that best explains our dataset. We could fit a simple function, as we do in linear regression. But neural networks are different: there we fit a complex function, say:



$$\begin{align}h(x) &= h(x_1,x_2,x_3)\\ &= \sigma\big(w_{46}\times\sigma(w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4)+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6\big)\end{align}$$



where $\theta = \{w_{14},w_{24},w_{34},b_4,w_{15},w_{25},w_{35},b_5,w_{46},w_{56},b_6\}$ is the set of coefficients we have to determine so as to minimize:
$$J(\theta) = \frac{1}{2}\sum_{i=1}^N \big(y^{(i)}-h(x^{(i)})\big)^2$$
The above optimization problem can easily be solved with gradient descent: initialize $\theta$ with random values and, with a suitable learning rate $\eta$, update as follows until convergence:
$$\theta := \theta-\eta\frac{\partial J}{\partial \theta}$$



In order to get the gradients, we express the above function as a neural network:
[figure: the two-hidden-unit network computing $h(x)$]



Let's calculate the gradient, say w.r.t. $w_{14}$:
$$\frac{\partial J}{\partial w_{14}} = \sum_{i=1}^N \Big[\big(h(x^{(i)})-y^{(i)}\big)\frac{\partial h(x^{(i)})}{\partial w_{14}}\Big]$$
Let $p(x) = w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4$, and

let $q(x) = w_{46}\times\sigma(p(x))+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6$, so that $h(x)=\sigma(q(x))$.

$$\therefore \frac{\partial h(x)}{\partial w_{14}} = \frac{\partial h(x)}{\partial q(x)}\times\frac{\partial q(x)}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}} = \frac{\partial\sigma(q(x))}{\partial q(x)}\times w_{46}\frac{\partial\sigma(p(x))}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}}$$



We see that the derivative of the activation function is needed to compute the gradients, and hence for the learning of the neural network. A constant derivative does not help gradient descent, and we would not be able to learn the optimal parameters.
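Here is a minimal sketch that implements $h(x)$ above with made-up weights and data and checks the chain-rule gradient for $w_{14}$ against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=3)  # one input example (illustrative)
y = 0.7                 # its target (illustrative)
w14, w24, w34, b4 = rng.normal(size=4)
w15, w25, w35, b5 = rng.normal(size=4)
w46, w56, b6 = rng.normal(size=3)

def forward(w14_):
    p = w14_ * x[0] + w24 * x[1] + w34 * x[2] + b4
    q = w46 * sigmoid(p) + w56 * sigmoid(w15 * x[0] + w25 * x[1] + w35 * x[2] + b5) + b6
    return sigmoid(q), p, q  # h(x) and the intermediate values

def J(w14_):
    return 0.5 * (y - forward(w14_)[0]) ** 2

# Chain rule: dJ/dw14 = (h - y) * sigma'(q) * w46 * sigma'(p) * x1
h_val, p, q = forward(w14)
sigma_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))
grad = (h_val - y) * sigma_prime(q) * w46 * sigma_prime(p) * x[0]

# Finite-difference check of the analytic gradient.
eps = 1e-6
grad_fd = (J(w14 + eps) - J(w14 - eps)) / (2 * eps)
print(np.allclose(grad, grad_fd))  # True
```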






answered Aug 23 at 19:56 by babkr (new contributor) · edited Aug 24 at 5:08
















