Why is the derivative of the activation functions in neural networks important?
I'm new to neural networks and am trying to understand some of their foundations. One question I have is: why is the derivative of an activation function important (rather than the function itself), and why is it the derivative that is tied to how the network learns?
For instance, when we say a constant derivative isn't good for learning, what is the intuition behind that? Is the activation function somehow like a hash function that needs to differentiate well between small variations in its inputs?
neural-networks machine-learning deep-learning math activation-function
asked Aug 14 at 22:30 – Tina J · edited Aug 15 at 0:06 – nbro
Can you cite some sources so that we can get a much more detailed picture? – DuttaA, Aug 14 at 23:15
towardsdatascience.com/… – Tina J, Aug 14 at 23:27
The 'constant derivative' statement is not really correct in my opinion, or at least the conclusion that a constant derivative means the model is not learning is incorrect. But the author doesn't delve into details or provide a proper explanation, so they may have a different interpretation in mind. It is also somewhat sketchy to talk about learning when the learning objective, commonly known as the loss function, is not specified. – DuttaA, Aug 14 at 23:42
3 Answers
If what you are asking for is the intuition behind using the derivative in backpropagation learning, rather than an in-depth mathematical explanation:
Recall that the derivative tells you a function's sensitivity to change with respect to a change in its input. A high (absolute) value for the derivative at a certain point means that the function is very steep, and a small change in input may result in a drastic change in its output; conversely, a low absolute value means little change, so not steep at all, with the extreme case that the function is constant when the derivative is zero.
Training a neural network essentially amounts to an optimization problem where one wants to minimize a certain value, in this case the error produced by the network on the given training examples. Backpropagation learning can be viewed as a case of gradient descent (the inverse of hill climbing).
If for a moment we assume that your input is only 2-dimensional (just for illustration, the mathematics of course also work for higher dimensions), you could imagine the error function as a landscape with hills, mountains, valleys, ridges etc. You are standing at a high point and want to get down as far as possible. Gradient descent means that, in discrete steps, you always walk down in the direction that has the steepest slope downwards from where you are currently standing, until you eventually reach a (local) minimum.
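To make the landscape picture concrete, here is a minimal sketch of gradient descent on a made-up two-dimensional error surface; the bowl-shaped error function, the starting point, and the learning rate are purely illustrative:

```python
import numpy as np

# Toy "error landscape" over two weights: a bowl with its minimum at (1, -2).
def error(w):
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

# Its gradient points in the direction of steepest *ascent*,
# so we step in the opposite direction.
def grad(w):
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])   # start at a high point on the landscape
eta = 0.1                  # learning rate: the size of each downhill step
for step in range(100):
    w = w - eta * grad(w)  # walk in the direction of steepest descent

print(w, error(w))         # w approaches (1, -2), where the error is minimal
```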
In order to determine where that steepest slope is, you need the derivative of the activation function. Basically, you want to work out how much each unit in your network contributes to the error, and adjust each one in the direction that reduces that contribution the most.
Edit: Regarding constant values for the derivative: in the landscape metaphor, it would mean that the gradient is the same no matter where you are, so you'll always go in the same direction and never reach an optimum. However, multi-layer networks with linear activation functions are rather beside the point anyway when you consider that each cell computes a linear combination of its inputs, which is then again a linear function, so the output of the last layer will ultimately be a linear function of the inputs at the first layer. That is to say, anything you can do with a multi-layer net with linear activation functions, you could also achieve with just a single layer, as the short check below demonstrates.
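This collapse of stacked linear layers into a single one is easy to verify numerically; a small sketch (with made-up layer shapes) follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with identity (linear) activation: h(x) = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same map expressed as a single linear layer: W = W2 @ W1, b = W2 @ b1 + b2
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: no expressiveness was gained
```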
answered Aug 15 at 0:39, edited Aug 15 at 1:01 – Jens Classen
Thanks, that was a good starter explanation. I understand we want to minimize the whole loss function. But why do we need a local minimum at each function?! – Tina J, Aug 15 at 1:43
@Tina J: I am not sure what you are asking. You are correct that we try to find a single minimum for the error of the entire network. What backpropagation does is split the observed error up into the parts contributed by each single unit and connection. So we don't minimize at each single unit; rather, for each training example, we (potentially) adjust every edge's weight, depending on how much it affected the outcome being wrong. Each weight is a dimension of the "landscape", and one traversal of the net is a single step in gradient descent, which is repeated until convergence. – Jens Classen, Aug 15 at 2:09
The basic (and usual) algorithm used to update the weights of an artificial neural network (ANN) is an iterative numerical optimization algorithm called gradient descent, which is based on, and requires the computation of, the derivative of the function you want to minimize. If that function is multivariable, then, rather than the derivative, gradient descent requires the gradient: a vector whose $i$th element is the partial derivative of the function with respect to the $i$th variable. Hence the name gradient descent; the derivative of a function of one variable can be considered the gradient of that function.
In the case of ANNs, we usually have a loss function that we want to minimize, for example the mean squared error (MSE). Therefore, in order to apply gradient descent to find the minimum of the MSE, we need to find its derivative or, more precisely, its gradient. To do this, back-propagation (an algorithm based on the chain rule) is often used, given that the MSE is a function of the ANN, which is itself a composition of multiple non-linear functions: the activation functions, whose main purpose is to introduce non-linearity, which is what makes the ANN powerful. Given that the MSE is a function of the parameters of the ANN, we need to find its partial derivative with respect to each of those parameters. In this process, we also need the derivatives of the activation functions that each neuron applies to its linear combination of inputs; to see this fully, you will need to learn the details of back-propagation. Hence the importance of the derivatives of the activation functions.
A constant derivative would always give the same learning signal, independent of the error, which is not desirable.
To fully understand all these statements, I recommend you learn about back-propagation and gradient descent in detail, which requires a little bit of effort!
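As a small illustration of where the activation derivative enters (a sketch added here, not part of the original answer), consider a single sigmoid neuron trained on a squared-error loss with one made-up example; back-propagating through it via the chain rule requires $\sigma'(z)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):                     # derivative of the activation function
    s = sigmoid(z)
    return s * (1.0 - s)

# One neuron: prediction = sigmoid(w * x + b); loss = 0.5 * (prediction - y)^2
x, y = 0.7, 1.0           # made-up training example
w, b = 0.1, 0.0           # made-up initial parameters

z = w * x + b
pred = sigmoid(z)

# Chain rule: dL/dw = dL/dpred * dpred/dz * dz/dw
dL_dpred = pred - y
dpred_dz = sigmoid_prime(z)               # <-- this is why sigma' matters
dL_dw = dL_dpred * dpred_dz * x
dL_db = dL_dpred * dpred_dz

eta = 0.5
w, b = w - eta * dL_dw, b - eta * dL_db   # one gradient-descent update
print(dL_dw, dL_db)
```

If $\sigma'$ were a constant, `dpred_dz` would carry no information about where on the curve the neuron currently sits, which is the point made above about a constant derivative.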
answered Aug 15 at 0:43, edited Aug 15 at 1:29 – nbro
Comments are not for extended discussion; this conversation has been moved to chat. – Ben N♦, Aug 15 at 19:54
Consider a dataset $\mathcal{D}=\{(x^{(i)},y^{(i)}) : i=1,2,\ldots,N\}$ where $x^{(i)}\in\mathbb{R}^3$ and $y^{(i)}\in\mathbb{R}$ for all $i$.
The goal is to fit a function that best explains our dataset. We could fit a simple function, as we do in linear regression. But that's what's different about neural networks: there we fit a complex function, say
$$\begin{align}h(x) &= h(x_1,x_2,x_3)\\ &= \sigma\big(w_{46}\times\sigma(w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4)+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6\big)\end{align}$$
where $\theta = \{w_{14},w_{24},w_{34},b_4,w_{15},w_{25},w_{35},b_5,w_{46},w_{56},b_6\}$ is the set of coefficients we have to determine such that we minimize
$$J(\theta) = \frac{1}{2}\sum_{i=1}^N \big(y^{(i)}-h(x^{(i)})\big)^2$$
The above optimization problem can easily be solved with gradient descent: initialize $\theta$ with random values and, with a suitable learning rate $\eta$, update as follows until convergence:
$$\theta := \theta - \eta\frac{\partial J}{\partial \theta}$$
In order to get the gradients, we express the above function as a neural network. Let's calculate the gradient, say with respect to $w_{14}$:
$$\frac{\partial J}{\partial w_{14}} = \sum_{i=1}^N \Big[\big(h(x^{(i)})-y^{(i)}\big)\frac{\partial h(x^{(i)})}{\partial w_{14}}\Big]$$
Let $p(x) = w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4$, and let $q(x) = w_{46}\times\sigma(p(x))+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6$, so that $h(x)=\sigma(q(x))$.
$$\therefore\ \frac{\partial h(x)}{\partial w_{14}} = \frac{\partial h(x)}{\partial q(x)}\times\frac{\partial q(x)}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}} = \frac{\partial\sigma(q(x))}{\partial q(x)}\times w_{46}\,\frac{\partial\sigma(p(x))}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}}$$
We see that the derivative of the activation function is essential for getting the gradients, and hence for the learning of the neural network. A constant derivative would not help gradient descent, and we would not be able to learn the optimal parameters. A numerical check of the gradient above is sketched below.
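The chain-rule expression above can be verified against a finite-difference estimate. Here is a minimal sketch of that check for $\partial J/\partial w_{14}$, with made-up data and $\sigma$ assumed to be the logistic sigmoid (the original answer leaves $\sigma$ unspecified):

```python
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigma = lambda z: sigma(z) * (1.0 - sigma(z))   # derivative of the activation

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # made-up dataset: N=5 examples, x in R^3
Y = rng.normal(size=5)

# Parameters theta, randomly initialized
w14, w24, w34, b4 = rng.normal(size=4)
w15, w25, w35, b5 = rng.normal(size=4)
w46, w56, b6 = rng.normal(size=3)

def h(x, w14=w14):
    p = w14 * x[0] + w24 * x[1] + w34 * x[2] + b4
    r = w15 * x[0] + w25 * x[1] + w35 * x[2] + b5
    return sigma(w46 * sigma(p) + w56 * sigma(r) + b6)

def J(w14=w14):
    return 0.5 * sum((Y[i] - h(X[i], w14)) ** 2 for i in range(len(Y)))

# Analytic gradient via the chain rule, including both activation derivatives
analytic = 0.0
for i in range(len(Y)):
    x = X[i]
    p = w14 * x[0] + w24 * x[1] + w34 * x[2] + b4
    r = w15 * x[0] + w25 * x[1] + w35 * x[2] + b5
    q = w46 * sigma(p) + w56 * sigma(r) + b6
    analytic += (h(x) - Y[i]) * dsigma(q) * w46 * dsigma(p) * x[0]

# Finite-difference estimate for comparison
eps = 1e-6
numeric = (J(w14 + eps) - J(w14 - eps)) / (2 * eps)
print(analytic, numeric)           # the two agree to several decimal places
```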
answered Aug 23 at 19:56, edited Aug 24 at 5:08 – babkr (new contributor)