Why is the derivative of the activation functions in neural networks important?


I'm new to neural networks, and I am trying to understand some of their foundations. One question I have is: why is the derivative of an activation function important (rather than the function itself), and why is it the derivative that is tied to how the network performs learning?



For instance, when we say a constant derivative isn't good for learning, what is the intuition behind that? Is the activation function somehow like a hash function that needs to differentiate well between small variations in its inputs?










neural-networks machine-learning deep-learning math activation-function

asked Aug 14 at 22:30 by Tina J; edited Aug 15 at 0:06 by nbro
• Can you cite some sources so that we can get a much more detailed picture? – DuttaA, Aug 14 at 23:15










• towardsdatascience.com/… – Tina J, Aug 14 at 23:27










• The 'constant...' statement is not really correct in my opinion, or at least the conclusion that a constant derivative means the model is not learning is incorrect. The author doesn't delve into details or provide a proper explanation, so they may have a different interpretation in mind. It is also somewhat sketchy to talk about learning when the learning objective, commonly known as the loss function, is not specified. – DuttaA, Aug 14 at 23:42


















3 Answers
Answer by Jens Classen (score 6; answered Aug 15 at 0:39, edited Aug 15 at 1:01):

If what you are asking for is the intuition behind using the derivative in backpropagation learning, rather than an in-depth mathematical explanation:



Recall that the derivative tells you a function's sensitivity to change with respect to a change in its input. A high (absolute) value for the derivative at a certain point means that the function is very steep, and a small change in input may result in a drastic change in its output; conversely, a low absolute value means little change, so not steep at all, with the extreme case that the function is constant when the derivative is zero.
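As a quick numeric illustration of this sensitivity, here is a minimal Python sketch (mine, using $\tanh$ as an example activation): near the steep point of $\tanh$ a small input change moves the output a lot, while in the flat tail it barely moves at all.

```python
import numpy as np

dz = 0.01                                  # a small change in the input
for z in (0.0, 3.0):                       # steep point vs. flat tail of tanh
    slope = 1.0 - np.tanh(z) ** 2          # derivative of tanh at z
    change = np.tanh(z + dz) - np.tanh(z)  # actual change in the output
    print(f"z={z}: derivative={slope:.4f}, output change={change:.6f}")
```

At $z=0$ the derivative is 1 and the output moves by roughly the full input step; at $z=3$ the derivative is about 0.01 and the output barely moves.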



Training a neural network essentially amounts to an optimization problem where one wants to minimize a certain value, in this case the error produced by the network on the given training examples. Backpropagation learning can be viewed as a case of gradient descent (the inverse of hill climbing).



If for a moment we assume that your input is only 2-dimensional (just for illustration, the mathematics of course also work for higher dimensions), you could imagine the error function as a landscape with hills, mountains, valleys, ridges etc. You are standing at a high point and want to get down as far as possible. Gradient descent means that, in discrete steps, you always walk down in the direction that has the steepest slope downwards from where you are currently standing, until you eventually reach a (local) minimum.



In order to determine where that steepest slope is, you need the derivative of the activation function. Basically, you want to sort out how much each unit in your network contributes to an error, and adjust in the direction that contributes the most.
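To make the descent procedure concrete, here is a small sketch of mine, with an assumed toy error surface $E(w_1, w_2) = w_1^2 + 2w_2^2$ standing in for the real network error:

```python
import numpy as np

def error(w):                 # toy "error landscape" over two weights
    return w[0] ** 2 + 2 * w[1] ** 2

def grad(w):                  # its gradient: the direction of steepest ascent
    return np.array([2 * w[0], 4 * w[1]])

w = np.array([3.0, -2.0])     # start from some high point on the landscape
eta = 0.1                     # learning rate (step size)
for _ in range(200):
    w = w - eta * grad(w)     # step downhill, against the gradient

print(w, error(w))            # w ends up near the minimum at (0, 0)
```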



Edit: Regarding constant values for a derivative, in the landscape metaphor this would mean that the gradient is the same no matter where you are, so you would always go in the same direction and never reach an optimum. However, multi-layer networks with linear activation functions are kind of beside the point anyhow when you consider that each cell computes a linear combination of its inputs, which is then again a linear function, so the output of the last layer will ultimately be a linear function of the inputs at the first layer. That is to say, anything you can do with a multi-layer net with linear activation functions, you could also achieve with just a single layer.
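The last point is easy to verify numerically; a small sketch of mine (weight shapes chosen arbitrarily) showing that two stacked linear layers collapse into one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers with the identity (linear) "activation":
#   h(x) = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same function expressed as a single linear layer W @ x + b:
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True: no expressive power gained
```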






• Thanks. It was a good starter explanation. I understand we want to minimize the whole loss function. But why do we need a local minimum at each function?! – Tina J, Aug 15 at 1:43






• @Tina J: I am not sure what you are asking. You are correct that we try to find a single minimum for the error of the entire network. What backpropagation does is split the observed error up into the parts contributed by each single unit and connection. So we don't minimize at each single unit; rather, for each training example, we (potentially) adjust every edge's weight, depending on how much it affected the outcome being wrong. Each weight is a dimension of the "landscape", and one traversal of the net is a single step in gradient descent, which is repeated until reaching convergence. – Jens Classen, Aug 15 at 2:09


















Answer by nbro (score 4; answered Aug 15 at 0:43, edited Aug 15 at 1:29):

The basic (and usual) algorithm used to update the weights of an artificial neural network (ANN) is an iterative numerical optimization algorithm called gradient descent, which is based on and requires the computation of the derivative of the function you want to find the minimum of. If that function is multivariable, then, rather than the derivative, gradient descent requires the gradient, a vector whose $i$th element is the partial derivative of the function with respect to the $i$th variable. Hence the name gradient descent; the derivative of a function of one variable can be considered the gradient of that function.



In the case of ANNs, we usually have a loss function that we want to minimize: for example, the mean squared error (MSE). Therefore, in order to apply gradient descent to find the minimum of the MSE, we need to find the derivative or, more precisely, the gradient of the MSE. To do this, back-propagation (an algorithm based on the chain rule) is often used, given that the MSE is a function of the ANN, which is a composite function of multiple non-linear functions: the activation functions, whose main purpose is thus to introduce non-linearity, which in turn is what makes the ANN powerful. Given that the MSE is a function of the parameters of the ANN, we need to find the partial derivative of the MSE with respect to every parameter of the ANN. In this process, we will also need the derivatives of the activation functions that each neuron applies to its linear combination of weights: to fully see this, you will need to learn the details of back-propagation! Hence the importance of the derivatives of the activation functions.



A constant derivative would always give the same learning signal, independently of where on its activation function the unit currently operates, and this is not desirable.



To fully understand all these statements, I recommend you learn about back-propagation and gradient descent in detail, which requires a little bit of effort!
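As a concrete illustration of these statements, here is a minimal sketch of mine (assuming a single sigmoid neuron with squared-error loss, not the full back-propagation algorithm); note how the derivative of the activation, $\sigma'(z)$, appears as a factor in every parameter's gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):              # derivative of the activation function
    s = sigmoid(z)
    return s * (1.0 - s)

# One neuron: y_hat = sigmoid(w @ x + b), loss = 0.5 * (y_hat - y)^2
x, y = np.array([0.5, -1.0]), 1.0
w, b = np.array([0.1, 0.2]), 0.0

z = w @ x + b
y_hat = sigmoid(z)

# Chain rule: dL/dw = (y_hat - y) * sigmoid'(z) * x, and similarly for b.
grad_w = (y_hat - y) * sigmoid_prime(z) * x
grad_b = (y_hat - y) * sigmoid_prime(z)

eta = 0.5                          # one gradient-descent update
w, b = w - eta * grad_w, b - eta * grad_b
print(w, b)
```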






• Comments are not for extended discussion; this conversation has been moved to chat. – Ben N, Aug 15 at 19:54


















Answer by babkr (score 3; answered Aug 23 at 19:56, edited Aug 24 at 5:08; babkr is a new contributor):

Consider a dataset $\mathcal{D}=\{(x^{(i)},y^{(i)}): i=1,2,\ldots,N\}$ where $x^{(i)}\in\mathbb{R}^3$ and $y^{(i)}\in\mathbb{R}$ $\forall i$.

The goal is to fit a function that best explains our dataset. We can fit a simple function, as we do in linear regression. But that is what is different about neural networks: there we fit a complex function, say:

$$\begin{align}h(x) & = h(x_1,x_2,x_3)\\
& = \sigma\big(w_{46}\times\sigma(w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4)+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6\big)\end{align}$$

where $\theta = \{w_{14},w_{24},w_{34},b_4,w_{15},w_{25},w_{35},b_5,w_{46},w_{56},b_6\}$ is the set of the respective coefficients we have to determine such that we minimize
$$J(\theta) = \frac{1}{2}\sum_{i=1}^N \big(y^{(i)}-h(x^{(i)})\big)^2$$
The above optimization problem can be easily solved with gradient descent: just initialize $\theta$ with random values and, with a proper learning rate $\eta$, update as follows until convergence:
$$\theta := \theta-\eta\frac{\partial J}{\partial \theta}$$

In order to get the gradients, we express the above function as a neural network as follows:

[image: diagram of the corresponding network, with inputs $x_1,x_2,x_3$ feeding hidden units 4 and 5, which feed output unit 6]

Let's calculate the gradient, say w.r.t. $w_{14}$:
$$\frac{\partial J}{\partial w_{14}} = \sum_{i=1}^N \Big[\big(h(x^{(i)})-y^{(i)}\big)\frac{\partial h(x^{(i)})}{\partial w_{14}}\Big]$$
Let $p(x) = w_{14}x_1+w_{24}x_2+w_{34}x_3+b_4$, and

let $q(x) = w_{46}\times\sigma(p(x))+w_{56}\times\sigma(w_{15}x_1+w_{25}x_2+w_{35}x_3+b_5)+b_6$, so that $h(x)=\sigma(q(x))$.

$$\therefore\ \frac{\partial h(x)}{\partial w_{14}} = \frac{\partial h(x)}{\partial q(x)}\times\frac{\partial q(x)}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}} = \frac{\partial\sigma(q(x))}{\partial q(x)}\times w_{46}\,\frac{\partial\sigma(p(x))}{\partial p(x)}\times\frac{\partial p(x)}{\partial w_{14}}$$

We see that the derivative of the activation function is important for getting the gradients, and hence for the learning of the neural network. A constant derivative will not help gradient descent, and we won't be able to learn the optimal parameters.
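To sanity-check the chain-rule expression above, here is a small NumPy sketch of mine (taking $\sigma$ to be the logistic sigmoid and picking arbitrary weight values) that compares the analytic $\partial h/\partial w_{14}$ against a finite-difference estimate:

```python
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

# Arbitrary values for the parameters of the network in the answer.
w14, w24, w34, b4 = 0.3, -0.1, 0.5, 0.2
w15, w25, w35, b5 = -0.4, 0.6, 0.1, -0.3
w46, w56, b6 = 0.7, -0.2, 0.1

def h(x, w14=w14):
    p = w14 * x[0] + w24 * x[1] + w34 * x[2] + b4
    q = w46 * sigma(p) + w56 * sigma(w15 * x[0] + w25 * x[1] + w35 * x[2] + b5) + b6
    return sigma(q)

x = [1.0, 2.0, -1.0]

# Analytic gradient via the chain rule: dh/dw14 = sigma'(q) * w46 * sigma'(p) * x1
p = w14 * x[0] + w24 * x[1] + w34 * x[2] + b4
q = w46 * sigma(p) + w56 * sigma(w15 * x[0] + w25 * x[1] + w35 * x[2] + b5) + b6
analytic = sigma_prime(q) * w46 * sigma_prime(p) * x[0]

# Finite-difference estimate for comparison.
eps = 1e-6
numeric = (h(x, w14 + eps) - h(x, w14 - eps)) / (2 * eps)

print(np.isclose(analytic, numeric))  # True: the chain rule checks out
```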





