Does the Jensen-Shannon divergence maximise likelihood?Maximum Likelihood Estimation with Known Parameter DistributionKLDIV Kullback-Leibler or Jensen-Shannon divergence between two distributionsJensen Shannon Divergence vs Kullback-Leibler Divergence?“weight” input in glm.nb function in R. How exactly does the weight affect the likelihood?Can you write a probability based on the relative entropy?Jensen-Shannon Divergence for multiple probability distributions?MLE: Marginal vs Full LikelihoodWhat Is Meant by “Maximising” Posterior Probability?How is $P(D;theta) = P(D|theta)$?skew G-Jensen-Shannon divergence between multivariate gaussian calculation discrepancy

If prion is a protein. Why is it not disassembled by the digestive system?

How do I tell my manager that his code review comment is wrong?

Should one double the thirds or the fifth in chords?

Which industry am I working in? Software development or financial services?

How can I get a job without pushing my family's income into a higher tax bracket?

What happens if I start too many background jobs?

Catholic vs Protestant Support for Nazism in Germany

Airbnb - host wants to reduce rooms, can we get refund?

Why wasn't the Night King naked in S08E03?

LightGBM results differently depending on the order of the data

Was there ever a Kickstart that took advantage of 68020+ instructions that would work on an A2000?

Quoting Yourself

What is it called when you multiply something eight times?

What happens to matryoshka Mordenkainen's Magnificent Mansions?

Manager is threatning to grade me poorly if I don't complete the project

Missed the connecting flight, separate tickets on same airline - who is responsible?

For a benzene shown in a skeletal structure, what does a substituent to the center of the ring mean?

Moving the subject of the sentence into a dangling participle

Where can I go to avoid planes overhead?

How would adding a darkvision racial trait to Dragonborn affect balance?

In Avengers 1, why does Thanos need Loki?

Besides the up and down quark, what other quarks are present in daily matter around us?

If Earth is tilted, why is Polaris always above the same spot?

In a vacuum triode, what prevents the grid from acting as another anode?



Does the Jensen-Shannon divergence maximise likelihood?


Maximum Likelihood Estimation with Known Parameter DistributionKLDIV Kullback-Leibler or Jensen-Shannon divergence between two distributionsJensen Shannon Divergence vs Kullback-Leibler Divergence?“weight” input in glm.nb function in R. How exactly does the weight affect the likelihood?Can you write a probability based on the relative entropy?Jensen-Shannon Divergence for multiple probability distributions?MLE: Marginal vs Full LikelihoodWhat Is Meant by “Maximising” Posterior Probability?How is $P(D;theta) = P(D|theta)$?skew G-Jensen-Shannon divergence between multivariate gaussian calculation discrepancy






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








3












$begingroup$


Minimising the KL divergence between your model distribution and the true data distribution is equivalent to maximising the (log-) likelihood.



In machine learning, we often want to create a model with some parameter(s) $theta$ that maximises the likelihood of some distribution. I have a couple of questions regarding how minimising other divergence measures optimise our model. In particular:



1) Does the Jensen Shannon Divergence also maximise the likelihood? If not what does it maximise?



2) Does the reverse KL divergence also maximise the likelihood? If not what does it maximise?



Edit:



As you can see from the figure below from this paper , the KL and JSD have different optimal solutions, so if minimising the KL is equivalent to optimising the likelihood, then the same cannot necessarily be the case for JSD.



enter image description here










share|cite|improve this question











$endgroup$


















    3












    $begingroup$


    Minimising the KL divergence between your model distribution and the true data distribution is equivalent to maximising the (log-) likelihood.



    In machine learning, we often want to create a model with some parameter(s) $theta$ that maximises the likelihood of some distribution. I have a couple of questions regarding how minimising other divergence measures optimise our model. In particular:



    1) Does the Jensen Shannon Divergence also maximise the likelihood? If not what does it maximise?



    2) Does the reverse KL divergence also maximise the likelihood? If not what does it maximise?



    Edit:



    As you can see from the figure below from this paper , the KL and JSD have different optimal solutions, so if minimising the KL is equivalent to optimising the likelihood, then the same cannot necessarily be the case for JSD.



    enter image description here










    share|cite|improve this question











    $endgroup$














      3












      3








      3


      3



      $begingroup$


      Minimising the KL divergence between your model distribution and the true data distribution is equivalent to maximising the (log-) likelihood.



      In machine learning, we often want to create a model with some parameter(s) $theta$ that maximises the likelihood of some distribution. I have a couple of questions regarding how minimising other divergence measures optimise our model. In particular:



      1) Does the Jensen Shannon Divergence also maximise the likelihood? If not what does it maximise?



      2) Does the reverse KL divergence also maximise the likelihood? If not what does it maximise?



      Edit:



      As you can see from the figure below from this paper , the KL and JSD have different optimal solutions, so if minimising the KL is equivalent to optimising the likelihood, then the same cannot necessarily be the case for JSD.



      enter image description here










      share|cite|improve this question











      $endgroup$




      Minimising the KL divergence between your model distribution and the true data distribution is equivalent to maximising the (log-) likelihood.



      In machine learning, we often want to create a model with some parameter(s) $theta$ that maximises the likelihood of some distribution. I have a couple of questions regarding how minimising other divergence measures optimise our model. In particular:



      1) Does the Jensen Shannon Divergence also maximise the likelihood? If not what does it maximise?



      2) Does the reverse KL divergence also maximise the likelihood? If not what does it maximise?



      Edit:



      As you can see from the figure below from this paper , the KL and JSD have different optimal solutions, so if minimising the KL is equivalent to optimising the likelihood, then the same cannot necessarily be the case for JSD.



      enter image description here







      machine-learning maximum-likelihood kullback-leibler log-likelihood






      share|cite|improve this question















      share|cite|improve this question













      share|cite|improve this question




      share|cite|improve this question








      edited Apr 27 at 11:55









      gui11aume

      11.2k23684




      11.2k23684










      asked Apr 27 at 10:16









      MellowMellow

      16218




      16218




















          1 Answer
          1






          active

          oldest

          votes


















          4












          $begingroup$

          First, it is important to clarify a few things.



          1. The KL divergence is a dissimilarity between two distributions, so it cannot maximize the likelihood, which is a function of a single distribution.

          2. Given a reference distribution $P(cdot)$, the value of $theta$ that minimizes $textKL(P(cdot)||Q(cdot|theta))$ is not the one that maximizes the likelihood. Actually, there is no likelihood because there is no observed value.

          So, saying that minimizing the KL divergence is equivalent to maximizing the log-likelihood can only mean that choosing $hattheta$ so as to maximize $Q(x_1, ldots, x_n|theta)$, ensures that $ hattheta rightarrow theta^*$, where



          $$theta^* = textargmin_theta text KL(P(cdot)||Q(cdot|theta)).$$



          This is true under some usual regularity conditions. To see this, assume that we compute $Q(x_1, ldots, x_n|theta)$, but the sample $x_1, ldots, x_n$ is actually drawn from $P(cdot)$. The expected value of the log-likelihood is then



          $$int P(x_1, ldots, x_n) log Q(x_1, ldots, x_n|theta) dx_1 ldots dx_n.$$



          Maximizing this value with respect to $theta$ is he same as minimizing



          $$textKL(P(cdot)||Q(cdot|theta)) = int P(x_1, ldots, x_n) log fracP(x_1, ldots, x_n)theta)dx_1 ldots dx_n.$$



          This is not an actual proof, but this gives you the main idea. Now, there is no reason why $theta^*$ should also minimize



          $$textKL(Q(cdot|theta)||P(cdot)) = int Q(x_1, ldots, x_n|theta) log fractheta)P(x_1, ldots, x_n)dx_1 ldots dx_n.$$



          Your question actually provides a counter-example of this, so it is clear that the value of $theta$ that minimizes the reverse KL divergence is in general not the same as the maximum likelihood estimate (and thus the same goes for the Jensen-Shannon divergence).



          What those values minimize is not so well defined. From the argument above, you can see that the minimum of the reverse KL divergence corresponds to computing the likelihood as $P(x_1, ldots, x_n)$ when $x_1, ldots, x_n$ is actually drawn from $Q(cdot|theta)$, while trying to keep the entropy of $Q(cdot|theta)$ as high as possible. The interpretation is not straightforward, but we can think of it as trying to find a "simple" distribution $Q(cdot|theta)$ that would "explain" the observations $x_1, ldots, x_n$ coming from a more complex distribution $P(cdot)$. This is a typical task of variational inference.



          The Jensen-Shannon divergence is the average of the two, so one can think of finding a minimum as "a little bit of both", meaning something in between the maximum likelihood estimate and a "simple explanation" for the data.






          share|cite|improve this answer











          $endgroup$








          • 1




            $begingroup$
            Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
            $endgroup$
            – Mellow
            Apr 27 at 11:19







          • 1




            $begingroup$
            Hi @gui11aume, I have updated my original post to add the figure.
            $endgroup$
            – Mellow
            Apr 27 at 11:54






          • 1




            $begingroup$
            OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
            $endgroup$
            – gui11aume
            Apr 27 at 12:01






          • 1




            $begingroup$
            Thanks and btw your blog looks quite cool
            $endgroup$
            – Mellow
            Apr 27 at 12:17






          • 1




            $begingroup$
            Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
            $endgroup$
            – Mellow
            2 days ago












          Your Answer








          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "65"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f405355%2fdoes-the-jensen-shannon-divergence-maximise-likelihood%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          4












          $begingroup$

          First, it is important to clarify a few things.



          1. The KL divergence is a dissimilarity between two distributions, so it cannot maximize the likelihood, which is a function of a single distribution.

          2. Given a reference distribution $P(cdot)$, the value of $theta$ that minimizes $textKL(P(cdot)||Q(cdot|theta))$ is not the one that maximizes the likelihood. Actually, there is no likelihood because there is no observed value.

          So, saying that minimizing the KL divergence is equivalent to maximizing the log-likelihood can only mean that choosing $hattheta$ so as to maximize $Q(x_1, ldots, x_n|theta)$, ensures that $ hattheta rightarrow theta^*$, where



          $$theta^* = textargmin_theta text KL(P(cdot)||Q(cdot|theta)).$$



          This is true under some usual regularity conditions. To see this, assume that we compute $Q(x_1, ldots, x_n|theta)$, but the sample $x_1, ldots, x_n$ is actually drawn from $P(cdot)$. The expected value of the log-likelihood is then



          $$int P(x_1, ldots, x_n) log Q(x_1, ldots, x_n|theta) dx_1 ldots dx_n.$$



          Maximizing this value with respect to $theta$ is he same as minimizing



          $$textKL(P(cdot)||Q(cdot|theta)) = int P(x_1, ldots, x_n) log fracP(x_1, ldots, x_n)theta)dx_1 ldots dx_n.$$



          This is not an actual proof, but this gives you the main idea. Now, there is no reason why $theta^*$ should also minimize



          $$textKL(Q(cdot|theta)||P(cdot)) = int Q(x_1, ldots, x_n|theta) log fractheta)P(x_1, ldots, x_n)dx_1 ldots dx_n.$$



          Your question actually provides a counter-example of this, so it is clear that the value of $theta$ that minimizes the reverse KL divergence is in general not the same as the maximum likelihood estimate (and thus the same goes for the Jensen-Shannon divergence).



          What those values minimize is not so well defined. From the argument above, you can see that the minimum of the reverse KL divergence corresponds to computing the likelihood as $P(x_1, ldots, x_n)$ when $x_1, ldots, x_n$ is actually drawn from $Q(cdot|theta)$, while trying to keep the entropy of $Q(cdot|theta)$ as high as possible. The interpretation is not straightforward, but we can think of it as trying to find a "simple" distribution $Q(cdot|theta)$ that would "explain" the observations $x_1, ldots, x_n$ coming from a more complex distribution $P(cdot)$. This is a typical task of variational inference.



          The Jensen-Shannon divergence is the average of the two, so one can think of finding a minimum as "a little bit of both", meaning something in between the maximum likelihood estimate and a "simple explanation" for the data.






          share|cite|improve this answer











          $endgroup$








          • 1




            $begingroup$
            Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
            $endgroup$
            – Mellow
            Apr 27 at 11:19







          • 1




            $begingroup$
            Hi @gui11aume, I have updated my original post to add the figure.
            $endgroup$
            – Mellow
            Apr 27 at 11:54






          • 1




            $begingroup$
            OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
            $endgroup$
            – gui11aume
            Apr 27 at 12:01






          • 1




            $begingroup$
            Thanks and btw your blog looks quite cool
            $endgroup$
            – Mellow
            Apr 27 at 12:17






          • 1




            $begingroup$
            Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
            $endgroup$
            – Mellow
            2 days ago
















          4












          $begingroup$

          First, it is important to clarify a few things.



          1. The KL divergence is a dissimilarity between two distributions, so it cannot maximize the likelihood, which is a function of a single distribution.

          2. Given a reference distribution $P(cdot)$, the value of $theta$ that minimizes $textKL(P(cdot)||Q(cdot|theta))$ is not the one that maximizes the likelihood. Actually, there is no likelihood because there is no observed value.

          So, saying that minimizing the KL divergence is equivalent to maximizing the log-likelihood can only mean that choosing $hattheta$ so as to maximize $Q(x_1, ldots, x_n|theta)$, ensures that $ hattheta rightarrow theta^*$, where



          $$theta^* = textargmin_theta text KL(P(cdot)||Q(cdot|theta)).$$



          This is true under some usual regularity conditions. To see this, assume that we compute $Q(x_1, ldots, x_n|theta)$, but the sample $x_1, ldots, x_n$ is actually drawn from $P(cdot)$. The expected value of the log-likelihood is then



          $$int P(x_1, ldots, x_n) log Q(x_1, ldots, x_n|theta) dx_1 ldots dx_n.$$



          Maximizing this value with respect to $theta$ is he same as minimizing



          $$textKL(P(cdot)||Q(cdot|theta)) = int P(x_1, ldots, x_n) log fracP(x_1, ldots, x_n)theta)dx_1 ldots dx_n.$$



          This is not an actual proof, but this gives you the main idea. Now, there is no reason why $theta^*$ should also minimize



          $$textKL(Q(cdot|theta)||P(cdot)) = int Q(x_1, ldots, x_n|theta) log fractheta)P(x_1, ldots, x_n)dx_1 ldots dx_n.$$



          Your question actually provides a counter-example of this, so it is clear that the value of $theta$ that minimizes the reverse KL divergence is in general not the same as the maximum likelihood estimate (and thus the same goes for the Jensen-Shannon divergence).



          What those values minimize is not so well defined. From the argument above, you can see that the minimum of the reverse KL divergence corresponds to computing the likelihood as $P(x_1, ldots, x_n)$ when $x_1, ldots, x_n$ is actually drawn from $Q(cdot|theta)$, while trying to keep the entropy of $Q(cdot|theta)$ as high as possible. The interpretation is not straightforward, but we can think of it as trying to find a "simple" distribution $Q(cdot|theta)$ that would "explain" the observations $x_1, ldots, x_n$ coming from a more complex distribution $P(cdot)$. This is a typical task of variational inference.



          The Jensen-Shannon divergence is the average of the two, so one can think of finding a minimum as "a little bit of both", meaning something in between the maximum likelihood estimate and a "simple explanation" for the data.






          share|cite|improve this answer











          $endgroup$








          • 1




            $begingroup$
            Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
            $endgroup$
            – Mellow
            Apr 27 at 11:19







          • 1




            $begingroup$
            Hi @gui11aume, I have updated my original post to add the figure.
            $endgroup$
            – Mellow
            Apr 27 at 11:54






          • 1




            $begingroup$
            OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
            $endgroup$
            – gui11aume
            Apr 27 at 12:01






          • 1




            $begingroup$
            Thanks and btw your blog looks quite cool
            $endgroup$
            – Mellow
            Apr 27 at 12:17






          • 1




            $begingroup$
            Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
            $endgroup$
            – Mellow
            2 days ago














          4












          4








          4





          $begingroup$

          First, it is important to clarify a few things.



          1. The KL divergence is a dissimilarity between two distributions, so it cannot maximize the likelihood, which is a function of a single distribution.

          2. Given a reference distribution $P(cdot)$, the value of $theta$ that minimizes $textKL(P(cdot)||Q(cdot|theta))$ is not the one that maximizes the likelihood. Actually, there is no likelihood because there is no observed value.

          So, saying that minimizing the KL divergence is equivalent to maximizing the log-likelihood can only mean that choosing $hattheta$ so as to maximize $Q(x_1, ldots, x_n|theta)$, ensures that $ hattheta rightarrow theta^*$, where



          $$theta^* = textargmin_theta text KL(P(cdot)||Q(cdot|theta)).$$



          This is true under some usual regularity conditions. To see this, assume that we compute $Q(x_1, ldots, x_n|theta)$, but the sample $x_1, ldots, x_n$ is actually drawn from $P(cdot)$. The expected value of the log-likelihood is then



          $$int P(x_1, ldots, x_n) log Q(x_1, ldots, x_n|theta) dx_1 ldots dx_n.$$



          Maximizing this value with respect to $theta$ is he same as minimizing



          $$textKL(P(cdot)||Q(cdot|theta)) = int P(x_1, ldots, x_n) log fracP(x_1, ldots, x_n)theta)dx_1 ldots dx_n.$$



          This is not an actual proof, but this gives you the main idea. Now, there is no reason why $theta^*$ should also minimize



          $$textKL(Q(cdot|theta)||P(cdot)) = int Q(x_1, ldots, x_n|theta) log fractheta)P(x_1, ldots, x_n)dx_1 ldots dx_n.$$



          Your question actually provides a counter-example of this, so it is clear that the value of $theta$ that minimizes the reverse KL divergence is in general not the same as the maximum likelihood estimate (and thus the same goes for the Jensen-Shannon divergence).



          What those values minimize is not so well defined. From the argument above, you can see that the minimum of the reverse KL divergence corresponds to computing the likelihood as $P(x_1, ldots, x_n)$ when $x_1, ldots, x_n$ is actually drawn from $Q(cdot|theta)$, while trying to keep the entropy of $Q(cdot|theta)$ as high as possible. The interpretation is not straightforward, but we can think of it as trying to find a "simple" distribution $Q(cdot|theta)$ that would "explain" the observations $x_1, ldots, x_n$ coming from a more complex distribution $P(cdot)$. This is a typical task of variational inference.



          The Jensen-Shannon divergence is the average of the two, so one can think of finding a minimum as "a little bit of both", meaning something in between the maximum likelihood estimate and a "simple explanation" for the data.






          share|cite|improve this answer











          $endgroup$



          First, it is important to clarify a few things.



          1. The KL divergence is a dissimilarity between two distributions, so it cannot maximize the likelihood, which is a function of a single distribution.

          2. Given a reference distribution $P(cdot)$, the value of $theta$ that minimizes $textKL(P(cdot)||Q(cdot|theta))$ is not the one that maximizes the likelihood. Actually, there is no likelihood because there is no observed value.

          So, saying that minimizing the KL divergence is equivalent to maximizing the log-likelihood can only mean that choosing $hattheta$ so as to maximize $Q(x_1, ldots, x_n|theta)$, ensures that $ hattheta rightarrow theta^*$, where



          $$theta^* = textargmin_theta text KL(P(cdot)||Q(cdot|theta)).$$



          This is true under some usual regularity conditions. To see this, assume that we compute $Q(x_1, ldots, x_n|theta)$, but the sample $x_1, ldots, x_n$ is actually drawn from $P(cdot)$. The expected value of the log-likelihood is then



          $$int P(x_1, ldots, x_n) log Q(x_1, ldots, x_n|theta) dx_1 ldots dx_n.$$



          Maximizing this value with respect to $theta$ is he same as minimizing



          $$textKL(P(cdot)||Q(cdot|theta)) = int P(x_1, ldots, x_n) log fracP(x_1, ldots, x_n)theta)dx_1 ldots dx_n.$$



          This is not an actual proof, but this gives you the main idea. Now, there is no reason why $theta^*$ should also minimize



          $$textKL(Q(cdot|theta)||P(cdot)) = int Q(x_1, ldots, x_n|theta) log fractheta)P(x_1, ldots, x_n)dx_1 ldots dx_n.$$



          Your question actually provides a counter-example of this, so it is clear that the value of $theta$ that minimizes the reverse KL divergence is in general not the same as the maximum likelihood estimate (and thus the same goes for the Jensen-Shannon divergence).



          What those values minimize is not so well defined. From the argument above, you can see that the minimum of the reverse KL divergence corresponds to computing the likelihood as $P(x_1, ldots, x_n)$ when $x_1, ldots, x_n$ is actually drawn from $Q(cdot|theta)$, while trying to keep the entropy of $Q(cdot|theta)$ as high as possible. The interpretation is not straightforward, but we can think of it as trying to find a "simple" distribution $Q(cdot|theta)$ that would "explain" the observations $x_1, ldots, x_n$ coming from a more complex distribution $P(cdot)$. This is a typical task of variational inference.



          The Jensen-Shannon divergence is the average of the two, so one can think of finding a minimum as "a little bit of both", meaning something in between the maximum likelihood estimate and a "simple explanation" for the data.







          share|cite|improve this answer














          share|cite|improve this answer



          share|cite|improve this answer








          edited 2 days ago

























          answered Apr 27 at 11:10









          gui11aumegui11aume

          11.2k23684




          11.2k23684







          • 1




            $begingroup$
            Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
            $endgroup$
            – Mellow
            Apr 27 at 11:19







          • 1




            $begingroup$
            Hi @gui11aume, I have updated my original post to add the figure.
            $endgroup$
            – Mellow
            Apr 27 at 11:54






          • 1




            $begingroup$
            OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
            $endgroup$
            – gui11aume
            Apr 27 at 12:01






          • 1




            $begingroup$
            Thanks and btw your blog looks quite cool
            $endgroup$
            – Mellow
            Apr 27 at 12:17






          • 1




            $begingroup$
            Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
            $endgroup$
            – Mellow
            2 days ago













          • 1




            $begingroup$
            Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
            $endgroup$
            – Mellow
            Apr 27 at 11:19







          • 1




            $begingroup$
            Hi @gui11aume, I have updated my original post to add the figure.
            $endgroup$
            – Mellow
            Apr 27 at 11:54






          • 1




            $begingroup$
            OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
            $endgroup$
            – gui11aume
            Apr 27 at 12:01






          • 1




            $begingroup$
            Thanks and btw your blog looks quite cool
            $endgroup$
            – Mellow
            Apr 27 at 12:17






          • 1




            $begingroup$
            Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
            $endgroup$
            – Mellow
            2 days ago








          1




          1




          $begingroup$
          Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
          $endgroup$
          – Mellow
          Apr 27 at 11:19





          $begingroup$
          Thanks for your highly descriptive and informative answer. I am a little confused though by your last two sentences. If you look at the very first figure on arxiv.org/abs/1511.01844 you can see that the KLD and the JSD converge to different solutions so their optimal objectives can't be the same. I.e. they can't both be equivalent.
          $endgroup$
          – Mellow
          Apr 27 at 11:19





          1




          1




          $begingroup$
          Hi @gui11aume, I have updated my original post to add the figure.
          $endgroup$
          – Mellow
          Apr 27 at 11:54




          $begingroup$
          Hi @gui11aume, I have updated my original post to add the figure.
          $endgroup$
          – Mellow
          Apr 27 at 11:54




          1




          1




          $begingroup$
          OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
          $endgroup$
          – gui11aume
          Apr 27 at 12:01




          $begingroup$
          OK, I see your point. My answer is incomplete because it assumes that we have the same family of distributions in the KL divergence (in line with the blog post you linked to). I will update the answer.
          $endgroup$
          – gui11aume
          Apr 27 at 12:01




          1




          1




          $begingroup$
          Thanks and btw your blog looks quite cool
          $endgroup$
          – Mellow
          Apr 27 at 12:17




          $begingroup$
          Thanks and btw your blog looks quite cool
          $endgroup$
          – Mellow
          Apr 27 at 12:17




          1




          1




          $begingroup$
          Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
          $endgroup$
          – Mellow
          2 days ago





          $begingroup$
          Thanks for your updated awesome answer. I think there's a small typo - I think the reverse KL tries to increase the entropy of Q, as $KL(Q||P)$ = $E_Q[log Q] - E_Q[log P]$ = $-H[Q] - E_Q[log P]$. So minimising the KL would maximise the entropy of Q
          $endgroup$
          – Mellow
          2 days ago


















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Cross Validated!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          Use MathJax to format equations. MathJax reference.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f405355%2fdoes-the-jensen-shannon-divergence-maximise-likelihood%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown