



Is there a real-life meaning to the KMeans error?




























I am trying to understand the meaning of the error reported by sklearn's KMeans.

In the context of house-price prediction, the error of a linear regression can be interpreted as the money difference per square foot.

Is there a comparable real-life meaning for the KMeans error?
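For reference, sklearn's KMeans exposes this error as the inertia_ attribute: the sum of squared distances from each sample to its nearest cluster centre. A minimal sketch, using made-up toy data and illustrative variable names (not taken from the question):

    import numpy as np
    from sklearn.cluster import KMeans

    # Illustrative toy data: two 2-D blobs of 100 points each.
    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(0.0, 1.0, size=(100, 2)),
        rng.normal(5.0, 1.0, size=(100, 2)),
    ])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # inertia_ is the K-means "error": the sum of squared distances
    # from every point to the centroid of the cluster it was assigned to.
    print(km.inertia_)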


















Tags: machine-learning, data-mining, k-means






asked Jul 6 at 14:02 by baojieqh, edited Jul 7 at 10:21




















2 Answers


















Answer (score 3) – Yash Jakhotiya, answered Jul 6 at 15:37, edited Jul 7 at 12:24

The K-means error gives you what is known as the total intra-cluster variance.

[figure: the K-means error formula]

Intra-cluster variance measures how spread out the points within a given cluster are.

The following cluster has a high intra-cluster variance:

[figure: a sparsely spread cluster of points]

In the image below, even though the number of points is the same as in the image above, the points are densely packed and therefore have a lower intra-cluster variance:

[figure: a densely packed cluster of points]

The K-means error is the total of these individual cluster variances.
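In symbols, for clusters $C_1, \dots, C_k$ with centroids $\mu_1, \dots, \mu_k$, this total is

$$\text{error} \;=\; \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2 .$$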



Suppose that, for a given dataset, clustering 'A' forms clusters like the first image and clustering 'B' forms clusters like the second; in most cases you would choose the second one.

This does not mean the K-means error is a perfect objective to optimize when forming clusters, but it captures the essence of what clustering is after.




Code used to generate the cluster plots:



    import numpy as np
    from matplotlib import pyplot as plt

    # Sparse cluster: large covariance -> points spread widely -> high intra-cluster variance.
    sparse_samples = np.random.multivariate_normal([0, 0], [[50000, 0], [0, 50000]], size=1000)
    plt.plot(sparse_samples[:, 0], sparse_samples[:, 1], 'b+')
    axes = plt.gca()
    axes.set_xlim(-1000, 1000)
    axes.set_ylim(-1000, 1000)
    plt.show()

    # Dense cluster: smaller covariance -> points packed tightly -> low intra-cluster variance.
    dense_samples = np.random.multivariate_normal([0, 0], [[5000, 0], [0, 5000]], size=1000)
    plt.plot(dense_samples[:, 0], dense_samples[:, 1], 'r+')
    axes = plt.gca()
    axes.set_xlim(-1000, 1000)
    axes.set_ylim(-1000, 1000)
    plt.show()


In both cases, 1000 data points are sampled from a bivariate normal distribution and plotted. In the second case, the covariance matrix is reduced, which produces a denser cluster. See the NumPy documentation for np.random.multivariate_normal for details. Hope this helps!
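To tie the plots back to the error itself, one could fit KMeans to each point cloud and compare the reported inertia_ values. This is only a sketch built on the sparse_samples and dense_samples arrays from the code above:

    from sklearn.cluster import KMeans

    # With a single cluster, inertia_ is exactly the total squared distance
    # of all points to their one centroid, i.e. the intra-cluster spread.
    sparse_error = KMeans(n_clusters=1, n_init=10).fit(sparse_samples).inertia_
    dense_error = KMeans(n_clusters=1, n_init=10).fit(dense_samples).inertia_

    # The sparse cloud should show a roughly 10x larger error, matching its
    # 10x larger covariance.
    print(sparse_error, dense_error)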


















• baojieqh (Jul 7 at 10:25): Thanks for your excellent explanation. Do you mind posting the code for the two figures?

• Yash Jakhotiya (Jul 7 at 12:25): Edited the answer to include the code used for plotting :)


















Answer (score 0) – Leevo, answered Jul 6 at 15:35

The meaning of "error" in k-means clustering is: how much discrepancy, or loss of information, you would incur if you substituted the k centroids for the actual observations. In other words: how well the k centroids approximate your data.

There are several ways to measure this "error". Usually the percentage of variance explained or the within-cluster sum of squared errors is used, but there are many choices. Even a plain Euclidean distance could work.
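As a rough sketch of the two measures mentioned above (assuming X is a NumPy array of shape (n_samples, n_features); the helper name is illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_error_summary(X, k):
        """Within-cluster sum of squares and the fraction of variance explained."""
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # Sum of squared distances of each point to its assigned centroid
        # (this equals km.inertia_).
        wss = np.sum((X - km.cluster_centers_[km.labels_]) ** 2)
        # Total sum of squares around the overall mean of the data.
        tss = np.sum((X - X.mean(axis=0)) ** 2)
        return wss, 1.0 - wss / tss  # the second value is the "variance explained"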




















