Is there a real life meaning about KMeans error?
I am trying to understand the meaning of the error reported by sklearn's KMeans.

In the context of house-price prediction, the error of a linear regression model has a concrete interpretation: roughly, the price difference per square foot between the prediction and the truth. Is there a comparable real-life meaning for the KMeans error?

Tags: machine-learning, data-mining, k-means
asked Jul 6 at 14:02 by baojieqh, edited Jul 7 at 10:21
2 Answers
The K-means error gives you what is known as the total intra-cluster variance.

Intra-cluster variance measures how spread out the points in a given cluster are. The first figure (not shown here) is a widely scattered cluster with high intra-cluster variance; the second has the same number of points, but they are densely packed and therefore have lower intra-cluster variance.

[figures: two scatter plots of 1000 points each, one sparse and one dense, plotted on the same axis limits]

The K-means error is the total of these individual cluster variances. If clustering 'A' produces clusters like the first figure and clustering 'B' produces clusters like the second, you will in most cases prefer 'B'. This does not mean the K-means error is a perfect objective to optimize when forming clusters, but it captures the essence of clustering.

Code used for the cluster plots:

import numpy as np
from matplotlib import pyplot as plt

# Sparse cluster: a large covariance spreads the points widely
sparse_samples = np.random.multivariate_normal([0, 0], [[50000, 0], [0, 50000]], size=1000)
plt.plot(sparse_samples[:, 0], sparse_samples[:, 1], 'b+')
axes = plt.gca()
axes.set_xlim(-1000, 1000)
axes.set_ylim(-1000, 1000)
plt.show()

# Dense cluster: a smaller covariance packs the same number of points tightly
dense_samples = np.random.multivariate_normal([0, 0], [[5000, 0], [0, 5000]], size=1000)
plt.plot(dense_samples[:, 0], dense_samples[:, 1], 'r+')
axes = plt.gca()
axes.set_xlim(-1000, 1000)
axes.set_ylim(-1000, 1000)
plt.show()

In both cases, 1000 data points are sampled from a bivariate normal distribution and plotted. In the second case, the covariance matrix is smaller, which produces a denser cluster. The documentation for np.random.multivariate_normal describes the parameters. Hope this helps!

edited Jul 7 at 12:24, answered Jul 6 at 15:37 by Yash Jakhotiya
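To connect the two point clouds above to the K-means objective, here is a minimal sketch (assuming scikit-learn is installed; the variable names are illustrative) that fits KMeans to both samples and compares their `inertia_` attribute, which is the total within-cluster sum of squared distances the answer describes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Sparse and dense clouds, analogous to the two figures above
sparse = rng.multivariate_normal([0, 0], [[50000, 0], [0, 50000]], size=1000)
dense = rng.multivariate_normal([0, 0], [[5000, 0], [0, 5000]], size=1000)

# Fit a single-cluster KMeans to each; inertia_ is the total
# squared distance of the points to their assigned centroid
km_sparse = KMeans(n_clusters=1, n_init=10, random_state=0).fit(sparse)
km_dense = KMeans(n_clusters=1, n_init=10, random_state=0).fit(dense)

# The sparse cloud's inertia should be roughly 10x the dense cloud's,
# matching the ratio of the covariance scales
print(km_sparse.inertia_, km_dense.inertia_)
```

The denser cluster always yields the lower error, which is exactly the sense in which K-means prefers clusterings like the second figure.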
Comments:

– baojieqh (Jul 7 at 10:25): Thanks for your excellent explanation. Do you mind posting your code for the two figures?

– Yash Jakhotiya (Jul 7 at 12:25): Edited the answer to include the code used for plotting :)
The meaning of "error" in k-means clustering is: how much discrepancy, or loss of information, you would incur if you substituted the k centroids for the actual observations. In other words, how well the k centroids approximate your data.

There are several ways to measure this "error". Usually the percentage of variance explained or the within-cluster sum of squared errors is used, but there are many choices; even plain Euclidean distance can work.

answered Jul 6 at 15:35 by Leevo
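As a concrete check of the "substitute the centroids for the observations" reading, here is a small sketch (assuming scikit-learn; the data is synthetic) that recomputes the error by hand as a squared reconstruction loss and compares it to sklearn's reported value:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Replace each observation by its assigned centroid ...
reconstruction = km.cluster_centers_[km.labels_]

# ... and measure the information lost in that substitution
manual_error = np.sum((X - reconstruction) ** 2)

# This matches sklearn's reported inertia_
print(manual_error, km.inertia_)
```

So the K-means error is literally the cost, in squared distance, of compressing the data down to k representative points.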