Is there a real life meaning about KMeans error?
I am trying to understand the meaning of the error reported by sklearn's KMeans.

In the context of house-price prediction, the error of a linear regression model has a concrete interpretation: roughly, the price difference per square foot between the prediction and the truth. Is there a comparable real-life meaning for the KMeans error?

Tags: machine-learning, data-mining, k-means
asked Jul 6 at 14:02 by baojieqh, edited Jul 7 at 10:21
2 Answers
The K-means error gives you what is known as the total intra-cluster variance.

Intra-cluster variance measures how spread out the points in a given cluster are. The first figure (not shown here) is a widely scattered cluster with high intra-cluster variance; the second has the same number of points, but they are densely packed and therefore have lower intra-cluster variance.

[figures: two scatter plots of 1000 points each, one sparse and one dense, plotted on the same axis limits]

The K-means error is the total of these individual cluster variances. If clustering 'A' produces clusters like the first figure and clustering 'B' produces clusters like the second, you will in most cases prefer 'B'. This does not mean the K-means error is a perfect objective to optimize when forming clusters, but it captures the essence of clustering.

Code used for the cluster plots:

import numpy as np
from matplotlib import pyplot as plt

# Sparse cluster: a large covariance spreads the points widely
sparse_samples = np.random.multivariate_normal([0, 0], [[50000, 0], [0, 50000]], size=1000)
plt.plot(sparse_samples[:, 0], sparse_samples[:, 1], 'b+')
axes = plt.gca()
axes.set_xlim(-1000, 1000)
axes.set_ylim(-1000, 1000)
plt.show()

# Dense cluster: a smaller covariance packs the same number of points tightly
dense_samples = np.random.multivariate_normal([0, 0], [[5000, 0], [0, 5000]], size=1000)
plt.plot(dense_samples[:, 0], dense_samples[:, 1], 'r+')
axes = plt.gca()
axes.set_xlim(-1000, 1000)
axes.set_ylim(-1000, 1000)
plt.show()

In both cases, 1000 data points are sampled from a bivariate normal distribution and plotted. In the second case, the covariance matrix is smaller, which produces a denser cluster. The documentation for np.random.multivariate_normal describes the parameters. Hope this helps!

edited Jul 7 at 12:24, answered Jul 6 at 15:37 by Yash Jakhotiya
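To connect the two point clouds above to the K-means objective, here is a minimal sketch (assuming scikit-learn is installed; the variable names are illustrative) that fits KMeans to both samples and compares their `inertia_` attribute, which is the total within-cluster sum of squared distances the answer describes:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Sparse and dense clouds, analogous to the two figures above
sparse = rng.multivariate_normal([0, 0], [[50000, 0], [0, 50000]], size=1000)
dense = rng.multivariate_normal([0, 0], [[5000, 0], [0, 5000]], size=1000)

# Fit a single-cluster KMeans to each; inertia_ is the total
# squared distance of the points to their assigned centroid
km_sparse = KMeans(n_clusters=1, n_init=10, random_state=0).fit(sparse)
km_dense = KMeans(n_clusters=1, n_init=10, random_state=0).fit(dense)

# The sparse cloud's inertia should be roughly 10x the dense cloud's,
# matching the ratio of the covariance scales
print(km_sparse.inertia_, km_dense.inertia_)
```

The denser cluster always yields the lower error, which is exactly the sense in which K-means prefers clusterings like the second figure.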
Comments:

– baojieqh (Jul 7 at 10:25): Thanks for your excellent explanation. Do you mind posting your code for the two figures?

– Yash Jakhotiya (Jul 7 at 12:25): Edited the answer to include the code used for plotting :)
The meaning of "error" in k-means clustering is: how much discrepancy, or loss of information, you would incur if you substituted the k centroids for the actual observations. In other words, how well the k centroids approximate your data.

There are several ways to measure this "error". Usually the percentage of variance explained or the within-cluster sum of squared errors is used, but there are many choices; even plain Euclidean distance can work.

answered Jul 6 at 15:35 by Leevo
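As a concrete check of the "substitute the centroids for the observations" reading, here is a small sketch (assuming scikit-learn; the data is synthetic) that recomputes the error by hand as a squared reconstruction loss and compares it to sklearn's reported value:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Replace each observation by its assigned centroid ...
reconstruction = km.cluster_centers_[km.labels_]

# ... and measure the information lost in that substitution
manual_error = np.sum((X - reconstruction) ** 2)

# This matches sklearn's reported inertia_
print(manual_error, km.inertia_)
```

So the K-means error is literally the cost, in squared distance, of compressing the data down to k representative points.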