Machine Learning

[D] Unsupervised Clustering without knowing number of classes (alien.top)

submitted 2 years ago by BigBrainUrinal@alien.top to c/machinelearning@academy.garden

Does anyone know where to find the best models for unsupervised clustering problems that don't specify the number of classes? For example, I googled unsupervised MNIST, but IIC, which holds the record, requires the output dimension (k=10) to be specified. Is there a name for unsupervised clustering without knowing the number of classes? (I know of density/hierarchical clustering algorithms but am unaware of many deep learning ones.) And specifically, are results charted anywhere? I'm researching the topic and it seems knowing the number of things you're looking for is half the battle. I can find papers on methods that aim to find the number of clusters etc., but are there any benchmarks to compare?

[–] kduyehj@alien.top 1 points 2 years ago (1 children)

I had this issue. Tried hierarchical methods but fell back to k-means using a two-tier method. The first tier hunted for k=40 clusters (the data likely had more). Then, for each resulting cluster, I applied k-means again from k=1 to k=8 and used some analysis of the WCSS (within-cluster sum of squares) curve to decide whether there was a decent knee in the curve, and chose the appropriate k for each sub-cluster. There are some complications using this method because the WCSS curve might be close to a straight line, so you conclude there are no sub-clusters. Or it might not be monotonically decreasing, in which case you might not have enough members in the cluster. As always, it depends on your data, the way you choose features, and how it’s embedded or tokenised. The final layer did some heavier tokenisation followed by de-duplication across all sub-clusters.

I don’t know how well the above description helps, but you might get some ideas.
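A rough sketch of the two-tier idea described above, assuming scikit-learn's `KMeans` (its `inertia_` attribute is the WCSS). The comment does not say which analysis was run on the WCSS curve, so the knee rule here is a stand-in: take the largest k whose step from k-1 cut the WCSS by at least `min_drop`. The names `wcss_curve`, `choose_sub_k`, `two_tier_kmeans` and the 0.5 threshold are all hypothetical, and the final de-duplication step is left out.

```python
import numpy as np
from sklearn.cluster import KMeans


def wcss_curve(points, k_max=8, seed=0):
    """WCSS (KMeans inertia) for k = 1 .. k_max, capped at the number of points."""
    k_max = min(k_max, len(points))
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(points).inertia_
        for k in range(1, k_max + 1)
    ]


def choose_sub_k(points, k_max=8, min_drop=0.5, seed=0):
    """Hypothetical knee rule: the largest k whose step from k-1 removed at
    least `min_drop` of the remaining WCSS. Returns 1 if no step qualifies,
    i.e. the curve is too flat to justify sub-clusters."""
    wcss = wcss_curve(points, k_max, seed)
    best = 1
    for k in range(2, len(wcss) + 1):
        prev, cur = wcss[k - 2], wcss[k - 1]
        if prev > 0 and (prev - cur) / prev >= min_drop:
            best = k
    return best


def two_tier_kmeans(data, k_first=40, k_sub_max=8, seed=0):
    """Tier 1: coarse k-means. Tier 2: re-cluster each coarse cluster with its own k."""
    first = KMeans(n_clusters=k_first, n_init=10, random_state=seed).fit_predict(data)
    labels = np.empty(len(data), dtype=int)
    next_label = 0
    for c in range(k_first):
        idx = np.flatnonzero(first == c)
        k = choose_sub_k(data[idx], k_sub_max, seed=seed)
        if k > 1:
            sub = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(data[idx])
        else:
            sub = np.zeros(len(idx), dtype=int)
        labels[idx] = next_label + sub
        next_label += k
    return labels


# Toy data: four tight blobs, deliberately under-clustered in tier 1 (k_first=2).
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [0, 6], [40, 0], [40, 6]], dtype=float)
data = np.vstack([c + 0.5 * rng.standard_normal((50, 2)) for c in centers])
labels = two_tier_kmeans(data, k_first=2, k_sub_max=8)
print("clusters found:", len(np.unique(labels)))
```

Both `k_first` and `k_sub_max` stay hyper-parameters here, as in the comment; the threshold rule is only one of many ways to read a knee off the curve.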

[–] BigBrainUrinal@alien.top 1 points 2 years ago (1 children)

Are you saying you find 40 clusters, then search for 1-8 within each of them, giving 40-320 possible clusters? I likely don't have that many events happening, but it's an interesting idea.

[–] kduyehj@alien.top 1 points 2 years ago

If the particular data potentially has, say, 50 clusters and you ask k-means for 40, you will get 40, and then 1 to 10 of those could lend themselves to finding sub-clusters. So the majority of the 40 clusters won’t exhibit a WCSS curve with a knee, and you can therefore conclude they are “good” clusters. (There’s a bit more to it than that, by the way, but this is part of the idea.) In the lucky case this could be 39 good clusters where the remaining one is mixed up with things that don’t fit well. Maybe these are outliers or poorly represented in the input space. Or you might get up to 5 “nearly good” clusters where each has two sub-clusters.

Of course if your input data only has say 20 clusters by whatever definition, then asking for 40 will incorrectly separate some data. This is why I then used some de-duplication.

You’d need to understand the distribution of your data and apply techniques that suit.

I’m not saying this approach is a general solution, it’s just an idea that worked out for me in my case. All I needed was a single representative from each cluster and it didn’t matter much if two or more of those should have been treated the same.

In my case, the initial (k=40) is a hyper-parameter, as is the choice to search for up to 8 sub clusters.

The graphs and analysis of the 2nd tier WCSS data give a reasonable measure of performance.

[–] visarga@alien.top 1 points 2 years ago

You already mentioned hierarchical methods but I got my best class-count agnostic clustering with fclusterdata from scipy:

from scipy.cluster.hierarchy import fclusterdata
labels = fclusterdata(data, t=threshold, criterion='distance')
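For anyone wanting to try that call end to end, here is a small self-contained sketch on made-up data. The three synthetic blobs and the `threshold = 3.0` value are assumptions for illustration only; with `criterion='distance'` the threshold is a cut on the linkage distance, so it has to be tuned to the scale of your own data. `fclusterdata` defaults to single linkage with Euclidean distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Synthetic data (assumption): three tight 2-D blobs, roughly 10 units apart.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
data = np.vstack([c + 0.3 * rng.standard_normal((30, 2)) for c in centers])

# Cut the hierarchy wherever the linkage distance exceeds the threshold.
# No cluster count is passed in; it falls out of the threshold.
threshold = 3.0
labels = fclusterdata(data, t=threshold, criterion='distance')

n_clusters = len(np.unique(labels))
print("clusters found:", n_clusters)
```

The number of classes is replaced by a distance threshold, so this is agnostic to the count but not free of hyper-parameters.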

[–] BigBayesian@alien.top 1 points 2 years ago

Check out model selection. There are heuristic scores that can work okay, such as AIC and BIC.

Basically, it comes down to trading off quality of fit (distance from datapoints to cluster means) with complexity of model.
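One way to try this, sketched under the assumption that a Gaussian mixture is an acceptable stand-in for the clustering model: fit scikit-learn's `GaussianMixture` for a range of component counts and keep the one with the lowest BIC. The helper name `select_k_by_bic`, the synthetic blobs, and `k_max=6` are made up for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_k_by_bic(data, k_max=10, seed=0):
    """Fit a GMM for k = 1 .. k_max and return (best_k, list of BIC scores).

    BIC = -2 * log-likelihood + (number of parameters) * ln(n), so a better
    fit lowers the score and every extra component raises it. Lower is better.
    """
    bics = []
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(
            n_components=k, covariance_type="full", n_init=3, random_state=seed
        ).fit(data)
        bics.append(gmm.bic(data))
    return int(np.argmin(bics)) + 1, bics


# Synthetic data (assumption): three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
data = np.vstack([c + rng.standard_normal((100, 2)) for c in centers])

best_k, bics = select_k_by_bic(data, k_max=6)
print("k chosen by BIC:", best_k)
```

Swapping `gmm.bic` for `gmm.aic` gives the AIC variant, which penalises complexity less heavily and so tends to pick more components.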