I had this issue. I tried hierarchical methods but fell back to K-means, using a two-tier approach. The first tier ran K-means with k=40 (the data likely had more clusters than that). Then, for each resulting cluster, I ran K-means again for k=1 to k=8 and applied some analysis to the WCSS (within-cluster sum of squares) curve to decide whether there was a decent knee, choosing the appropriate k for each sub-cluster. There are some complications with this method: the WCSS curve might be close to a straight line, in which case you conclude there are no sub-clusters, or it might not be monotonically decreasing, in which case you might not have enough members in the cluster. As always, it depends on your data, the way you choose features, and how it’s embedded or tokenised. The final layer did some heavier tokenisation followed by de-duplication across all sub-clusters.
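To make the knee check a bit more concrete, here is a rough sketch of one way to pick k from a WCSS curve. It is not the exact analysis I used: the chord-distance heuristic, the function name `choose_k`, and the `flat_tol` threshold are all assumptions for illustration. It expects a list of WCSS values for k=1, 2, 3, ... (for example, collected from the `inertia_` attribute of scikit-learn's `KMeans` after fitting each k).

```python
def choose_k(wcss, flat_tol=0.1):
    """Pick k from a WCSS curve, where wcss[i] is the WCSS for k = i + 1.

    Returns:
      None - the curve is not monotonically decreasing (possibly too few
             members in the cluster to trust the result)
      1    - the curve is close to a straight line, so no sub-clusters
      k    - the k at the knee otherwise

    flat_tol is a hypothetical threshold; tune it for your own data.
    """
    n = len(wcss)
    if n < 3:
        return 1  # too few points to talk about a knee

    # Any increase from one k to the next means the curve is unreliable.
    if any(later > earlier for earlier, later in zip(wcss, wcss[1:])):
        return None

    span = wcss[0] - wcss[-1]
    if span <= 0:
        return 1  # completely flat curve

    # Normalise both axes to [0, 1], so the chord from the first point to
    # the last point is the line y = 1 - x. The knee is taken to be the
    # point that sits furthest below that chord.
    best_k, best_gap = 1, 0.0
    for i, w in enumerate(wcss):
        x = i / (n - 1)
        y = (w - wcss[-1]) / span
        gap = (1.0 - x) - y
        if gap > best_gap:
            best_k, best_gap = i + 1, gap

    # A small maximum gap means the curve is nearly a straight line.
    return best_k if best_gap >= flat_tol else 1


clear_knee = choose_k([100, 40, 12, 10, 9, 8, 7.5, 7])
straight_line = choose_k([80, 70, 60, 50, 40, 30, 20, 10])
not_monotonic = choose_k([50, 30, 35, 20])
```

In the two-tier setup you would run this once per first-tier cluster and only split the cluster further when the result is greater than 1.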
I don’t know how well the above description helps, but you might get some ideas.