Hi,

we all know that text embeddings (e.g., SBERT, SimCSE, LLM embeddings) are very powerful. However, my little grudge with them has always been that it's hard to say what's really in them. Comparing them gives some value of "relatedness" or "similarity", but that value is hard to interpret: text can be really diverse and is often similar in some respects but not in others.

Here's an example:

"The man builds a tent"

"Two men build a tent"

A text embedding model such as SBERT gives these a high similarity score, which is fine, since the sentences are in fact quite similar. However, they're similar because they're mostly about the same topic, while they're dissimilar in their use of number: in the first sentence there is one man, in the second there are two!
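
For concreteness, this is what that single opaque score looks like with an off-the-shelf model (a minimal sketch using sentence-transformers; the model name is just an example):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any SBERT-style model works here

a = model.encode("The man builds a tent", convert_to_tensor=True)
b = model.encode("Two men build a tent", convert_to_tensor=True)

# One opaque number: high, but it doesn't tell us *why* the sentences match
print(util.cos_sim(a, b).item())
```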

My idea was to fine-tune the text embedding model such that we have multiple sub-embeddings of which we know what's in them. This way, we can inspect how the overall score is regulated. E.g. in the example, we'd have a high score since the sentences have the same topic and the "topic" sub-embeddings match well, but we also modulate the score slightly downwards since our "number"-sub-embeddings that have the task of capturing quantification/number information are different.
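
To illustrate the idea (the feature names and slice boundaries below are made up for illustration, not what the repo actually uses), the embedding is treated as a concatenation of named sub-embeddings that can each be compared on their own:

```python
import torch
import torch.nn.functional as F

# Hypothetical layout: the first 64 dims capture "topic", the next 16 capture "number",
# and the rest holds whatever residual information the model needs.
FEATURE_SLICES = {
    "topic": slice(0, 64),
    "number": slice(64, 80),
    "residual": slice(80, 384),
}

def explain_similarity(emb_a: torch.Tensor, emb_b: torch.Tensor) -> dict:
    """Cosine similarity per sub-embedding, to see which aspects agree and which disagree."""
    return {
        name: F.cosine_similarity(emb_a[sl], emb_b[sl], dim=0).item()
        for name, sl in FEATURE_SLICES.items()
    }
```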

I've written some code that allows you to structure text embeddings into interpretable semantic features according to your use-case. The basic steps are really simple:

  1. Define a few interpretable metrics that measure similarity with respect to certain aspects you're interested in (e.g., polarity, negative/positive sentiment, topic, and so on; you can be creative!). A toy sketch of steps 1 and 2 follows after the list.

  2. Assign each metric some part of the embedding.

  3. Fine-tune a sentence embedding model on the metric scores, so that the information gets pushed to the assigned parts and your interpretable metrics get reflected there.
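
Here's the promised toy sketch of steps 1 and 2. The metrics are crude heuristics and the slice sizes are arbitrary; they only illustrate the shape of the setup, not the metrics used in the repo:

```python
import re

# Step 1: a couple of toy interpretable metrics (real ones can be far more sophisticated).
def number_metric(sent_a: str, sent_b: str) -> float:
    """1.0 if both sentences quantify things the same way, else 0.0 (crude heuristic)."""
    pattern = r"\b(one|two|three|several|many|\d+)\b"
    return float(set(re.findall(pattern, sent_a.lower()))
                 == set(re.findall(pattern, sent_b.lower())))

def topic_metric(sent_a: str, sent_b: str) -> float:
    """Toy proxy for topical similarity: word overlap (Jaccard)."""
    wa, wb = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

# Step 2: give each metric its own region of the embedding (sizes here are arbitrary).
METRICS = {"topic": topic_metric, "number": number_metric}
FEATURE_SLICES = {"topic": slice(0, 64), "number": slice(64, 80)}  # remaining dims: residual
```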

During training we take care not to mess up the model: we control the information-routing process by ensuring that the overall similarity of the embeddings stays about the same as the similarity obtained with a frozen copy of the embedding model.
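
A rough sketch of what such a training objective could look like (this is not the exact loss from the repo; `FEATURE_SLICES` is the hypothetical feature-to-slice mapping from the sketch above):

```python
import torch.nn.functional as F

def s3_loss(emb_a, emb_b, frozen_a, frozen_b, metric_scores):
    """Hypothetical training objective: emb_* come from the model being fine-tuned,
    frozen_* from a frozen copy, metric_scores maps feature name -> target similarity."""
    loss = 0.0
    # Regression term: each sub-embedding's similarity should match its metric score,
    # which pushes that metric's information into the assigned dimensions.
    for name, sl in FEATURE_SLICES.items():
        pred = F.cosine_similarity(emb_a[:, sl], emb_b[:, sl], dim=-1)
        loss = loss + F.mse_loss(pred, metric_scores[name])
    # Consistency term: the overall similarity should stay close to what the frozen
    # model says, so fine-tuning doesn't wreck the original embedding space.
    loss = loss + F.mse_loss(
        F.cosine_similarity(emb_a, emb_b, dim=-1),
        F.cosine_similarity(frozen_a, frozen_b, dim=-1),
    )
    return loss
```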

In the end, the final text embedding is structured into different sub-embeddings. You can use these sub-embeddings for fine-grained semantic search or clustering, or simply to explain a similarity rating of the embedding model.
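
For example, a fine-grained search that ranks only by the hypothetical "topic" sub-embedding might look like this (reusing `model`, here assumed to be the fine-tuned sentence embedding model, and `FEATURE_SLICES` from the sketches above):

```python
from sentence_transformers import util

# Fine-grained search: rank a small corpus by the "topic" sub-embedding only,
# ignoring differences in number, polarity, etc.
corpus = ["Two men build a tent", "A woman pitches a tent", "Stock prices fell sharply"]
corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode("The man builds a tent", convert_to_tensor=True)

topic = FEATURE_SLICES["topic"]
hits = util.semantic_search(query_emb[topic].unsqueeze(0), corpus_emb[:, topic], top_k=3)
print(hits[0])  # list of {"corpus_id": ..., "score": ...}
```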

Here's the code for structuring your custom embeddings:

https://github.com/flipz357/S3BERT

The code is released under the MIT license.

kazza789@alien.top 1 points 10 months ago

Could you use a traditional embedding, and then somehow search for a vector that represents the semantic feature you are interested in? What I mean is: since LLMs can understand the concept of numbers, and this is a pretty fundamental part of language, presumably (but not necessarily) there is a vector in the high-dimensional embedding space that represents the concept of "how many". I'm thinking, of course, along the lines of the traditional example of "king" - "male" + "female" = "queen", where you could, for example, define a "gender" vector based on "male", "female" and perhaps a set of other related words.

I'm not sure how feasible that is at all - I'm just curious if it's something you explored or read about as you were doing this?
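
A rough sketch of the direction-vector idea described above (the sentence pairs and the model name are purely illustrative, and whether the resulting projection is reliable is exactly the open question):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # model name is just an example

# Estimate a "number" direction from a few singular/plural minimal pairs.
pairs = [
    ("The man builds a tent", "Two men build a tent"),
    ("A dog barks", "Several dogs bark"),
    ("She buys an apple", "She buys three apples"),
]
singular = model.encode([a for a, _ in pairs])
plural = model.encode([b for _, b in pairs])
number_dir = (plural - singular).mean(axis=0)
number_dir /= np.linalg.norm(number_dir)

# Project unseen sentences onto the direction as a rough "plurality" score.
for s in ["One child sings", "Many children sing"]:
    print(s, float(model.encode(s) @ number_dir))
```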