Hey everyone,

I have a dataset of around 8 million prompt/response pairs collected and curated from a bunch of open-source datasets on HF. I want to know the best way to dedup it. I'm planning on doing this locally (4090 with 64 GB of RAM), and the few methods I've looked into so far weren't usable in my case because of my compute constraints.

Please let me know if y'all know an efficient method I can use!

TIA.

Careless-Age-4290@alien.top 1 points 10 months ago

You could hash each Q/A pair into a set or dictionary as you iterate through them and only keep a pair if its hash hasn't been seen yet. If you're looking for fuzzier matching, you could embed the pairs and use cosine similarity, throwing out anything whose nearest neighbor is too close.
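
Rough sketch of both ideas in Python, assuming sentence-transformers and FAISS are available. The model name, separator byte, and 0.95 threshold are just placeholders I picked for the example, not recommendations:

```python
import hashlib

def normalize(text: str) -> str:
    # Light normalization so trivially different copies hash the same.
    return " ".join(text.lower().split())

def exact_dedup(pairs):
    """Keep the first occurrence of each (prompt, response) pair."""
    seen, unique = set(), []
    for prompt, response in pairs:
        key = hashlib.sha256(
            (normalize(prompt) + "\x1f" + normalize(response)).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((prompt, response))
    return unique

# Fuzzy pass: embed each pair and drop items whose nearest neighbor is too
# similar. Brute-force search over millions of vectors is slow; you'd likely
# want an approximate index (e.g. faiss.IndexIVFFlat) instead of IndexFlatIP.
import faiss                                         # pip install faiss-gpu or faiss-cpu
from sentence_transformers import SentenceTransformer

def fuzzy_dedup_indices(texts, threshold=0.95, batch_size=256):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap as needed
    emb = model.encode(texts, batch_size=batch_size,
                       normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(emb.shape[1])          # inner product == cosine on unit vectors
    index.add(emb)
    sims, ids = index.search(emb, 2)                 # each item's top hit is itself
    keep = []
    for i, (sim, j) in enumerate(zip(sims[:, 1], ids[:, 1])):
        # Keep the earlier member of any near-duplicate pair.
        if sim < threshold or j > i:
            keep.append(i)
    return keep
```

For 8 million rows you'd probably run the exact pass first (cheap, streams through the data) and only embed whatever survives for the fuzzy pass.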