Hey everyone,

I have a dataset of around 8 million prompt/response pairs collected and curated from a bunch of open-source datasets on HF. I want to know the best way to dedup it. I'm planning on doing this locally (4090 with 64 GB of RAM), and the few methods I've looked into so far weren't usable in my case because of my compute constraints.

Please let me know if y'all know an efficient method I can use!

TIA.

Careless-Age-4290@alien.top 1 points 10 months ago

You could hash each Q/A pair into a set or dictionary as you iterate through them and only keep a pair if its hash hasn't been seen yet. If you're looking for fuzzier matching, you could embed the pairs and use cosine similarity, throwing out anything whose nearest neighbor is too close.
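
Rough sketch of both ideas in Python, assuming sentence-transformers and FAISS are available. The model name, separator byte, and 0.95 threshold are just placeholders I picked for the example, not recommendations:

```python
import hashlib

def normalize(text: str) -> str:
    # Light normalization so trivially different copies hash the same.
    return " ".join(text.lower().split())

def exact_dedup(pairs):
    """Keep the first occurrence of each (prompt, response) pair."""
    seen, unique = set(), []
    for prompt, response in pairs:
        key = hashlib.sha256(
            (normalize(prompt) + "\x1f" + normalize(response)).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append((prompt, response))
    return unique

# Fuzzy pass: embed each pair and drop items whose nearest neighbor is too
# similar. Brute-force search over millions of vectors is slow; you'd likely
# want an approximate index (e.g. faiss.IndexIVFFlat) instead of IndexFlatIP.
import faiss                                         # pip install faiss-gpu or faiss-cpu
from sentence_transformers import SentenceTransformer

def fuzzy_dedup_indices(texts, threshold=0.95, batch_size=256):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap as needed
    emb = model.encode(texts, batch_size=batch_size,
                       normalize_embeddings=True, convert_to_numpy=True)
    index = faiss.IndexFlatIP(emb.shape[1])          # inner product == cosine on unit vectors
    index.add(emb)
    sims, ids = index.search(emb, 2)                 # each item's top hit is itself
    keep = []
    for i, (sim, j) in enumerate(zip(sims[:, 1], ids[:, 1])):
        # Keep the earlier member of any near-duplicate pair.
        if sim < threshold or j > i:
            keep.append(i)
    return keep
```

For 8 million rows you'd probably run the exact pass first (cheap, streams through the data) and only embed whatever survives for the fuzzy pass.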