this post was submitted on 31 Oct 2023

Machine Learning


Blog: https://together.ai/blog/redpajama-data-v2

Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

GitHub: https://github.com/togethercomputer/RedPajama-Data

Description:

RedPajama-V2 is an open dataset for training large language models. It includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals, and 20B documents are deduplicated.
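
To make the three subset sizes concrete, here is a toy sketch of how a raw pool can shrink to an annotated subset and then to a deduplicated one. This is not the RedPajama-V2 pipeline: the field names (`text`, `quality_signals`), the helper functions, and the use of exact-match SHA-1 hashing are all assumptions for illustration, as is applying deduplication to the annotated subset.

```python
import hashlib

# Toy records. The field names "text" and "quality_signals" are hypothetical,
# chosen for this sketch and not taken from the dataset's schema.
docs = [
    {"text": "The quick brown fox.", "quality_signals": {"word_count": 4}},
    {"text": "The quick brown fox.", "quality_signals": {"word_count": 4}},  # exact duplicate
    {"text": "Buy now!!!", "quality_signals": None},  # no signals attached
    {"text": "Large language models need a lot of text.", "quality_signals": {"word_count": 8}},
]


def with_signals(records):
    """Keep only records that carry quality signals."""
    return [r for r in records if r["quality_signals"] is not None]


def dedup_exact(records):
    """Drop exact duplicates by hashing the document text (assumed method)."""
    seen = set()
    unique = []
    for r in records:
        digest = hashlib.sha1(r["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(r)
    return unique


annotated = with_signals(docs)         # analogous to the 30B annotated documents
deduplicated = dedup_exact(annotated)  # analogous to the 20B deduplicated documents
print(len(docs), len(annotated), len(deduplicated))
```

Each stage can only shrink the pool, which matches the ordering of the figures in the description (100B, then 30B, then 20B).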

top 1 comments
[–] md1630@alien.top 1 points 2 years ago