this post was submitted on 30 Oct 2023

LocalLLaMA


Community to discuss about Llama, the family of large language models created by Meta AI.


Blog: https://together.ai/blog/redpajama-data-v2

Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

GitHub: https://github.com/togethercomputer/RedPajama-Data

Description:

RedPajama-V2 is an open dataset for training large language models. It includes over 100B text documents drawn from 84 CommonCrawl snapshots and processed with the CCNet pipeline. Of these, 30B documents additionally come with quality signals, and 20B documents are deduplicated.
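As a rough illustration of what document-level deduplication means here, the sketch below keeps only the first occurrence of each document by exact content hash. This is a minimal, hypothetical example for intuition only; the actual RedPajama pipeline (see the GitHub repo above) uses more sophisticated methods, including fuzzy deduplication.

```python
import hashlib

def dedup_documents(docs):
    """Exact-match dedup: keep the first occurrence of each document,
    keyed by a SHA-1 hash of its content. Real pipelines typically add
    fuzzy dedup (e.g. MinHash) on top of this."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha1(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = ["the cat sat on the mat", "an unrelated web page", "the cat sat on the mat"]
print(dedup_documents(docs))  # ['the cat sat on the mat', 'an unrelated web page']
```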

[–] FairSum@alien.top 1 points 1 year ago

Man, 30T tokens deduplicated is a lot of data.

For reference, Llama 2 was trained on 2T tokens, and GPT-4 is believed to have been trained on about 13T (my suspicion is that Turbo was too). This is much, much more data than that.
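Taking the comment's figures at face value (the 13T number for GPT-4 is an estimate, not a confirmed figure), the ratios work out to:

```python
# Token counts as cited in the comment above; the GPT-4 figure is speculative.
LLAMA2_TOKENS = 2e12        # 2T (Llama 2 paper)
GPT4_TOKENS = 13e12         # ~13T (rumored, unconfirmed)
REDPAJAMA_V2_TOKENS = 30e12 # 30T deduplicated

print(REDPAJAMA_V2_TOKENS / LLAMA2_TOKENS)          # 15.0
print(round(REDPAJAMA_V2_TOKENS / GPT4_TOKENS, 1))  # 2.3
```

So the deduplicated corpus is roughly 15x the Llama 2 training set and over 2x even the rumored GPT-4 figure.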