this post was submitted on 30 Oct 2023

LocalLLaMA


Blog: https://together.ai/blog/redpajama-data-v2

Hugging Face: https://huggingface.co/datasets/togethercomputer/RedPajama-Data-V2

GitHub: https://github.com/togethercomputer/RedPajama-Data

Description:

RedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents coming from 84 CommonCrawl snapshots and processed using the CCNet pipeline. Out of these, there are 30B documents in the corpus that additionally come with quality signals, and 20B documents that are deduplicated.

[–] UserMinusOne@alien.top 1 points 1 year ago

How much free space is required to do a "git clone ..."?

Is there a better way to download the data that doesn't require additional space for the history (.git)? If so, how big is the whole dataset?

Given the current developments: maybe someone should start collecting raw data and serving it as torrents... just in case.