I had the same question a few weeks back and this blog post was really helpful for me: https://together.ai/blog/redpajama-data-v2 . The scripts used are also open sources on the github repo.
I had the same question a few weeks back and this blog post was really helpful for me: https://together.ai/blog/redpajama-data-v2 . The scripts used are also open sources on the github repo.