LocalLLaMA

3 readers

1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 1 year ago

MODERATORS

communick@poweruser.forum

Point me towards some basic dataset preparation tips for LLM's? (alien.top)

submitted 1 year ago by ArtifartX@alien.top to c/localllama@poweruser.forum

12 comments fedilink hide all child comments

I have some basic confusions over how to prepare a dataset for training. My plan is to use a model like llama2 7b chat, and train it on some proprietary data I have (in its raw format, this data is very similar to a text book). Do I need to find a way to reformat this large amount of text into a bunch of pairs like "query" and "output" ?

I have seen some LLM's which say things like "trained on Wikipedia" which seems like they were able to train it on that large chunk of text alone without reformatting it into data pairs - is there a way I can do that, too? Or since I want to target a chat model, I have to find a way to convert the data into pairs which basically serve as examples of proper input and output?

you are viewing a single comment's thread
view the rest of the comments

[–] __SlimeQ__@alien.top 1 points 1 year ago (1 children)

if you're making a lora, training on wikipedia directly will pretty much make it output text that looks like wikipedia. which is to say it will (probably) be worse at chatting.

a strategy i've been using lately is to get gpt4 to make a conversation in my chosen format *about* each chapter of my "textbook", i can automate this with pretty good results and it's done in about 10 minutes. It does kind of work, it'll at least get the bot to talk about the topics I chose, but as far as actually comprehending the information it's referencing... it's bad. It gets better as I increase rank, but it takes a lot of VRAM. I can only get to around 256 before it'll die

[–] Hey_You_Asked@alien.top 1 points 1 year ago

please share!!