this post was submitted on 10 Nov 2023
1 points (100.0% liked)

LocalLLaMA

1 readers
1 users here now

Community to discuss about Llama, the family of large language models created by Meta AI.

founded 10 months ago
MODERATORS
 

I have some basic confusions over how to prepare a dataset for training. My plan is to use a model like llama2 7b chat, and train it on some proprietary data I have (in its raw format, this data is very similar to a text book). Do I need to find a way to reformat this large amount of text into a bunch of pairs like "query" and "output" ?

I have seen some LLM's which say things like "trained on Wikipedia" which seems like they were able to train it on that large chunk of text alone without reformatting it into data pairs - is there a way I can do that, too? Or since I want to target a chat model, I have to find a way to convert the data into pairs which basically serve as examples of proper input and output?

you are viewing a single comment's thread
view the rest of the comments
[–] FPham@alien.top 1 points 10 months ago (2 children)

Trained and finetuned - 2 things.

The trained on wikipedia - yes, they feed the wikipedia articles to it - hook and sinker. No Q/A. But that doesn't mean it will be able to give you answer, unless you fine tune it with Q/A "I want you to behave like this" template - but the kick is - what we all are using to our huge advantage - it can be fine-tuned on a totally different Q/A, it will still be able to answer from wikipedia. It's a hat trick.

[–] psdwizzard@alien.top 1 points 10 months ago

I am new to LLMs (I normally train Image Models) so if this is a stupid question let me know.

I have been converting the shadowrun lore wiki into Q and A so i can use that model for a sillytavern character as a contact in my current tabletop game. Do I really need to convert it all to Q and A? If I get a better "Contact" I dont mind.

[–] ArtifartX@alien.top 1 points 10 months ago

Thanks for the information and explanation