ArtifartX

joined 10 months ago
[–] ArtifartX@alien.top 1 points 9 months ago

I've definitely seen a few of those.

[–] ArtifartX@alien.top 1 points 9 months ago

Anyone dumb enough to take a timeline that comes out of Musk's mouth seriously for anything in this day and age...

[–] ArtifartX@alien.top 1 points 10 months ago (3 children)

I agree with your sentiment here. But you can't deny the influx of papers that take something extremely simple or inconsequential and deliberately dress it up to look as complex as possible just to get published. Whatever your sentiment (which, again, I mostly agree with), those kinds of papers are not good and we'd all be better off without them. I think there is a place for shame for certain types of papers, and I'd disagree with the idea that shame is always bad or should never be used as a tool.

[–] ArtifartX@alien.top 1 points 10 months ago
  • If the fact was at the beginning of the document, it was recalled regardless of context length

Lol at OpenAI adding a cheap trick like this, since they know the first thing people will test at high context lengths is recall from the beginning.
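If you want to check this yourself, a rough needle-in-a-haystack test is easy to script. Everything below is a sketch: the model name, filler text, and passphrase are placeholders I made up, and it assumes the `openai` Python client with an API key in the environment.

```python
# Sketch: needle-in-a-haystack recall test at varying context lengths.
# Assumes the `openai` client and whatever long-context model you have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The secret passphrase is 'blue-tangerine-42'."
FILLER = "The quick brown fox jumps over the lazy dog. " * 50
QUESTION = "What is the secret passphrase? Answer with the passphrase only."

def build_haystack(total_chunks: int, needle_position: float) -> str:
    """Plant the needle at a relative position (0.0 = start, ~1.0 = end)."""
    chunks = [FILLER] * total_chunks
    chunks.insert(int(needle_position * total_chunks), NEEDLE)
    return "\n".join(chunks)

def recall_at(total_chunks: int, needle_position: float) -> bool:
    document = build_haystack(total_chunks, needle_position)
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder: any long-context chat model
        messages=[{"role": "user", "content": document + "\n\n" + QUESTION}],
    )
    return "blue-tangerine-42" in response.choices[0].message.content

# The claim above is that position 0.0 (start of document) keeps working
# even at lengths where middle positions start failing.
for chunks in (50, 200, 800):
    for pos in (0.0, 0.5, 0.9):
        print(chunks, pos, recall_at(chunks, pos))
```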

[–] ArtifartX@alien.top 1 points 10 months ago

Yeah, doing this is part of what spurred the question. I began to notice that some datasets were very clean and ordered into data pairs, others seemed to be formatted differently, and others still looked like a massive chunk of unstructured text had just been fed in. It left me confused about whether there were standards I wasn't aware of.
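For anyone following along, here's roughly what one row of each flavor looks like as JSONL. The field names are illustrative; different datasets pick different keys, which is part of why it looks so inconsistent:

```python
import json

# 1. Clean instruction pairs (Alpaca-style keys, illustrative)
pair_row = {
    "instruction": "Summarize the water cycle in one sentence.",
    "output": "Water evaporates, condenses into clouds, and returns as precipitation.",
}

# 2. Chat/conversation format (role-tagged turns)
chat_row = {
    "messages": [
        {"role": "user", "content": "What causes rain?"},
        {"role": "assistant", "content": "Condensed water vapor falling from clouds."},
    ]
}

# 3. Raw unstructured text (plain language-modeling data, no pairs at all)
raw_row = {"text": "The water cycle describes how water moves through the atmosphere ..."}

for row in (pair_row, chat_row, raw_row):
    print(json.dumps(row))
```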

[–] ArtifartX@alien.top 1 points 10 months ago

Awesome, thank you. At a glance this looks like it will be very helpful.

[–] ArtifartX@alien.top 1 points 10 months ago

Thanks for the information and explanation

 

I have some basic confusion about how to prepare a dataset for training. My plan is to use a model like Llama 2 7B Chat and train it on some proprietary data I have (in its raw format, this data is very similar to a textbook). Do I need to find a way to reformat this large amount of text into a bunch of "query" and "output" pairs?

I have seen some LLMs described as "trained on Wikipedia," which suggests they were trained on that large chunk of text alone, without reformatting it into data pairs. Is there a way I can do that too? Or, since I want to target a chat model, do I have to convert the data into pairs that serve as examples of proper input and output?
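Edit: for anyone who finds this later, the "trained on Wikipedia" style works because the plain causal-LM objective only needs raw text; pairs matter for teaching chat behavior, not for absorbing content. Below is a minimal sketch of raw-text fine-tuning with Hugging Face Transformers. `textbook.txt` stands in for the proprietary data, and it assumes you have access to the Llama 2 weights:

```python
# Sketch: continued pretraining on raw text, no query/output pairs needed.
# Assumes `transformers` and `datasets` are installed.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# A plain text file: the causal-LM objective just predicts the next token.
dataset = load_dataset("text", data_files={"train": "textbook.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, full fine-tuning of a 7B model won't fit on most single GPUs; LoRA/QLoRA via the `peft` library is the usual workaround, but the data preparation side is the same.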

[–] ArtifartX@alien.top 1 points 10 months ago (1 children)

What service do you use for GPU rental and inference for it?

[–] ArtifartX@alien.top 1 points 10 months ago

It is still relatively censored, but a great base to work with.