ArtifartX

joined 10 months ago
[–] ArtifartX@alien.top 1 points 9 months ago

I've definitely seen a few of those.

[–] ArtifartX@alien.top 1 points 9 months ago

Anyone dumb enough to take a timeline that comes out of Musk's mouth seriously for anything in this day and age...

[–] ArtifartX@alien.top 1 points 10 months ago (3 children)

I agree with your sentiment here. But you can't deny the influx of papers that take something extremely simple or inconsequential and deliberately dress it up to look as complex as possible just to get published. Whatever your sentiment (which, again, I mostly agree with), those kinds of papers are not good and we'd all be better off without them. I think there is a place for shame for certain types of papers, and I'd disagree with the idea that shame is always bad or should never be used as a tool.

[–] ArtifartX@alien.top 1 points 10 months ago
  • If the fact was at the beginning of the document, it was recalled regardless of context length

Lol at OpenAI adding a cheap trick like this, since they know the first thing people will test at high context lengths is recall from the beginning.
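If you want to check this yourself, a rough needle-in-a-haystack test is easy to script. Everything below is a sketch: the model name, filler text, and passphrase are placeholders I made up, and it assumes the `openai` Python client with an API key in the environment.

```python
# Sketch: needle-in-a-haystack recall test at varying context lengths.
# Assumes the `openai` client and whatever long-context model you have access to.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

NEEDLE = "The secret passphrase is 'blue-tangerine-42'."
FILLER = "The quick brown fox jumps over the lazy dog. " * 50
QUESTION = "What is the secret passphrase? Answer with the passphrase only."

def build_haystack(total_chunks: int, needle_position: float) -> str:
    """Plant the needle at a relative position (0.0 = start, ~1.0 = end)."""
    chunks = [FILLER] * total_chunks
    chunks.insert(int(needle_position * total_chunks), NEEDLE)
    return "\n".join(chunks)

def recall_at(total_chunks: int, needle_position: float) -> bool:
    document = build_haystack(total_chunks, needle_position)
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder: any long-context chat model
        messages=[{"role": "user", "content": document + "\n\n" + QUESTION}],
    )
    return "blue-tangerine-42" in response.choices[0].message.content

# The claim above is that position 0.0 (start of document) keeps working
# even at lengths where middle positions start failing.
for chunks in (50, 200, 800):
    for pos in (0.0, 0.5, 0.9):
        print(chunks, pos, recall_at(chunks, pos))
```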

[–] ArtifartX@alien.top 1 points 10 months ago

Yeah, doing this is part of what spurred the question. I began to notice that some datasets were very clean and ordered into data pairs, others seemed to be formatted differently, and others still looked like a massive chunk of unstructured text had just been fed in. It left me confused about whether there were standards I wasn't aware of.
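For anyone following along, here's roughly what one row of each flavor looks like as JSONL. The field names are illustrative; different datasets pick different keys, which is part of why it looks so inconsistent:

```python
import json

# 1. Clean instruction pairs (Alpaca-style keys, illustrative)
pair_row = {
    "instruction": "Summarize the water cycle in one sentence.",
    "output": "Water evaporates, condenses into clouds, and returns as precipitation.",
}

# 2. Chat/conversation format (role-tagged turns)
chat_row = {
    "messages": [
        {"role": "user", "content": "What causes rain?"},
        {"role": "assistant", "content": "Condensed water vapor falling from clouds."},
    ]
}

# 3. Raw unstructured text (plain language-modeling data, no pairs at all)
raw_row = {"text": "The water cycle describes how water moves through the atmosphere ..."}

for row in (pair_row, chat_row, raw_row):
    print(json.dumps(row))
```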

[–] ArtifartX@alien.top 1 points 10 months ago

Awesome, thank you. At a glance this looks like it will be very helpful.

[–] ArtifartX@alien.top 1 points 10 months ago

Thanks for the information and explanation

 

I have some basic confusion about how to prepare a dataset for training. My plan is to use a model like Llama 2 7B Chat and train it on some proprietary data I have (in its raw format, this data is very similar to a textbook). Do I need to find a way to reformat this large amount of text into a bunch of "query" and "output" pairs?

I have seen some LLMs described as "trained on Wikipedia," which suggests they were trained on that large chunk of text alone, without reformatting it into data pairs. Is there a way I can do that too? Or, since I want to target a chat model, do I have to convert the data into pairs that serve as examples of proper input and output?
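Edit: for anyone who finds this later, the "trained on Wikipedia" style works because the plain causal-LM objective only needs raw text; pairs matter for teaching chat behavior, not for absorbing content. Below is a minimal sketch of raw-text fine-tuning with Hugging Face Transformers. `textbook.txt` stands in for the proprietary data, and it assumes you have access to the Llama 2 weights:

```python
# Sketch: continued pretraining on raw text, no query/output pairs needed.
# Assumes `transformers` and `datasets` are installed.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# A plain text file: the causal-LM objective just predicts the next token.
dataset = load_dataset("text", data_files={"train": "textbook.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice, full fine-tuning of a 7B model won't fit on most single GPUs; LoRA/QLoRA via the `peft` library is the usual workaround, but the data preparation side is the same.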

[–] ArtifartX@alien.top 1 points 10 months ago (1 children)

What service do you use for GPU rental and inference for it?

[–] ArtifartX@alien.top 1 points 10 months ago

It is still relatively censored, but a great base to work with.