I'm still shocked at how good Mistral is. I wrote it off as a meme model for far too long just because of how overstated the praise seemed to be. But the thing really is amazing for its size.
Just one, and assuming no extra training? I think I'd go with Capybara Tess Yi 34b. In part because of how well it seems to follow instructions. But also because it has the broadest scope of knowledge I've seen in any of the models so far. A lot of models tap out once you get past what the first paragraph of a Wikipedia article would give you. I get that feeling far less often with Capy so far.
I'd like to know too if there's one for exactly $1. Even half a buck or so difference builds up over time.
But RunPod's close at least, at $1.69/hour.
Oh yeah, you're absolutely going to want to go with a llama2 model over the options you've looked at already. The only one of them I have direct experience with is GPT-2. But even the worst llama models I've seen feel like night and day in comparison to GPT-2.
Personally, I think you'd be best off combining fine-tuning on your own data with RAG to get as far away from hallucinations as possible. Not everyone agrees, but I think both in tandem is the way to go.
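If it helps, here's a rough sketch of the retrieval half using sentence-transformers. The embedding model name, the chunks, and the prompt wording are all placeholders; the idea is just that the fine-tuned model only ever answers over chunks pulled from your own documents.

```python
# Rough sketch of the retrieval half of a RAG setup: embed document chunks once,
# then pull the closest chunks into the prompt at question time.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # example model, swap for anything

# Pretend these are chunks split out of your own documents.
chunks = [
    "Chunk of internal documentation...",
    "Another chunk covering a different topic...",
]
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True)

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question."""
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, chunk_embeddings, top_k=top_k)[0]
    return [chunks[h["corpus_id"]] for h in hits]

# The retrieved chunks get pasted into the fine-tuned model's prompt,
# so it answers from your data instead of guessing.
context = "\n".join(retrieve("What does the internal doc say about X?"))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
```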
I think the language is going to be the larger issue. This is just conjecture on my part. But I suspect that a powerful model that's only trained on 'your' Dutch data and is otherwise focused on English would probably end up performing worse on Dutch prompts than a less capable model that was trained with large amounts of miscellaneous Dutch language data in addition to your own.
I remember this Dutch 7b model was released fairly recently. It was created from a base llama2 chat model, which means it probably also has a lot of the more "corporate style" tone that most people here are trying to avoid. But given the context, I think that might actually be an advantage for you. Being safe for work/school is probably a bit of a priority.
7b also has the advantage of being very light on resource usage. And I mean very, very light. I've been using a 7b model for some automated tasks on spare hardware that doesn't even have a GPU. It's running entirely on an ancient CPU. And while slow, it's not unbearably so.
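For reference, CPU-only inference with a quantized 7b is about this much code with llama-cpp-python. The GGUF filename and thread count are just examples.

```python
# Minimal CPU-only inference with llama-cpp-python and a quantized GGUF file.
from llama_cpp import Llama

llm = Llama(
    model_path="./dolphin-2.0-mistral-7b.Q4_K_M.gguf",  # example filename
    n_ctx=2048,     # context window
    n_threads=4,    # tune to however many cores the old box has
)

out = llm(
    "Summarize the following text:\n...",
    max_tokens=256,
    temperature=0.2,
)
print(out["choices"][0]["text"])
```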
I'm really late on this one, but dolphin 2.0 mistral 7b. I did a little extra training on it for some automation and the thing's ridiculously solid, fast, and light on resource usage. I'm still cleaning up the output a bit after it's chugging away at night. But to a pretty minor degree.
Though if failures count, then Yi 34b's up there in terms of usage this week too, as I fail a million times over just to train a single, simple, usable LoRA for it.
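For anyone curious, what I keep failing with is roughly the standard peft recipe below. The 4-bit loading, the hyperparameters, and the target_modules list are educated guesses for a Llama-style architecture like Yi, not something I can vouch for.

```python
# Bare-bones LoRA setup with transformers + peft. Assumes bitsandbytes is
# installed and there's enough VRAM for the 4-bit weights of a 34b model.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "01-ai/Yi-34B",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)

# Adapter config; r/alpha/dropout and the target projection layers are guesses.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # sanity check before burning hours on training
```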
Holy shit. I've been holding off on looking too deeply into LLaVA given how many things are always popping up. But that's just too cool to pass up on. The number of potential applications, if it works as well as I'm hoping, is wild.
I feel like I just inadvertently sold my soul for access to an 8b model with all that agreement clicking.
They told me the grammar police would come for me one day. Why wasn't I more careful with my interrobanging‽
Dang, after that 34b drought it's like suddenly stumbling onto the Great Lakes right now.
The choice of question in there is particularly insightful. All AI-related tasks should focus on spiders.
Dang, given that I was already impressed with a model trained on half the tokens, I suspect I will be impressed!
What I have so far is such a hacky mess that I'm nowhere near comfortable uploading yet. But I've been trying to put together an automated system to do something similar. Basically, toss books and journal articles into a folder by subject, get dataset out in the morning.
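The outer loop is nothing fancy, roughly the shape below. Everything here is a simplification; build_records() is a stand-in for the actual model call and prompt handling, which is where all the real mess lives.

```python
# Rough shape of the nightly job: walk a folder of source texts, chunk each file,
# ask a local model to turn each chunk into alpaca-style records, and dump
# everything into one JSON dataset.
import json
from pathlib import Path

SOURCE_DIR = Path("./inbox/history")      # one folder per subject
OUTPUT = Path("./datasets/history.json")

def chunk(text: str, size: int = 4000) -> list[str]:
    """Naive fixed-size chunking; real splitting would respect paragraphs."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_records(chunk_text: str) -> list[dict]:
    """Placeholder: send the chunk to the local model and parse its JSON reply."""
    return []  # replace with the actual model call

records = []
for path in SOURCE_DIR.glob("*.txt"):
    for piece in chunk(path.read_text(errors="ignore")):
        records.extend(build_records(piece))

OUTPUT.parent.mkdir(parents=True, exist_ok=True)
OUTPUT.write_text(json.dumps(records, indent=2))
```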
The bad news is that I didn't have much luck with any of the local models in their default state. But that was months ago so things might have improved. And I only tested what I thought were promising models. Wouldn't shock me to find that I just missed something that would have worked right out of the box. Also wouldn't shock me if someone hopped on and just pointed you to a model that works perfectly for this.
That said, here's what's mostly worked for me. I just made a dataset with 100 to 150 or so examples for the subjects I was working with. Basically a dataset filled with examples of how to make a dataset.
Sounds mind-numbingly tedious, I know. But a handful a day and it's over pretty quick. I made a point in particular to include a lot of examples where the source data was messed up. Poorly converted PDFs, text with image file names scattered through it, page numbers littering the text, etc. That way I was giving it the worst it might encounter as well as the more optimal examples.
Made the instruction prompt something like "Format the following text about x into alpaca formatted json. Here's the text:" followed by the text. Then put the json data I'd want in the output field. Then I did some additional training with that dataset on a few models. That was enough to vastly improve the results.
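So a single training example ends up shaped roughly like this. The subject, the deliberately messy source text, and the "ideal" output are all invented for illustration; the point is just the layout, with the JSON I want back sitting as a string in the output field.

```python
import json

# One made-up training example in the shape described above. The source text is
# intentionally ugly (stray page number, typo) to match the messy inputs the
# model will actually see.
example = {
    "instruction": (
        "Format the following text about Roman history into alpaca formatted json. "
        "Here's the text: p.42 The Punic Wars werre a series of three wars fought "
        "between Rome and Carthage between 264 BC and 146 BC."
    ),
    "input": "",
    "output": json.dumps([
        {
            "instruction": "Who fought in the Punic Wars, and when?",
            "input": "",
            "output": "The Punic Wars were a series of three wars fought between Rome and Carthage between 264 BC and 146 BC.",
        }
    ], indent=2),
}

print(json.dumps(example, indent=2))
```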
Up until very recently the best results I got with that method were with dolphin 2.0 mistral 7b and Microsoft's Orca-2 13b. Not perfect, but I'm hand-editing the generated dataset for only about ten minutes per textbook. Sometimes less, sometimes more.
The big shock was Capybara Tess Yi 34b 200k though. I only ran a couple of tests, so this might be a fluke, but after training with the same dataset I was getting perfect results. Something I'd never seen before with anything other than GPT-4. Though I'm finishing up a big batch right now with the 13b model, so I haven't had a chance to swap it in through the automation and see if it lives up to that outside the quick test run. It's worth noting too that I never tried dataset generation with Capybara Tess Yi 34b 200k in its normal state, without my extra training applied. It might very well be just as perfect in its default state. So if you're testing models, that's the one I'd point to as a possible solution that wouldn't require any more work.
So yeah, in short my advice is to just make a dataset with about 100 to 150 examples of how to make a dataset. Going by my results that should be enough to get you pretty close to what you're looking for.