CocksuckerDynamo

joined 10 months ago
[–] CocksuckerDynamo@alien.top 1 points 9 months ago

I use runpod for everything I can't do locally and I've been very happy with it. I initially chose it just because it was one of the cheapest options, way cheaper than the big three clouds, but I've had a genuinely good experience.

The main downside of runpod that I know of is that you can only run a container image, not a full VM. For most use cases, though, that's really no big deal. If you want a generic sandbox for interactive experimentation, rather than to run an actual containerized app, you can just use the runpod pytorch image to get a starting point with CUDA, PyTorch, and some other common stuff installed, then SSH into it and do whatever. In other words, you don't have to bother with a more "normal" containerized deployment, writing a Dockerfile for something that runs unattended or exposes an API or whatever.
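(If you'd rather script the pod setup than click through the console, the runpod Python SDK can do it. Rough sketch below; I'm going from memory, so treat the image tag, GPU type string, and parameter names as placeholders and check the runpod docs for the current ones.)

```python
# Minimal sketch using the runpod Python SDK (pip install runpod).
# The image tag and GPU type string are illustrative placeholders;
# look up the current values in the RunPod console/docs.
import runpod

runpod.api_key = "YOUR_API_KEY"  # from the RunPod settings page

pod = runpod.create_pod(
    name="llm-sandbox",
    image_name="runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04",
    gpu_type_id="NVIDIA RTX A6000",
    gpu_count=1,
    volume_in_gb=100,          # persistent volume for model weights
    container_disk_in_gb=50,
    ports="22/tcp",            # expose SSH so you can just shell in
)

print(pod["id"])  # then SSH in using the connect info shown in the console
```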

Full disclosure: my recent experiments are all testing different setups for inference with continuous batching; I'm personally not doing training or fine-tuning. But as far as I can tell, runpod would work just as well for training and fine-tuning tasks.

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

The round-trip latency of an HTTP request (or gRPC, or whatever, pick your poison) is utterly insignificant compared to the time the inference itself takes, even for the smallest models with the fastest inference.
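To put rough numbers on it (these are made-up but plausible figures, not measurements):

```python
# Back-of-envelope comparison; all figures are illustrative assumptions.
network_rtt_s = 0.05      # ~50 ms round trip to a remote endpoint
tokens_generated = 300    # a typical chat-length response
time_per_token_s = 0.02   # 20 ms/token, i.e. a fast small model

inference_s = tokens_generated * time_per_token_s   # 6.0 s
rtt_share = network_rtt_s / (network_rtt_s + inference_s)

print(f"inference: {inference_s:.1f}s, RTT share of total: {rtt_share:.1%}")
# -> inference: 6.0s, RTT share of total: 0.8%
```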

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

What is different or better about what you're suggesting compared to the existing prominent solutions such as vLLM, TensorRT-LLM, etc.?

It's not clear to me exactly what the value proposition of your offering is.

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

preferably around 7 billion parameters

aim to produce flawless generations

LMAO goooooooooood fuckin luck buddy

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

Its useful for people who want to know the inference response time.

No, it's useful for people who want to know the inference response time at batch size 1, which is not something that prospective H200 buyers care about. Are you aware that business deployments for interactive use cases such as real-time chat generally use batching? Perhaps you're assuming request batching is just for offline / non-interactive use, but that isn't the case.
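If you haven't played with it, here's roughly what that looks like with vLLM. The model name and sampling settings are just placeholders; the point is that the engine schedules all of these requests together instead of one at a time:

```python
# Minimal vLLM sketch: the engine batches these requests together
# (continuous batching) rather than running them sequentially.
# Model name and sampling settings are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Customer question #{i}: how do I reset my password?" for i in range(32)]
outputs = llm.generate(prompts, params)  # served as a batch, not one by one

for out in outputs:
    print(out.outputs[0].text[:80])
```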

[–] CocksuckerDynamo@alien.top 1 points 10 months ago

Anyone has any solutions for these?

Use a high-quality model.

That means not 7B or 13B.

I know a lot of other people have already said this in the thread, but this keeps coming up in this sub so I'm just gonna say it too.

Bleeding-edge 7B and 13B models look good in benchmarks. Actually try using them and the first thing you'll realize is how poorly benchmark results reflect real-world performance. These models are dumb.

You can get started on runpod by depositing as little as $10, less than some fast food meals, so just take the plunge and find out for yourself. An RTX A6000 48GB is only $0.79 per hour, which buys quite a few hours of experimenting to feel the difference for yourself. With 48GB of VRAM you can run Q4_K_M quants of a 70B with full GPU offloading, or try Q5_K_M or even Q6 or Q8 if you tweak the number of layers you offload so it fits within 48GB (and still get generations fast enough for interactive chat).
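Once the pod is up, loading a 70B quant with llama-cpp-python looks something like this (the model path and layer count are placeholders; tune them to whatever quant you grab):

```python
# Sketch with llama-cpp-python: load a 70B GGUF quant and offload layers
# to the 48GB GPU. Path and n_gpu_layers are placeholders; tune n_gpu_layers
# per quant so it fits in VRAM (-1 means "offload everything").
from llama_cpp import Llama

llm = Llama(
    model_path="/workspace/models/llama-2-70b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,   # Q4_K_M fits fully in 48GB; lower this for Q5/Q6/Q8
    n_ctx=4096,
)

out = llm("Explain why the sky is blue in two sentences.", max_tokens=128)
print(out["choices"][0]["text"])
```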

The difference is just absolutely night and day. Not only do 70Bs rarely make the basic mistakes you're describing, they sometimes even surprise me in ways that feel "clever."

[–] CocksuckerDynamo@alien.top 1 points 10 months ago (2 children)

Its would be better if they provide single batch information for normal inference on fp8.

Better for whom? People who are just curious, or people who are actually going to consider buying H200s?

Who is buying a GPU that costs more than a new car and running it at batch size 1?
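Back-of-envelope, with numbers I'm pulling out of thin air just to illustrate the economics:

```python
# Rough, assumed numbers purely to illustrate the point; not benchmarks.
gpu_cost_per_hour = 4.00       # hypothetical hourly cost of a top-end GPU
tok_per_s_batch1 = 60          # single-stream decode speed (assumed)
tok_per_s_batch32 = 1200       # aggregate throughput with batching (assumed)

def usd_per_million_tokens(tok_per_s: float) -> float:
    tokens_per_hour = tok_per_s * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(f"batch 1:  ${usd_per_million_tokens(tok_per_s_batch1):.2f} / M tokens")
print(f"batch 32: ${usd_per_million_tokens(tok_per_s_batch32):.2f} / M tokens")
# batch 1:  $18.52 / M tokens
# batch 32: $0.93 / M tokens
```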

[–] CocksuckerDynamo@alien.top 1 points 10 months ago (1 children)

This is sort of like what you're talking about, and pretty interesting IMO:

https://www.youtube.com/watch?v=24O1KcIO3FM

https://arxiv.org/abs/2309.05463

[–] CocksuckerDynamo@alien.top 1 points 10 months ago (1 children)

If you take a really sober look at the numbers, how does running your own system make sense over renting hardware at runpod or a similar service?

To me it doesn't. I use runpod; I'm just on this sub because it's the best place I know to keep up with the latest news in open-source / self-hosted LLM stuff. I'm not literally running anything "locally."
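The toy math I run for myself looks something like this (all the numbers are assumptions, plug in your own, and it ignores electricity, resale value, etc.):

```python
# Toy rent-vs-buy break-even with assumed numbers; adjust to your situation.
purchase_price = 4000.0   # e.g. a used 48GB workstation GPU plus a box to put it in
rental_rate = 0.79        # $/hour for an RTX A6000 class rental
hours_per_week = 10       # how much you actually use it

break_even_hours = purchase_price / rental_rate
weeks = break_even_hours / hours_per_week

print(f"break-even after ~{break_even_hours:.0f} rented hours "
      f"(~{weeks / 52:.1f} years at {hours_per_week} h/week)")
# -> break-even after ~5063 rented hours (~9.7 years at 10 h/week)
```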

As far as I can tell there are lots of others like me on this sub. Of course many people here do run on their own hardware, but it seems to me the user base is pretty split. I wonder what a poll would find.