Machine Learning

[P][D] A100 is much slower than expected at low batch size for text generation (alien.top)

submitted 2 years ago by currytrash97@alien.top to c/machinelearning@academy.garden

I’m working on a project to generate text from a 1.2B-parameter, full-precision LLM (about 5 GB of weights).

Unfortunately, I’m limited in the infrastructure I can use to deploy this model. Batch inference is not supported. What I can do is deploy copies of the model on a single A100, one copy per process, with up to 9 processes supported (these are called “replicas”). I understand that this makes little sense given that my model is memory bound and each process will fight for memory bandwidth to read in the same weights, but I can’t change that for now.

My average input and output lengths are roughly 1000 tokens each. I estimate the KV cache at roughly 400 kB per token at full precision.
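For readers who want to check the 400 kB figure, here is a minimal sketch of the usual per-token KV cache formula. The layer count and hidden size below are hypothetical values typical of a GPT-style model of about this size; the post does not give the real architecture. The formula also assumes standard multi-head attention (no grouped-query or multi-query attention).

```python
def kv_cache_bytes_per_token(num_layers: int, hidden_size: int,
                             bytes_per_value: int = 4) -> int:
    """Bytes of KV cache added per token.

    Each layer stores one key vector and one value vector of length
    hidden_size; bytes_per_value=4 corresponds to fp32.
    """
    return 2 * num_layers * hidden_size * bytes_per_value


# Hypothetical shape, NOT taken from the post: 24 layers, hidden size 2048.
per_token = kv_cache_bytes_per_token(num_layers=24, hidden_size=2048)

# About 1000 input tokens plus 1000 output tokens at the end of generation.
per_request = per_token * 2000

print(f"{per_token / 1e3:.0f} kB per token")      # 393 kB per token
print(f"{per_request / 1e9:.2f} GB per request")  # 0.79 GB per request
```

With these assumed dimensions the result lands close to the 400 kB per token quoted above, and a full 2000-token context comes to a bit under 1 GB per request.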

I have benchmarks of the latency of the model using various “replicas” as described above. I wanted to compare this to the theoretical performance of the A100. For my use case time to first token is negligible (<200ms), and generation is memory bound.

I find that with 5 or more replicas, the math works out and my model is roughly as fast as I expect. For example, with 1000 output tokens and 6 replicas, it’s as if I’m generating a batch of 6 requests from a 30 GB model plus 5 GB of KV cache. At a memory bandwidth of around 1-1.3 TB/s, that translates to ~30 s per request, which is not far from what I see. The same goes for the other replica counts: 5, 7, 8 and 9.

However, when I run with a single replica, I expect generation to hover around the 5-6s mark on average. Instead, I see > 20s. I need to add 4 more replicas before the number starts to make sense. It almost seems like the model takes up too little memory to be allocated the entire memory bandwidth.
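The arithmetic behind both estimates can be sketched as follows. It assumes that every decode step reads each replica's weights and KV cache from memory exactly once, and it uses the full end-of-generation cache (about 0.8 GB per request, from 400 kB per token times 2000 tokens) as an upper bound. The function name and the 1.15 TB/s midpoint are illustrative choices, not values from the post.

```python
def decode_seconds(replicas: int,
                   output_tokens: int = 1000,
                   weight_gb: float = 5.0,
                   kv_gb_per_request: float = 0.8,
                   bandwidth_tb_s: float = 1.15) -> float:
    """Memory-bound estimate of the time to generate output_tokens.

    Each decode step is assumed to stream every replica's weights and
    KV cache once, so time = total bytes read / memory bandwidth.
    """
    gb_per_step = replicas * (weight_gb + kv_gb_per_request)
    return output_tokens * gb_per_step / (bandwidth_tb_s * 1000.0)


for n in (1, 6):
    fast = decode_seconds(n, bandwidth_tb_s=1.3)
    slow = decode_seconds(n, bandwidth_tb_s=1.0)
    print(f"{n} replica(s): {fast:.1f}-{slow:.1f} s")
# 1 replica(s): 4.5-5.8 s
# 6 replica(s): 26.8-34.8 s
```

Under these assumptions the 6-replica estimate brackets the ~30 s that is observed, while the single-replica estimate is roughly four times lower than the >20 s measurement.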

Does anyone know where this extra latency could be coming from? Does a model have to occupy a certain amount of memory before an A100 can reach its full memory bandwidth?

[–] Smallpaul@alien.top 1 points 2 years ago

/r/localllama

[–] tornado28@alien.top 1 points 2 years ago

A lot of people use CPU for inference. I'm guessing it's because of the batching issue that you describe.

[–] entropyvsenergy@alien.top 1 points 2 years ago

Replicas shouldn't affect anything except memory pressure on the GPU and maybe a little overhead if you're sending inference requests one at a time.

The replicas allow you to run multiple inference requests simultaneously.

If you're using Python, they are threads, not "processes", though in Unix-speak they're separate processes.

So once you have your inference container up and running, the latency per request should be relatively constant and shouldn't vary with the number of replicas.

With sufficient memory bandwidth, more replicas (up to what the GPU memory can support) should increase throughput in proportion to the number of replicas.

If you're memory bandwidth limited, you should see a sublinear increase in throughput.

[–] jorgemf@alien.top 1 points 2 years ago

Think about this: 9 women take 9 months to have 9 babies, so 1 woman should take 1 month. That is basically what you are saying.

What is happening is that one request takes >20 seconds while your GPU utilization is far from 100%. You are not using the full parallelism of the GPU, because that is how the model works: each token has to be generated after the previous one. There is very little you can do there. But when you have 6 requests at the same time, the GPU can run those requests in parallel and use 100% of its resources.
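One way to picture this is a toy model in which every decode step has a fixed latency floor (kernel launches, framework overhead, synchronization) that memory bandwidth cannot shrink. The 20 ms floor below is simply the observed >20 s divided by 1000 tokens, and the idea that replicas overlap perfectly under that floor is an assumption, not something measured in this thread.

```python
def step_ms(replicas: int,
            floor_ms: float = 20.0,
            gb_per_replica: float = 5.8,
            bandwidth_tb_s: float = 1.15) -> float:
    """Per-token step time: the larger of a fixed floor and the bandwidth time.

    GB divided by TB/s gives milliseconds, so no unit conversion is needed.
    """
    bandwidth_ms = replicas * gb_per_replica / bandwidth_tb_s
    return max(floor_ms, bandwidth_ms)


for n in range(1, 10):
    latency_s = step_ms(n) * 1000 / 1000.0   # 1000 tokens, ms -> s
    throughput = n / latency_s               # requests per second
    print(f"{n} replicas: {latency_s:5.1f} s/request, {throughput:.3f} req/s")
```

In this sketch, latency stays flat at about 20 s for 1 to 3 replicas while throughput grows linearly. From roughly 4 replicas onward the bandwidth term takes over, latency grows with the replica count, and throughput levels off, which matches the pattern described in the post.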

[–] TheGuywithTehHat@alien.top 1 points 2 years ago

It is very normal/usual/expected that a GPU won't run at 100% unless you provide it with enough parallel computation to perform (either extremely large layers or a big batch of samples). There's so much overhead in so many places that it's impossible to be efficient at a small scale, and there's no one thing we can point to and say "that's why it's slow". My best advice is to look into the optimization possibilities provided by the model/framework/version you're using. I work in PyTorch and use things like torch.jit.script, torch.jit.trace, torch.compile, torch.autocast, etc.

[–] PM_ME_YOUR_DOOTFILES@alien.top 1 points 2 years ago

Do you have more specifics about the trend from 2-4 replicas?

If we see a trend where 2-4 replicas each take >20s without improvement, then that might be an odd programming glitch that can't be explained by math.

If we see a gradual trend, then maybe the math is not accounting for something at lower replicas.
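To compare replica counts on the same footing, one option is to turn each measured latency into the memory bandwidth it implies. The sketch below reuses the post's figures (about 5 GB of weights plus up to 0.8 GB of KV cache per replica, 1000 output tokens). Only the 1-replica and 6-replica latencies come from the post; the 2-4 replica measurements are not given and would need to be added to the dictionary.

```python
def effective_bandwidth_tb_s(replicas: int,
                             latency_s: float,
                             output_tokens: int = 1000,
                             gb_per_replica: float = 5.8) -> float:
    """Bandwidth implied by a measured latency, assuming every decode step
    reads each replica's weights and KV cache once."""
    total_gb = replicas * gb_per_replica * output_tokens
    return total_gb / latency_s / 1000.0


# replica count -> measured seconds per request.
# 1 and 6 are from the post (">20 s" and "~30 s"); fill in 2-4 when available.
measured = {1: 20.0, 6: 30.0}

for n, seconds in sorted(measured.items()):
    bw = effective_bandwidth_tb_s(n, seconds)
    print(f"{n} replica(s): {bw:.2f} TB/s effective")
# 1 replica(s): 0.29 TB/s effective
# 6 replica(s): 1.16 TB/s effective
```

If the implied bandwidth rises roughly in proportion to the replica count across 2-4 replicas, that points to a fixed per-step cost; an irregular pattern would suggest something else is going on.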