entropyvsenergy

joined 2 years ago

[P][D] A100 is much slower than expected at low batch size for text generation in c/machinelearning@academy.garden

[–] entropyvsenergy@alien.top 1 points 2 years ago

Replicas shouldn't affect anything except memory pressure on the GPU and maybe a little overhead if you're sending inference requests one at a time.

The replicas allow you to run multiple inference requests simultaneously.

If you're using Python, they are threads, not "processes", though in Unix-speak they're separate processes.

So once you have your inference container up and running, the latency per request should be relatively constant and shouldn't vary with the number of replicas.

With sufficient memory bandwidth, more replicas (up to what the GPU memory can support) should increase throughput in proportion to the number of replicas.

If you're memory-bandwidth limited, you should see a sublinear increase in throughput.
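
To put rough numbers on that, here's a toy back-of-the-envelope model, not a benchmark. It assumes each replica handles one request at a time with a fixed latency, and that every request has to move a fixed number of bytes through GPU memory. The function name `estimated_throughput` and all the numbers are made up for illustration. A real GPU will bend over gradually rather than hit a hard ceiling, so treat the flat cap as the simplest possible stand-in for "sublinear".

```python
def estimated_throughput(replicas, latency_s, bytes_per_request, bandwidth_bytes_per_s):
    """Toy estimate of requests/second for a number of replicas.

    Hypothetical model: throughput grows linearly with replicas until the
    total memory traffic would exceed the available memory bandwidth.
    """
    if replicas < 1 or latency_s <= 0 or bytes_per_request <= 0:
        raise ValueError("replicas, latency_s and bytes_per_request must be positive")

    # Each replica finishes one request every latency_s seconds.
    linear_scaling = replicas / latency_s

    # Memory bandwidth caps how many requests per second can be served at all.
    bandwidth_ceiling = bandwidth_bytes_per_s / bytes_per_request

    return min(linear_scaling, bandwidth_ceiling)


if __name__ == "__main__":
    # Illustrative values only: 0.5 s per request, 100 GB moved per request,
    # 1 TB/s of memory bandwidth.
    for n in (1, 2, 4, 8):
        rps = estimated_throughput(
            replicas=n,
            latency_s=0.5,
            bytes_per_request=100 * 10**9,
            bandwidth_bytes_per_s=10**12,
        )
        print(f"{n} replicas: about {rps:.1f} requests/s")
```

With those made-up inputs, the estimate doubles from 1 to 2 to 4 replicas and then stops scaling at 8, because the bandwidth ceiling is lower than what 8 replicas could otherwise deliver.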
