this post was submitted on 13 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
The simple answer is that you can't parallelize LLM work.
It generates answers word by word (or token by token, to be precise), so it's impossible to split the task into 10, 100, or 1,000 pieces that you could send out to a distributed network.
Each word in the LLM's answer also serves as part of the input for calculating the next one, so LLMs are actually counter-distributed systems.
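Roughly what that dependency looks like in code (a minimal sketch; `model` is a hypothetical stand-in for any decoder-only LLM, not a real API):

```python
# Minimal sketch of autoregressive decoding. `model` is a hypothetical stand-in,
# not a real library API. The point: step N needs the token produced at step N-1,
# so the generation loop itself cannot be fanned out to independent workers.
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)          # forward pass over the whole context so far
        next_token = int(logits[-1].argmax())   # greedy pick of the next token
        tokens.append(next_token)               # feeds back in as input for the next step
    return tokens
```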
You’d better tell the GPU manufacturers that LLM workloads can’t be parallelized.
The point of Transformers is that the matrix operations can be parallelized, unlike in standard RNNs.
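To make that concrete, here's a rough NumPy sketch (shapes simplified, no masking or softmax): an RNN has to walk the sequence step by step, while attention scores every pair of positions in one matrix product.

```python
import numpy as np

seq_len, d = 8, 16
x = np.random.randn(seq_len, d)   # toy sequence of hidden vectors

# RNN-style: each position depends on the previous hidden state -> sequential loop.
W = np.random.randn(d, d)
h = np.zeros(d)
for t in range(seq_len):
    h = np.tanh(x[t] @ W + h)

# Attention-style: scores for every pair of positions in one matrix product -> parallel.
Q, K = x @ np.random.randn(d, d), x @ np.random.randn(d, d)
scores = Q @ K.T / np.sqrt(d)     # (seq_len, seq_len) computed at once
```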
The issue with distributing those parallel operations is that for every partition of the workload, you introduce latency.
If you offload a layer at a time, you introduce the latency of the slowest worker plus the network latency, plus the latency of combining the results back into one set.
If you partition at a finer grain, e.g. parts of a layer, you add even more latency.
Per-layer latency can go from about 1 ms in a monolithic LLM to more than 1 s. Multiplied across every layer and every generated token, that means response times measured in minutes rather than seconds.
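A back-of-envelope version of that arithmetic (the 1 ms and 1 s per-layer figures are from above; the layer and token counts are just assumptions for illustration):

```python
layers = 32        # assumed: roughly a 7B-class model
tokens = 100       # assumed: length of a typical answer

per_layer_local = 0.001   # ~1 ms per layer on one machine
per_layer_dist  = 1.0     # >1 s per layer once every layer crosses the network

local_time = per_layer_local * layers * tokens   # ~3 seconds
dist_time  = per_layer_dist  * layers * tokens   # ~3200 seconds

print(f"local: {local_time:.1f} s, distributed: {dist_time / 60:.0f} min")
```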