this post was submitted on 20 Nov 2023
LocalLLaMA
Community to discuss about Llama, the family of large language models created by Meta AI.
First, a question: what sort of performance do you get with ctransformers compared to AutoAWQ or TensorRT-LLM? I thought ggml-based stuff would not scale well if you are doing batches of 1000 generations at a time, and that GPU-first libraries would outperform it.
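To make the batching intuition concrete, here is a toy cost model (all numbers are made up for illustration, not measured from any of these libraries). It assumes a GPU-first stack pays roughly the same cost per decoding step whether the batch holds 1 or 64 sequences, while a sequential backend processes one request at a time:

```python
import math

def tokens_per_second(n_requests, tokens_each, step_ms, batch_size):
    """Toy throughput model: one decoding step costs step_ms
    regardless of how many sequences are in the batch
    (the GPU-parallel assumption). Ignores prefill, padding,
    and memory limits on batch size."""
    steps = math.ceil(n_requests / batch_size) * tokens_each
    total_tokens = n_requests * tokens_each
    return total_tokens / (steps * step_ms / 1000.0)

# 1000 requests of 200 tokens each, 20 ms per decoding step
seq = tokens_per_second(1000, 200, step_ms=20, batch_size=1)   # 50 tok/s
bat = tokens_per_second(1000, 200, step_ms=20, batch_size=64)  # 3125 tok/s
print(f"batch=1: {seq:.0f} tok/s, batch=64: {bat:.0f} tok/s")
```

Under these (crude) assumptions, the batched stack wins by the batch size, which is why continuous-batching servers dominate the 1000-generations-at-a-time use case.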
Now to tutorials. I feel like most of the community does not have more than 4 GPUs, or a need to scale workloads that far, so I am not sure how useful that would be for us/them. You could write a tutorial about splitting training among 10-20 consumer GPUs where each GPU has less VRAM than needed for micro_batch_size. Then the same for inference: sharding models and running parts of them across multiple GPUs. There was interest in builds with 10x P40 and 8x 3080 cards, so while maybe not the most practical, doing some fun experiments with that kind of setup might be interesting to the community. Basically in-depth DeepSpeed, FSDP, and 3D parallelism.
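A back-of-the-envelope for why full sharding makes those multi-card builds viable. With ZeRO-3/FSDP-style full sharding and standard mixed-precision Adam, model state is about 16 bytes per parameter (2 for fp16 params, 2 for fp16 grads, 12 for the fp32 master copy plus momentum and variance), split across all GPUs:

```python
def zero3_state_gib(n_params, n_gpus):
    """Per-GPU model-state memory under ZeRO-3/FSDP full sharding
    with mixed-precision Adam: 16 bytes/param total, divided
    evenly across GPUs. Activations and framework overhead
    come on top and are not counted here."""
    return 16 * n_params / n_gpus / 2**30

# a 13B model on 10 x 24 GB P40s: model state alone is ~19.4 GiB/card,
# so it fits, but activations force a small micro_batch_size
print(f"{zero3_state_gib(13e9, 10):.1f} GiB per GPU")
```

This is exactly the regime where each card has less VRAM than an unsharded micro-batch would need, so a tutorial walking through the real activation-memory side of this would be valuable.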
If you are not committed to writing about scaling to 200 GPUs to push your service, you could write an in-depth guide to the ins and outs of QLoRA, with experiments on various parameters: memory scaling, rank scaling, the relation of training speed to GPU core performance and memory speed, and applicable scaling laws. All of it with the most popular tools used in the community as well as the lesser-known ones. For the popular ones we have axolotl and training in the oobabooga webui; less popular are tools like H2O LLM Studio. There are probably more that I don't know about. Quite a lot of people here ask basic questions about QLoRA. While I like answering those, having a good comprehensive tutorial would be nice.