this post was submitted on 20 Nov 2023

I'm trying to perfect a dev tool that lets Python developers easily scale their code to thousands of cloud resources using only one line of code.

I want to get some project ideas so I can build useful tutorials for running inference on and fine-tuning open source LLMs.

A few weeks back I created a tutorial teaching people to massively parallelize inference with Mistral-7B. I was able to deliver a ton of value to a select few people and it helped me better understand the flaws with my tool.

Anyways, I want to open it up to the community before I decide which tutorials to prioritize. Please drop any project/tutorial ideas, and if you think someone's idea is good, please upvote it (so I know you think it would be valuable).

[–] FullOf_Bad_Ideas@alien.top 1 points 10 months ago

First, a question: what sort of performance can you get with ctransformers compared to AutoAWQ or TensorRT-LLM? I thought that GGML-based stuff would not scale well if you are doing batches of 1000 generations at a time, and that GPU-first libraries would outperform it.

Now to tutorials. I feel like most of the community does not have more than 4 GPUs, or a need to scale workloads that far, so I am not sure how useful that would be for us/them. You could write a tutorial about splitting training among 10-20 commercial GPUs where each GPU has less VRAM than needed for micro_batch_size. Then the same for inference: sharding models and running parts of them on various GPUs. There was interest in builds of 10x P40 and 8x 3080 cards, so while maybe not the most practical, doing some fun experiments with this kind of setup might be interesting to the community. Basically, in-depth DeepSpeed, FSDP, 3D parallelism.
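
To make the VRAM point concrete, here is a back-of-the-envelope sketch of how much model state each GPU would hold if weights, gradients, and optimizer states were all sharded evenly (ZeRO stage 3 / FSDP full-shard style). It assumes mixed-precision Adam at the commonly quoted 16 bytes per parameter, and it ignores activations, communication buffers, and fragmentation, so real usage will be higher. The 7B parameter count is just a round illustrative number, and the function name is made up for this example.

```python
def sharded_state_bytes_per_gpu(n_params: int, n_gpus: int) -> float:
    """Approximate bytes of model state per GPU when everything is sharded evenly.

    Assumption: mixed-precision Adam, i.e. per parameter
      2 bytes fp16 weights + 2 bytes fp16 gradients
      + 12 bytes fp32 optimizer state (master weights, momentum, variance).
    Activations and framework overhead are NOT included.
    """
    bytes_per_param = 2 + 2 + 12
    return n_params * bytes_per_param / n_gpus


GIB = 1024 ** 3
N_PARAMS = 7_000_000_000  # illustrative round number, not a specific model

for n_gpus in (1, 4, 10, 20):
    per_gpu_gib = sharded_state_bytes_per_gpu(N_PARAMS, n_gpus) / GIB
    print(f"{n_gpus:>2} GPUs: ~{per_gpu_gib:.1f} GiB of model state per GPU")
```

Comparing the printed numbers against the VRAM of a given card gives a first guess at how many GPUs a full fine-tune would need before activations are even counted.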

If you are not committed to writing about scaling to 200 GPUs to push your service, you should write up the in-depth ins and outs of QLoRA, with experiments on various parameters, memory scaling, rank scaling, the relation of training speed to GPU core performance and memory speed, and applicable scaling laws. All of it with the most popular tools used in the community as well as the lesser-known ones. For the popular ones we have axolotl and training in the oobabooga webui; less popular are those like h2o studio. There are probably more that I don't know. Quite a lot of people here are asking basic questions about QLoRA. While I like answering those, having a good comprehensive tutorial would be nice.
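
As a starting point for the rank scaling part, a minimal sketch of how the number of trainable LoRA parameters grows with rank for a single linear layer. A LoRA adapter on a `d_in x d_out` weight adds two low-rank matrices (`d_in x r` and `r x d_out`), so the count is `r * (d_in + d_out)`. The hidden size of 4096 is an assumed example value, and the function name is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters a LoRA adapter adds to one d_in x d_out linear layer.

    The adapter is two matrices, A (d_in x rank) and B (rank x d_out),
    so the total is rank * (d_in + d_out).
    """
    return rank * (d_in + d_out)


HIDDEN = 4096  # assumed hidden size, for illustration only
full_layer = HIDDEN * HIDDEN  # parameters in the frozen base weight

for rank in (8, 16, 64, 128):
    added = lora_trainable_params(HIDDEN, HIDDEN, rank)
    share = 100 * added / full_layer
    print(f"rank {rank:>3}: {added:>9,} trainable params ({share:.2f}% of the base layer)")
```

The count grows linearly with rank, which is one reason a tutorial could sweep rank and measure how memory and quality actually respond instead of assuming it.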