vicks9880

joined 1 year ago
[–] vicks9880@alien.top 1 points 11 months ago (1 children)

These are just names. LLaMA originally stood for Large Language Model Meta AI, but it also happens to be the name of a South American animal, so the creative people of the internet who downloaded the weights, fine-tuned them, and published their own models named them after other animals of the same family: Alpaca, Vicuna, Dalai (Lama), etc.

The more important information in these model names is in the suffixes: the parameter count (13B), the quantization format (GGUF/GGML), the fine-tuning technique (LoRA), the quantization level (Q6_K_M), and so on.
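For illustration, a typical filename can be pulled apart programmatically. This is just a sketch: the filename below is made up, but it follows the usual community convention.

```python
import re

# Hypothetical filename following the common community convention:
# <base model>-<param count>-<variant>.<quant level>.<format>
name = "llama-2-13b-chat.Q6_K_M.gguf"

m = re.match(
    r"(?P<model>.+?)-(?P<params>\d+[bB])-?(?P<variant>\w*)"
    r"\.(?P<quant>Q\d_\w+)\.(?P<fmt>gguf|ggml)",
    name,
)
if m:
    print(m.group("model"))   # llama-2 -> base model family
    print(m.group("params"))  # 13b     -> parameter count
    print(m.group("quant"))   # Q6_K_M  -> quantization level
    print(m.group("fmt"))     # gguf    -> file format
```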

[–] vicks9880@alien.top 1 points 11 months ago

Talk about publishing: everyone and their mom is creating blogs and posting them everywhere possible. These people just read the quick-start page of any new library and flood the internet with mediocre content. I'm tired of digging through hundreds of such articles to find one useful one whenever I want to do something that's just one step beyond hello world.

[–] vicks9880@alien.top 1 points 11 months ago

I have 50GB free, but I believe they reduced it to 20GB for new users. Still plenty for the application data.

 

I have been self-hosting for quite a while, and recently I had to restart my server. After restarting, for some reason Portainer didn't recognise any of my stacks: they all became "limited" and I lost the docker-compose files for all my services. So I decided to create this Ansible script, which keeps all my configuration in one place in case I ever have to reinstall the server.

The script automates my entire setup, which used to take hours of work; now I can have everything up and running in minutes in case of failure.
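To give an idea of the shape of it, here's a simplified sketch (not the actual tasks from the repo; the hosts group, paths, and service names are placeholders):

```yaml
# Simplified sketch; the real playbook in the repo does more.
- name: Deploy docker compose stacks
  hosts: homelab            # placeholder inventory group
  tasks:
    - name: Copy each stack's compose file to the server
      ansible.builtin.copy:
        src: "stacks/{{ item }}/docker-compose.yml"
        dest: "/opt/stacks/{{ item }}/docker-compose.yml"
      loop: [portainer, jellyfin]   # example services

    - name: Bring each stack up
      ansible.builtin.command:
        cmd: docker compose up -d
        chdir: "/opt/stacks/{{ item }}"
      loop: [portainer, jellyfin]
```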

I am currently working on backing up the configuration files for all my Docker services to Mega (free plan) using MegaCMD's backup feature, which keeps rotating backups of the files (the last 10 days, in my case).
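The command looks roughly like this (flag names from memory, so double-check `mega-backup --help`; the paths and schedule string are just examples):

```sh
# Back up the local config dir to Mega every day at 04:00,
# keeping only the 10 most recent snapshots.
mega-backup /opt/stacks /Backups/docker-configs \
  --period="0 0 4 * * *" --num-backups=10
```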

I'd like to share it with everyone, and I hope to get some improvements/ideas:

https://github.com/vikramsoni2/homelab_infra

[–] vicks9880@alien.top 1 points 11 months ago

vLLM is performing well so far, better than expected. I'm using distributed GPUs and working on scaling the GPUs based on load; I still need to figure out the right metric on which to trigger scaling up/down.
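For reference, the multi-GPU part is just vLLM's tensor parallelism. A minimal sketch, where the model name and GPU count are placeholders:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-13b-chat-hf",  # placeholder model
    tensor_parallel_size=2,                  # shard the model across 2 GPUs
)
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For the scaling trigger, queue depth (waiting vs. running requests) seems like the most natural signal, but I haven't settled on it yet.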

 

I'm currently exploring the deployment of Llama models in a production environment and I'm keen to hear from anyone who has ventured into this territory. My primary concern is managing multiple concurrent users while optimizing resources effectively.

While there are numerous methods to tweak Llama for testing with a single user, scaling up poses its own set of challenges, and I'm particularly interested in learning how others have approached this problem. I'm also curious about projects like vLLM and Hugging Face TGI for faster inference. Has anyone had experience with these, and how have they contributed to your scaling efforts?
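To make the question concrete, the kind of setup I'm picturing is vLLM's OpenAI-compatible server with clients hitting it over HTTP. A sketch, with the model name, host, and port as placeholders:

```python
# Assumes a server started with something like:
#   python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-2-13b-chat-hf
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",  # vLLM's default port
    json={
        "model": "meta-llama/Llama-2-13b-chat-hf",
        "prompt": "Summarize our meeting notes:",
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["text"])
```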

My goal is to implement an API utilizing Llama models for a small organization's private use. I'm eager to learn from your experiences and any advice or insights you can share on this topic.