multiverse_fan

joined 10 months ago

It's working great so far. Just wanted to share and spread awareness that running multiple instances of the webui (oobabooga) is basically a matter of having enough RAM. I just finished running three models simultaneously (taking turns, of course). I only offloaded one layer to the GPU per model, used 5 threads per model, and set all contexts to 4K. (The computer has a 6-core CPU, 6GB VRAM, and 64GB RAM.)
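If anyone wants to try this, here's a minimal launcher sketch in Python. It assumes a stock text-generation-webui checkout and the llama.cpp loader flags (`--n-gpu-layers`, `--threads`, `--n_ctx`, `--listen-port`); double-check those against `python server.py --help` for your version, and the ports are just ones I picked.

```python
# launch_instances.py - start one webui per model, each on its own port.
# Sketch only: verify flag names against your text-generation-webui version.
import subprocess

MODELS = [  # (GGUF file, port) - substitute your own
    ("dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf", 7860),
    ("causallm_7b.Q5_K_M.gguf", 7861),
    ("mythomax-l2-13b.Q8_0.gguf", 7862),
]

procs = [
    subprocess.Popen([
        "python", "server.py",
        "--model", model,
        "--loader", "llama.cpp",
        "--n-gpu-layers", "1",      # one layer on the 6GB GPU per model
        "--threads", "5",           # 5 CPU threads per model
        "--n_ctx", "4096",          # 4K context each
        "--listen-port", str(port),
    ])
    for model, port in MODELS
]

for p in procs:
    p.wait()  # Ctrl+C stops the launcher; the child processes follow
```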

The models used were:

dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf

causallm_7b.Q5_K_M.gguf

mythomax-l2-13b.Q8_0.gguf (I meant to load a 7B for this one, though)

I like it because it's similar to the group chat on character.ai, but without the censorship, and I can edit any of the responses. The downsides are having to copy/paste between all the webui instances, and one of the models seemed to focus on one character instead of both. Also, I'm not sure what the actual context limit would be before the GPU runs out of memory.
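For a rough sense of where the context limit would bite, you can estimate KV-cache size per token with the standard formula (2 × layers × KV heads × head dim × bytes per element). The architecture numbers below are my assumptions from the usual Mistral-7B and Llama-2-13B configs, and with only one layer offloaded I believe llama.cpp keeps most of the cache in system RAM anyway:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# fp16 cache (2 bytes/element); layer/head counts assumed from the configs.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per

for name, per_tok in [
    ("Mistral-7B (GQA, 8 KV heads)", kv_bytes_per_token(32, 8, 128)),
    ("Llama-2-13B (no GQA)", kv_bytes_per_token(40, 40, 128)),
]:
    print(f"{name}: {per_tok // 1024} KiB/token, "
          f"{per_tok * 4096 / 2**30:.2f} GiB at 4K context")
# Mistral-7B (GQA, 8 KV heads): 128 KiB/token, 0.50 GiB at 4K context
# Llama-2-13B (no GQA): 800 KiB/token, 3.13 GiB at 4K context
```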

https://preview.redd.it/8i6wwjjtt54c1.png?width=648&format=png&auto=webp&s=26adca2a850f62165301390cdd4ba11548447c0d

https://preview.redd.it/3c9z5ee9u54c1.png?width=1154&format=png&auto=webp&s=210d7c67bcf0efafeb3f328e76199f13159dae64

https://preview.redd.it/lt8aizhbu54c1.png?width=1154&format=png&auto=webp&s=d24f8b2bf899084bbdb11d73e34b5564b629e0be

https://preview.redd.it/8lbl4nzeu54c1.png?width=1154&format=png&auto=webp&s=a81b8f1d8630e3d17ad37885915f8c7e3077584c

[–] multiverse_fan@alien.top 1 points 9 months ago

What would have happened if ChatGPT had been invented in the 17th century? MonadGPT is a possible answer.

TheBloke/MonadGPT-GGUF

[–] multiverse_fan@alien.top 1 points 9 months ago

I have an older 6GB 1660 and get around 0.3 t/s on a Q2 quant of Goliath 120B. I'm just thinking that, comparatively, your setup with a 20B model should be faster than that, but I'm sure I'm missing something. I guess with offloading, the CPU plays a role as well. How many cores ya got?
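For what it's worth, partially offloaded generation is usually bound by how fast the CPU-resident weights stream through system RAM, so you can sanity-check the number. The file size and bandwidth below are rough assumptions, not measurements:

```python
# Each generated token has to read the CPU-resident weights from RAM once,
# so t/s is capped at roughly bandwidth / weight bytes. Rough assumptions:
goliath_q2_gb = 50   # approximate Q2_K file size for a 120B model
ram_bw_gbs = 40      # realistic sustained dual-channel DDR4 bandwidth

print(f"upper bound ~{ram_bw_gbs / goliath_q2_gb:.1f} t/s")  # ~0.8 t/s
# Real throughput sits below the bound, so ~0.3 t/s on a 120B Q2 with
# almost nothing offloaded is in the expected ballpark.
```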


I've tried a few of these models, but it was some months ago. Have y'all seen any that can hold a conversation yet?

[–] multiverse_fan@alien.top 1 points 10 months ago

If I had the money, I'd go with the CPU.

Also, I'm not sure a 4090 could run 33B models at full precision. Wouldn't that require like 70GB of VRAM?
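The arithmetic checks out: fp16 is 2 bytes per parameter, so the weights alone for a 33B model come to roughly 66GB, before KV cache and activations. A quick sanity check:

```python
params = 33e9
print(f"{params * 2 / 1e9:.0f} GB at fp16")  # 66 GB, weights only
```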

[–] multiverse_fan@alien.top 1 points 10 months ago

Goliath was created by merging layers of Xwin and Euryale. From its model card:

The layer ranges used are as follows:
- range 0, 16 Xwin 
- range 8, 24 Euryale 
- range 17, 32 Xwin 
- range 25, 40 Euryale 
- range 33, 48 Xwin 
- range 41, 56 Euryale 
- range 49, 64 Xwin 
- range 57, 72 Euryale 
- range 65, 80 Xwin
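If I'm reading those as mergekit-style half-open [start, end) slices, you can count what the merge ends up with (both donors are 80-layer Llama-2-70B models). The half-open reading is my assumption, not something the card states:

```python
# Slices from the Goliath model card, read as half-open [start, end) ranges.
slices = [
    (0, 16, "Xwin"), (8, 24, "Euryale"), (17, 32, "Xwin"),
    (25, 40, "Euryale"), (33, 48, "Xwin"), (41, 56, "Euryale"),
    (49, 64, "Xwin"), (57, 72, "Euryale"), (65, 80, "Xwin"),
]
print(sum(end - start for start, end, _ in slices))  # 137 layers vs. 80 per donor
```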

I'm not sure how the model would be reduced to 70B unless it's through removing layers. Is that what "shearing" is? I don't understand what's being pruned there. Is it whole layers?

[–] multiverse_fan@alien.top 1 points 10 months ago

Cool, sounds like a good model to download and store for the future, when I can get access to better hardware.