multiverse_fan

joined 10 months ago

It's working great so far. Just wanted to share and spread awareness that running multiple instances of the webui (oobabooga) is basically a matter of having enough RAM. I just finished running three models simultaneously (taking turns, of course). I only offloaded one layer to the GPU per model, used 5 threads per model, and set all contexts to 4K. (The computer has a 6-core CPU, 6GB VRAM, and 64GB RAM.)
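If anyone wants to try this, here's a minimal launcher sketch in Python. It assumes a stock text-generation-webui checkout and the llama.cpp loader flags (`--n-gpu-layers`, `--threads`, `--n_ctx`, `--listen-port`); double-check those against `python server.py --help` for your version, and the ports are just ones I picked.

```python
# launch_instances.py - start one webui per model, each on its own port.
# Sketch only: verify flag names against your text-generation-webui version.
import subprocess

MODELS = [  # (GGUF file, port) - substitute your own
    ("dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf", 7860),
    ("causallm_7b.Q5_K_M.gguf", 7861),
    ("mythomax-l2-13b.Q8_0.gguf", 7862),
]

procs = [
    subprocess.Popen([
        "python", "server.py",
        "--model", model,
        "--loader", "llama.cpp",
        "--n-gpu-layers", "1",      # one layer on the 6GB GPU per model
        "--threads", "5",           # 5 CPU threads per model
        "--n_ctx", "4096",          # 4K context each
        "--listen-port", str(port),
    ])
    for model, port in MODELS
]

for p in procs:
    p.wait()  # Ctrl+C stops the launcher; the child processes follow
```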

The models used were:

dolphin-2.2.1-ashhlimarp-mistral-7b.Q8_0.gguf

causallm_7b.Q5_K_M.gguf

mythomax-l2-13b.Q8_0.gguf (I meant to load a 7B for this one, though)

I like it because it's similar to the group chat on character.ai, but without the censorship, and I can edit any of the responses. The downsides are having to copy/paste between all the webui instances, and one of the models seemed to focus on one character instead of both. Also, I'm not sure what the actual context limit would be before the GPU runs out of memory.
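For a rough sense of where the context limit would bite, you can estimate KV-cache size per token with the standard formula (2 × layers × KV heads × head dim × bytes per element). The architecture numbers below are my assumptions from the usual Mistral-7B and Llama-2-13B configs, and with only one layer offloaded I believe llama.cpp keeps most of the cache in system RAM anyway:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
# fp16 cache (2 bytes/element); layer/head counts assumed from the configs.
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per

for name, per_tok in [
    ("Mistral-7B (GQA, 8 KV heads)", kv_bytes_per_token(32, 8, 128)),
    ("Llama-2-13B (no GQA)", kv_bytes_per_token(40, 40, 128)),
]:
    print(f"{name}: {per_tok // 1024} KiB/token, "
          f"{per_tok * 4096 / 2**30:.2f} GiB at 4K context")
# Mistral-7B (GQA, 8 KV heads): 128 KiB/token, 0.50 GiB at 4K context
# Llama-2-13B (no GQA): 800 KiB/token, 3.13 GiB at 4K context
```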

https://preview.redd.it/8i6wwjjtt54c1.png?width=648&format=png&auto=webp&s=26adca2a850f62165301390cdd4ba11548447c0d

https://preview.redd.it/3c9z5ee9u54c1.png?width=1154&format=png&auto=webp&s=210d7c67bcf0efafeb3f328e76199f13159dae64

https://preview.redd.it/lt8aizhbu54c1.png?width=1154&format=png&auto=webp&s=d24f8b2bf899084bbdb11d73e34b5564b629e0be

https://preview.redd.it/8lbl4nzeu54c1.png?width=1154&format=png&auto=webp&s=a81b8f1d8630e3d17ad37885915f8c7e3077584c

[–] multiverse_fan@alien.top 1 points 9 months ago

What would have happened if ChatGPT had been invented in the 17th century? MonadGPT is a possible answer.

TheBloke/MonadGPT-GGUF

[–] multiverse_fan@alien.top 1 points 9 months ago

I have an older 6GB 1660 and get around 0.3 t/s on a Q2 quant of Goliath 120B. I'm just thinking that, comparatively, your setup with a 20B model should be faster than that, but I'm sure I'm missing something. I guess with offloading, the CPU plays a role as well. How many cores ya got?
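For what it's worth, partially offloaded generation is usually bound by how fast the CPU-resident weights stream through system RAM, so you can sanity-check the number. The file size and bandwidth below are rough assumptions, not measurements:

```python
# Each generated token has to read the CPU-resident weights from RAM once,
# so t/s is capped at roughly bandwidth / weight bytes. Rough assumptions:
goliath_q2_gb = 50   # approximate Q2_K file size for a 120B model
ram_bw_gbs = 40      # realistic sustained dual-channel DDR4 bandwidth

print(f"upper bound ~{ram_bw_gbs / goliath_q2_gb:.1f} t/s")  # ~0.8 t/s
# Real throughput sits below the bound, so ~0.3 t/s on a 120B Q2 with
# almost nothing offloaded is in the expected ballpark.
```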


I've tried a few of these models, but it was some months ago. Have y'all seen any that can hold a conversation yet?

[–] multiverse_fan@alien.top 1 points 10 months ago

If I had the money, I'd go with the CPU.

Also, I'm not sure a 4090 could run 33B models at full precision. Wouldn't that require like 70GB of VRAM?
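The arithmetic checks out: fp16 is 2 bytes per parameter, so the weights alone for a 33B model come to roughly 66GB, before KV cache and activations. A quick sanity check:

```python
params = 33e9
print(f"{params * 2 / 1e9:.0f} GB at fp16")  # 66 GB, weights only
```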

[–] multiverse_fan@alien.top 1 points 10 months ago

Goliath was created by merging layers of Xwin and Euryale. From its model card:

The layer ranges used are as follows:
- range 0, 16 Xwin 
- range 8, 24 Euryale 
- range 17, 32 Xwin 
- range 25, 40 Euryale 
- range 33, 48 Xwin 
- range 41, 56 Euryale 
- range 49, 64 Xwin 
- range 57, 72 Euryale 
- range 65, 80 Xwin
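If I'm reading those as mergekit-style half-open [start, end) slices, you can count what the merge ends up with (both donors are 80-layer Llama-2-70B models). The half-open reading is my assumption, not something the card states:

```python
# Slices from the Goliath model card, read as half-open [start, end) ranges.
slices = [
    (0, 16, "Xwin"), (8, 24, "Euryale"), (17, 32, "Xwin"),
    (25, 40, "Euryale"), (33, 48, "Xwin"), (41, 56, "Euryale"),
    (49, 64, "Xwin"), (57, 72, "Euryale"), (65, 80, "Xwin"),
]
print(sum(end - start for start, end, _ in slices))  # 137 layers vs. 80 per donor
```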

I'm not sure how the model would be reduced to 70B unless it's through removing layers. Is that what "shearing" is? I don't understand what's being pruned there. Is it whole layers?

[–] multiverse_fan@alien.top 1 points 10 months ago

Cool, sounds like a good model to download and store for the future, when I can get access to better hardware.