fakezeta

joined 10 months ago
[–] fakezeta@alien.top 1 points 9 months ago (1 children)

I hope something similar emerges on Linux.

SYCL could be a candidate, much like Vulkan is for 3D acceleration: it's a PITA to deal with CUDA, ROCm, and all the rest.

 

Optimum Intel int4 on iGPU UHD 770

I'd like to share inference results using the Optimum Intel library with the Starling-LM-7B Chat model quantized to int4 (NNCF), running on the Intel UHD Graphics 770 iGPU (i5-12600) with the OpenVINO runtime.

I think it's quite good: 16 tok/s with 25-30% CPU load. Performance is the same with int8 (NNCF) quantization.
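
For anyone who wants to reproduce this, below is a minimal sketch of the int4 path using the Optimum Intel API. Treat it as an illustration under assumptions, not the exact code I ran (that was adapted from the llama2.openvino repo linked below): OVWeightQuantizationConfig and the option names vary between optimum-intel versions, and the Starling HF repo id is assumed.

```python
# Minimal sketch: int4 (NNCF) weight compression with Optimum Intel + OpenVINO
# on an Intel iGPU. Assumes a recent optimum-intel (pip install optimum[openvino]);
# OVWeightQuantizationConfig and exact option names may differ across versions.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed HF repo id for Starling-LM-7B

# Export the model to OpenVINO IR with int4 weight compression via NNCF.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.to("GPU")   # target the Intel iGPU instead of the CPU
model.compile()

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```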

This runs inside a Proxmox VM with an SR-IOV virtualized GPU, 16 GB of RAM, and 6 cores. I also found that the ballooning device can crash the VM, so I disabled it; swap sits on a zram device instead.

free -h output while running inference:

               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.2Gi       573Mi       4.7Gi        13Gi       9.3Gi
Swap:           31Gi       256Ki        31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino

What are your thoughts on this?

[–] fakezeta@alien.top 1 points 10 months ago

So far, every fine-tuned version of Mistral I've tested has a high rate of hallucination, and this one seems to share that tendency.

[–] fakezeta@alien.top 1 points 10 months ago

Also added it to the Ollama library, in case anyone needs it:

https://ollama.ai/fakezeta/neural-chat-7b-v3-1
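
If you'd rather script against it than use the CLI, here's a minimal sketch that queries the model through a local Ollama server's REST API; it assumes the server is listening on the default port 11434 and the model has already been pulled.

```python
# Minimal sketch: querying the model through a local Ollama server.
# Assumes `ollama pull fakezeta/neural-chat-7b-v3-1` has been run and the
# server is listening on the default http://localhost:11434.
import json
import urllib.request

payload = {
    "model": "fakezeta/neural-chat-7b-v3-1",
    "prompt": "Why is the sky blue?",
    "stream": False,  # return one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```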

 

Couldn't wait for the great TheBloke to release it, so I uploaded a Q5_K_M GGUF of Intel/neural-chat-7b-v3-1.
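
To run the GGUF locally, here's a minimal sketch using llama-cpp-python; the file name and the neural-chat prompt format below are assumptions, so check the repo's model card.

```python
# Minimal sketch: running the Q5_K_M GGUF with llama-cpp-python.
# The model_path file name is an assumption; check the actual name in the repo.
from llama_cpp import Llama

llm = Llama(
    model_path="neural-chat-7b-v3-1.Q5_K_M.gguf",
    n_ctx=4096,      # context window size
    n_gpu_layers=0,  # CPU only; raise if built with GPU offload
)

# neural-chat prompt format as assumed from Intel's model card.
prompt = "### User:\nWhat is int4 quantization?\n### Assistant:\n"
out = llm(prompt, max_tokens=128)
print(out["choices"][0]["text"])
```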

From some preliminary tests on PISA sample questions, it seems at least on par with OpenHermes-2.5-Mistral-7B.

https://preview.redd.it/bkaezfb51c0c1.png?width=1414&format=png&auto=webp&s=735d0f03109488e01d65c1cf8ec676fa7e18c1d5