Optimum Intel int4 on iGPU UHD 770
I'd like to share my inference results for the Starling-LM-7B chat model, quantized to int4 with NNCF, running through the Optimum Intel library with OpenVINO on an Intel UHD Graphics 770 iGPU (i5-12600).
I think 16 tk/s at 25-30% CPU load is quite good. Performance is the same with int8 (NNCF) quantization.
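For anyone who wants to reproduce a number like this, here is a rough sketch of how the model could be loaded and the throughput timed. It is not the exact script I ran: the helper names load_openvino_model and tokens_per_second are made up for illustration, model_dir is assumed to be a directory holding an already exported and quantized OpenVINO model, and the timing covers the whole generate call, so prompt processing is included.

```python
import time


def load_openvino_model(model_dir, device="GPU"):
    """Load an exported OpenVINO model directory and move it to `device`.

    Hypothetical helper: `model_dir` is assumed to already contain the
    int4 (or int8) OpenVINO IR plus the tokenizer files.
    """
    # Imported lazily so the timing helper below works without these packages.
    from optimum.intel import OVModelForCausalLM
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = OVModelForCausalLM.from_pretrained(model_dir)
    model.to(device)  # "GPU" selects the OpenVINO GPU plugin (the iGPU here)
    return tokenizer, model


def tokens_per_second(generate, prompt_len, clock=time.perf_counter):
    """Time one call to `generate` and return newly generated tokens per second.

    `generate` is any zero-argument callable returning the total sequence
    length (prompt + new tokens); `prompt_len` is the prompt length in tokens.
    """
    start = clock()
    total_len = generate()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return (total_len - prompt_len) / elapsed


# Possible usage with a real model (needs optimum-intel and a model directory):
#   tokenizer, model = load_openvino_model("path/to/exported-model")
#   inputs = tokenizer("Hello, how are you?", return_tensors="pt")
#   prompt_len = inputs["input_ids"].shape[1]
#   rate = tokens_per_second(
#       lambda: model.generate(**inputs, max_new_tokens=128).shape[1],
#       prompt_len,
#   )
#   print(f"{rate:.1f} tk/s")
```

Because the clock is passed in as a parameter, the timing helper can be checked with a fake clock and without any model at all.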
This runs inside a Proxmox VM with 6 cores, 16 GB of RAM, and an SR-IOV virtualized GPU. I also found that the memory ballooning device can crash the VM, so I disabled it; swap is on a zram device.
free -h output during inference:
               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.2Gi       573Mi       4.7Gi        13Gi       9.3Gi
Swap:           31Gi       256Ki        31Gi
Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino
What are your thoughts on this?
I hope that something similar emerges on Linux.
SYCL could be a candidate, like Vulkan is for 3D acceleration: dealing with CUDA, ROCm, etc. is a PITA.