this post was submitted on 04 Dec 2023
 

Optimum Intel int4 on iGPU UHD 770

I'd like to share inference results from the Optimum Intel library with the Starling-LM-7B chat model quantized to int4 (via NNCF), running on an Intel UHD Graphics 770 iGPU (i5-12600) through OpenVINO.
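
For anyone wanting to try this, here's a minimal sketch of the setup, assuming a recent optimum-intel with the OpenVINO extras installed; the int4 quantization-config API and the "berkeley-nest/Starling-LM-7B-alpha" model id are my assumptions, and this is not the exact code from the repo linked below (chat-prompt formatting omitted for brevity):

```python
# Minimal sketch, not the exact code I ran: load Starling-LM-7B with
# Optimum Intel, compress weights to int4 via NNCF, and run on the iGPU.
# Assumes a recent `pip install optimum[openvino]`; OVWeightQuantizationConfig
# and the model id below are assumptions on my part.
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "berkeley-nest/Starling-LM-7B-alpha"  # assumed HF hub id

# export=True converts the HF checkpoint to OpenVINO IR;
# the quantization config triggers NNCF int4 weight compression.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.to("GPU")  # target the UHD 770 iGPU instead of the CPU

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```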

I think it's quite good: 16 tok/s with 25-30% CPU load. I get the same performance with int8 (NNCF) quantization.
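
For what it's worth, throughput like this can be measured with simple wall-clock timing around generate(); here's a rough sketch reusing the model, tokenizer, and inputs from above (it counts only newly generated tokens, so prefill is excluded):

```python
import time

# Rough throughput measurement: time a single generate() call and divide
# the number of newly generated tokens by the elapsed wall-clock time.
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / elapsed:.1f} tok/s over {elapsed:.1f}s")
```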

This is running inside a Proxmox VM with an SR-IOV virtualized GPU, 16 GB of RAM, and 6 cores. I also found that the memory ballooning device could crash the VM, so I disabled it; swap is on a zram device instead.

free -h output while inferencing:

               total        used        free      shared  buff/cache   available
Mem:            15Gi       6.2Gi       573Mi       4.7Gi        13Gi       9.3Gi
Swap:           31Gi       256Ki        31Gi

Code adapted from https://github.com/OpenVINO-dev-contest/llama2.openvino

What are your thoughts on this?

[–] fediverser@alien.top 1 points 9 months ago

This post is an automated archive of a submission made on /r/LocalLLaMA, powered by Fediverser software running on alien.top. Responses to this submission will not be seen by the original author until they claim ownership of their alien.top account. Please consider reaching out to let them know about this post and to help them migrate to Lemmy.

Lemmy users: you are still very much encouraged to participate in the discussion. There are many other subscribers on !localllama@poweruser.forum who can benefit from your contribution and join in the conversation.

Reddit users: you can also join the fediverse right away by visiting https://portal.alien.top. If you are looking for a Reddit alternative made for and by an independent community, check out Fediverser.

[–] fallingdowndizzyvr@alien.top 1 points 9 months ago (1 children)

There are quite a few Intel projects in AI. There's also the optimized DirectML backend they made with Microsoft, so anything that supports DirectML should be well optimized on Intel hardware, both CPUs and GPUs.

[–] fakezeta@alien.top 1 points 9 months ago (1 children)

I hope something similar emerges on Linux.

SYCL could be a candidate, like Vulkan is for 3D acceleration: it's a PITA to deal with CUDA, ROCm, and the rest.

[–] fallingdowndizzyvr@alien.top 1 points 9 months ago

That's why Intel is pitching oneAPI. They want it to be the single API that brings everything together, which is why it also supports NVIDIA GPUs, AMD GPUs, CPUs, and even FPGAs.