I finally managed to build llama.cpp on Windows on ARM, running on a Surface Pro X with the Qualcomm 8cx chip. Why bother with this instead of running it under WSL? It lets you run the largest models that can fit into system RAM without WSL's Hyper-V overhead.
I didn't notice any speed difference, but the extra available RAM means I can now use 7B Q5_K_M GGUF models instead of Q3. Typical output speed is 4 to 5 t/s.
Steps:
- Install MSYS2. The installer package includes both x64 and ARM64 binaries.
- Run clangarm64. Once you're in the shell, run these commands to install the required build packages:
- pacman -Suy
- pacman -S mingw-w64-clang-aarch64-clang
- pacman -S cmake
- pacman -S make
- pacman -S git
- Clone the git repo and set up the build environment. You need to make ARM64 clang appear as gcc by setting the environment variables below.
- git clone https://github.com/ggerganov/llama.cpp
- cd llama.cpp
- mkdir build
- cd build
- export CC=/clangarm64/bin/cc
- export CXX=/clangarm64/bin/c++
- Build llama.cpp.
- cmake ..
- cmake --build . --config Release
- Run main (example invocation below).
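
To make that last step concrete, here's roughly what an invocation looks like. The model filename is just a placeholder, and the bin/ output path is what I'd expect from this cmake setup, so adjust both to match your machine:

```
# Run from inside the build directory; with this setup the main binary
# should land under bin/. The model path is a placeholder -- point -m
# at whatever GGUF file you actually downloaded.
./bin/main -m ../models/llama-2-7b.Q5_K_M.gguf \
    -p "Explain what Windows on ARM is." \
    -n 128 -t 8
```

The -t 8 matches the 8cx's eight cores; tune it for your chip.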
If you're lucky, most of the package should build fine, but on my machine the quantize .exe failed to build. I tried ARM's own GNU toolchain compiler instead, but I kept getting build errors.
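
If quantize is the only thing that breaks for you too, one dodge worth trying (a sketch, assuming the target is still named main in your checkout) is to build just the binary you need so the broken target never gets compiled:

```
# Sketch: build only the main target so the failing quantize tool
# is skipped. Target names can change between llama.cpp versions.
cmake --build . --config Release --target main
```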
There should be a way to get NPU-accelerated model runs using the Qualcomm QNN SDK, Microsoft's ONNX Runtime, and ONNX models, but I got stuck in dependency hell in Visual Studio 2022. I'm not a Windows developer, and trying to combine x86, x64, and ARM64 compilers and Python binaries is way beyond me.
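
In case someone else wants to pick this up: assuming the onnxruntime-qnn package name and provider string from Microsoft's docs are right (I never got far enough to confirm either), the first sanity check would look roughly like this:

```
# Hypothetical sketch -- I never got this working myself. Install the
# QNN-enabled ONNX Runtime wheel, then check that the QNN execution
# provider shows up; 'QNNExecutionProvider' should appear in the list.
pip install onnxruntime-qnn
python -c "import onnxruntime as ort; print(ort.get_available_providers())"
```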