this post was submitted on 22 Nov 2023

LocalLLaMA


Community to discuss Llama, the family of large language models created by Meta AI.


Hey everyone,

I've been exploring running lightweight large language models (LLMs) on Android devices and came across the Android Neural Networks API (NNAPI), which seems promising for leveraging on-device neural accelerator silicon.

I own a Google Pixel 7, whose Tensor G2 chip includes an integrated TPU. I'm curious whether anyone here has experience or success stories using this API for AI inference, particularly compared to running on the CPU.

My main interest lies in understanding the practical performance gains when using the EdgeTPU for AI tasks. Does it significantly outperform the CPU in terms of inference speed or efficiency?

I'm especially keen to hear from those who have experimented with similar setups or have insights into optimizing LLMs on Android devices using this technology.

Thanks in advance for your insights and experiences!

top 3 comments
[–] Combinatorilliance@alien.top 1 points 11 months ago

I'm very interested in learning more as well.

Do you know how these edge TPUs compare to the Coral TPU? A few people here on LocalLLaMA have tried that one.

[–] phree_radical@alien.top 1 points 11 months ago

I dipped my toes in while comparing different methods of running Whisper on Android, and learned that developers aren't really meant to use NNAPI directly. Instead, you use a solution like TensorFlow Lite or PyTorch Mobile, which detects hardware support and provides delegates that it may or may not use depending on which is most efficient. A developer needs to convert/"optimize" a model so that it doesn't use any unsupported operations. There are also size considerations: the TPU and other accelerators probably don't have that much memory just yet.
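For reference, the conversion step with TensorFlow Lite's Python converter looks roughly like this. It's an untested sketch, and the SavedModel path and the quantization choice are placeholders, not something from this thread:

```python
# Rough sketch (untested): convert a SavedModel to a .tflite file that sticks to
# TFLite built-in ops, so the NNAPI delegate has a chance of accelerating it on-device.
# "my_saved_model" and the quantization choice are placeholders.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")

# Restrict to built-in ops; unsupported ops tend to fall back to the CPU.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS]

# Optional: default optimizations (e.g. quantization) to shrink the model,
# which matters given how little memory these accelerators have.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```

On the Android side, the TFLite Interpreter can then be handed an NnApiDelegate, and it decides per-op whether the accelerator or the CPU runs it.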

[–] GlobalRevolution@alien.top 1 points 11 months ago

Yup, it definitely will help speed up inference on models you can get working.

My personal recommendation is to start with something like PyTorch Mobile or TensorFlow Lite (whichever you prefer). The main benefit is that you can take a model in PyTorch and compile it down to a representation that will use NNAPI.
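If memory serves, PyTorch ships a prototype converter for exactly this (torch.backends._nnapi). A rough, untested sketch of the flow, using MobileNetV2 purely as a stand-in since a full LLM won't lower this cleanly:

```python
# Rough sketch based on PyTorch's prototype NNAPI converter (untested; the API is
# marked prototype, so details may have shifted). MobileNetV2 is just a stand-in model.
import torch
import torch.backends._nnapi.prepare
import torchvision

model = torchvision.models.mobilenet_v2(pretrained=True).eval()

# NNAPI expects channels-last (NHWC) inputs.
example = torch.zeros(1, 3, 224, 224).contiguous(memory_format=torch.channels_last)
example.nnapi_nhwc = True

with torch.no_grad():
    traced = torch.jit.trace(model, example)

# Lower the traced graph to NNAPI ops and save it for the mobile (lite) interpreter.
nnapi_model = torch.backends._nnapi.prepare.convert_model_to_nnapi(traced, example)
nnapi_model._save_for_lite_interpreter("mobilenetv2-nnapi.ptl")
```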

You can pretty quickly use the examples in this repo to try out running a language model like BERT. It also walks you through converting a model and running it on your phone.

https://github.com/pytorch/android-demo-app

If you're going after maximum performance on a particular model, then it might make more sense to learn NNAPI directly and try to build it yourself. Personally, I would probably try to work with the open-source community to add an NNAPI backend to llama.cpp:

https://github.com/ggerganov/llama.cpp/issues/2687