NPU TFLite Delegate

Which Khadas SBC do you use?

VIM3 A311D

Which system do you use? Android, Ubuntu, OOWOW or others?

Kernel 4.9 Ubuntu 20.04

Which version of system do you use? Khadas official images, self built images, or others?

gnome

Please describe your issue below:

I have a TFLite model for object detection which I want to run on the NPU for acceleration. I used the NPU TFLite delegate as described in the examples. Inference with the models from the examples is fast, but with our model it is slow (e.g. detection takes about 4 s on the NPU versus 0.20 s on the CPU).
Please guide me on the following points.

  1. Is it a problem with the model? (please find the model below)
  2. How can I tell that the VX delegate is working, i.e. that the model is actually running on the NPU?
  3. Do I need to build the VX delegate from source?
  4. Can I use the “libvx_delegate.so” provided with the examples directly with my object-detection model? (See the sketch below for how I load it.)
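
For reference, this is roughly how I load the delegate and time the inference (a minimal sketch rather than my exact script; the model filename is a placeholder and I assume the tflite_runtime package):

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the prebuilt VX delegate shipped with the examples.
vx_delegate = tflite.load_delegate("libvx_delegate.so")
interpreter = tflite.Interpreter(
    model_path="detection_model.tflite",  # placeholder name
    experimental_delegates=[vx_delegate],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's input tensor.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

# The first invoke on the VX delegate typically includes NPU graph
# compilation, so warm up once before timing.
interpreter.invoke()

start = time.time()
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))
print(time.time() - start, "sec")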

Log with NPU:
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
WARNING: Fallback unsupported op 32 to TfLite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
[[0.21484375 0.16015625 0.09765625 0.0859375 0.07421875 0.05859375
0.05078125 0.046875 0.046875 0.046875 0.046875 0.046875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03125
0.03125 0.03125 0.03125 0.02734375 0.02734375 0.02734375
0.02734375]]
4.172280788421631 sec

Log with CPU:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[[0.20703125 0.16796875 0.09765625 0.078125 0.05859375 0.0546875
0.05078125 0.046875 0.04296875 0.04296875 0.04296875 0.04296875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03515625
0.03515625 0.03515625 0.03125 0.03125 0.03125 0.03125
0.02734375]]
0.19845318794250488 sec

Code and Model

@numbqq
@Louis-Cheng-Liu

Hello @Chetan_Deshmukh

You can enable NPU debugging; see NPU Performance Analysis [Khadas Docs].
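
If I remember the linked page correctly, the profiling output is enabled through environment variables that must be set before the delegate is loaded, roughly like this (the variable names come from the VeriSilicon NPU driver; please check the docs page for the authoritative list):

import os

# Enable NPU per-layer profiling/debug output. These are VeriSilicon
# driver variables; set them before creating the interpreter.
os.environ["VIV_VX_DEBUG_LEVEL"] = "1"
os.environ["VIV_VX_PROFILE"] = "1"
os.environ["CNN_PERF"] = "1"
os.environ["NN_EXT_SHOW_PERF"] = "1"

# ...then load libvx_delegate.so and run inference as usual.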

It seems your model contains an unsupported operator.

That op falls back to XNNPACK on the CPU instead of running on the NPU here.

Please check whether all the ops in your model are in the supported category here:
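
The warning “Fallback unsupported op 32 to TfLite” gives the builtin operator code, and in the TFLite schema code 32 is CUSTOM, so the fallback is most likely a custom op (for object-detection models this is commonly the detection post-processing op). To list the operators in your model you can do something like the following (note that _get_ops_details() is a private helper and may differ between TensorFlow versions):

import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="detection_model.tflite")  # placeholder
interpreter.allocate_tensors()

# Print every operator in the graph so it can be checked against
# the vx-delegate supported-ops list.
for op in interpreter._get_ops_details():
    print(op["index"], op["op_name"])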

Yes, you can use it.