NPU TFLite Delegate

Which Khadas SBC do you use?

VIM3 A311D

Which system do you use? Android, Ubuntu, OOWOW or others?

Kernel 4.9 Ubuntu 20.04

Which version of system do you use? Khadas official images, self built images, or others?

gnome

Please describe your issue below:

I have a TFLite model for object detection which I want to run on the NPU for acceleration. I used the NPU TFLite delegate as described in the examples. Inference with the models from the examples is fast, but with our model it is slow (e.g. detection takes about 4 s on the NPU versus 0.20 s on the CPU).
Please guide me on the following points.

  1. Is it a problem with the model? (please find the model below)
  2. How can I tell that the VX delegate is working, i.e. that the model is actually running on the NPU?
  3. Do I need to build the VX delegate from source?
  4. Can I use the “libvx_delegate.so” provided with the examples directly with my object-detection model? (See the sketch below for how I load it.)
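
For reference, this is roughly how I load the delegate and time the inference (a minimal sketch rather than my exact script; the model filename is a placeholder and I assume the tflite_runtime package):

import time
import numpy as np
import tflite_runtime.interpreter as tflite

# Load the prebuilt VX delegate shipped with the examples.
vx_delegate = tflite.load_delegate("libvx_delegate.so")
interpreter = tflite.Interpreter(
    model_path="detection_model.tflite",  # placeholder name
    experimental_delegates=[vx_delegate],
)
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy input matching the model's input tensor.
dummy = np.zeros(input_details[0]["shape"], dtype=input_details[0]["dtype"])
interpreter.set_tensor(input_details[0]["index"], dummy)

# The first invoke on the VX delegate typically includes NPU graph
# compilation, so warm up once before timing.
interpreter.invoke()

start = time.time()
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))
print(time.time() - start, "sec")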

Log with NPU:
Vx delegate: allowed_cache_mode set to 0.
Vx delegate: device num set to 0.
Vx delegate: allowed_builtin_code set to 0.
Vx delegate: error_during_init set to 0.
Vx delegate: error_during_prepare set to 0.
Vx delegate: error_during_invoke set to 0.
WARNING: Fallback unsupported op 32 to TfLite
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
W [HandleLayoutInfer:291]Op 162: default layout inference pass.
[[0.21484375 0.16015625 0.09765625 0.0859375 0.07421875 0.05859375
0.05078125 0.046875 0.046875 0.046875 0.046875 0.046875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03125
0.03125 0.03125 0.03125 0.02734375 0.02734375 0.02734375
0.02734375]]
4.172280788421631 sec

Log with CPU:
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
[[0.20703125 0.16796875 0.09765625 0.078125 0.05859375 0.0546875
0.05078125 0.046875 0.04296875 0.04296875 0.04296875 0.04296875
0.0390625 0.03515625 0.03515625 0.03515625 0.03515625 0.03515625
0.03515625 0.03515625 0.03125 0.03125 0.03125 0.03125
0.02734375]]
0.19845318794250488 sec

Code and Model

@numbqq
@Louis-Cheng-Liu

Hello @Chetan_Deshmukh

You can enable NPU debugging; see NPU Performance Analysis [Khadas Docs].
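
If I remember the linked page correctly, the profiling output is enabled through environment variables that must be set before the delegate is loaded, roughly like this (the variable names come from the VeriSilicon NPU driver; please check the docs page for the authoritative list):

import os

# Enable NPU per-layer profiling/debug output. These are VeriSilicon
# driver variables; set them before creating the interpreter.
os.environ["VIV_VX_DEBUG_LEVEL"] = "1"
os.environ["VIV_VX_PROFILE"] = "1"
os.environ["CNN_PERF"] = "1"
os.environ["NN_EXT_SHOW_PERF"] = "1"

# ...then load libvx_delegate.so and run inference as usual.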

It seems your model contains an unsupported operator.

That op falls back to XNNPACK on the CPU instead of running on the NPU here.

Please check whether all the ops in your model are in the supported category here:
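
The warning “Fallback unsupported op 32 to TfLite” gives the builtin operator code, and in the TFLite schema code 32 is CUSTOM, so the fallback is most likely a custom op (for object-detection models this is commonly the detection post-processing op). To list the operators in your model you can do something like the following (note that _get_ops_details() is a private helper and may differ between TensorFlow versions):

import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="detection_model.tflite")  # placeholder
interpreter.allocate_tensors()

# Print every operator in the graph so it can be checked against
# the vx-delegate supported-ops list.
for op in interpreter._get_ops_details():
    print(op["index"], op["op_name"])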

Yes, you can use it.