At first I tried using the TFLite delegate for image classification. I thought this was the way to go, since I could reuse the same models with the ARM NN TFLite delegate, but it turns out the NPU delegate is so slow that inference on the GPU is faster than on the NPU.
I then found out that I could use the KSNN Python implementation, and this works much faster. The only hurdle left is converting the models I want to use, and this turned out to be harder than I thought. At first the git lfs clone didn't work due to a usage limitation that only the admin of the source repository could change, and then I checked out an older commit where the submodules were still used. This didn't work either, since the contents of what should be the convert script or binary stayed like this:
version https://git-lfs.github.com/spec/v1
oid sha256:54d1b4d207921a64c603b260a92cfb10e262f838c3a1b7c797fe9365e633d7fa
size 356138264
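A file that looks like this is a Git LFS pointer, not the real binary. As a quick sanity check (a minimal sketch, not part of any SDK), a file can be tested for the LFS pointer signature before trying to run it:

```python
def is_lfs_pointer(path):
    """Return True if the file is a Git LFS pointer instead of real content.

    LFS pointer files start with a fixed version line and are only a few
    hundred bytes, while the real convert binary is ~356 MB.
    """
    with open(path, 'rb') as f:
        head = f.read(100)
    return head.startswith(b'version https://git-lfs.github.com/spec/v1')
```

If this returns True, the actual content still has to be fetched (e.g. with `git lfs pull`) or obtained another way, such as a tarball of the repository.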
In the end I would really like to use the TFLite VX delegate so I can use the same TFLite models as for GPU inference, but I hope it can be made faster by applying some tweaks.
About the TFLite VX Delegate, you can ask @Electr1 for help.
The paper is old, but it is still usable. It may only have a TensorFlow example, which could be the source of the misunderstanding. I will contact the relevant staff to update it.
The TFLite delegate will have a slight performance drop relative to KSNN because of the way certain operations are executed on the system. KSNN compiles the model ahead of time and picks the appropriate hardware resource (NPU or CPU) for each operation, since certain operations are memory bound and run faster on the CPU. The TFLite delegate, however, executes every supported layer on the NPU's compute unit.
Could you share some details of the model you were running with the TFLite delegate?
@Louis-Cheng-Liu Thank you for your response. Downloading the tarball instead of cloning the repo fixed the issue with the convert script being empty. I can now convert TFLite models, but I don't know whether I am doing something wrong or the input model should be different. I tried converting a non-quantized TFLite model, but when using the converted model with KSNN, the predictions were way off:
Top 5 NPU predictions:
1. goldfish, Carassius auratus: 0.9985
2. hamper: 0.0002
3. king snake, kingsnake: 0.0001
4. television, television system: 0.0001
5. fox squirrel, eastern fox squirrel, Sciurus niger: 0.0000
Should I use a quantized model as input for the convert script? Is that what the quantized-type parameter is about?
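For context on what (de)quantization does to the numbers, here is a minimal NumPy sketch of the standard affine uint8 scheme. The scale and zero point below are made up for illustration; in practice the convert tool derives them from a calibration dataset:

```python
import numpy as np

# Assumed example parameters; real values come from calibration.
scale, zero_point = 0.02, 128

def quantize(x):
    # Map float values onto the uint8 grid.
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

def dequantize(q):
    # Map uint8 values back to floats; a small rounding error remains.
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
roundtrip = dequantize(quantize(x))
```

If the scale and zero point baked into the converted model do not match the data the model actually sees, predictions degrade badly, which may explain results like the top-5 list above.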
@Electr1 Thank you for your response. I used the mobilenet_v1_uint8 model that is provided in the vx_tflite repository. Do you want to see a comparison between the vx delegate that utilizes the NPU and the ARM NN that utilizes the GPU?
import torch
import torchvision
import ai_edge_torch

# Initialize the pretrained mobilenet_v3_large model in eval mode.
# Note: recent torchvision versions take `weights` as a keyword-only argument.
mobilenet = torchvision.models.mobilenet_v3_large(
    weights=torchvision.models.MobileNet_V3_Large_Weights.IMAGENET1K_V1
).eval()

# Use the convert method from the AI Edge Torch library to convert the PyTorch model.
sample_input = (torch.randn(1, 3, 224, 224),)
edge_model = ai_edge_torch.convert(mobilenet, sample_input)

# Export and save the converted model in the .tflite format for future use.
edge_model.export('./mobilenet.tflite')
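A quick way to check that an export like this produced a real TFLite flatbuffer (and not an empty or truncated file) is to look at the flatbuffer file identifier; TFLite models carry the 4-byte identifier `TFL3` at byte offset 4. A small sketch:

```python
def looks_like_tflite(path):
    """Check the flatbuffer file identifier of a .tflite model.

    TFLite flatbuffers store the 4-byte identifier 'TFL3' at offset 4.
    This only confirms the container format, not that the graph is valid.
    """
    with open(path, 'rb') as f:
        header = f.read(8)
    return len(header) == 8 and header[4:8] == b'TFL3'
```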
I then called the convert script with multiple combinations of arguments:
Thank you for your input @Louis-Cheng-Liu. I didn't know I had to provide a quantization dataset with the --source-files option. I retried the conversion with a set of 500 quantized images, but the results got worse, I think:
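In case it helps others: as far as I understand, the file passed via --source-files is just a plain text list with one image path per line. Something like this throwaway sketch can generate it (the function name and file layout are my own choice, not mandated by the SDK):

```python
import os

def write_dataset_list(image_dir, list_path, exts=('.jpg', '.jpeg', '.png')):
    # Collect calibration image paths and write them one per line,
    # which is the plain path-list format the convert tool reads.
    paths = sorted(
        os.path.join(image_dir, name)
        for name in os.listdir(image_dir)
        if name.lower().endswith(exts)
    )
    with open(list_path, 'w') as f:
        f.write('\n'.join(paths) + '\n')
    return len(paths)
```

The calibration images should go through the same preprocessing the model expects at inference time, otherwise the computed quantization ranges will be off.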
Is self._VAR[0] equal to 128? If so, there is nothing wrong.
Could you provide your model? If not, you can refer to the doc in aml_npu_sdk/docs/en/NN Tool FAQ (0.5).pdf, Section 4.2. It may help you find the issue.
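The 128 above is presumably the mean subtracted from uint8 input pixels. A common normalization for models converted this way (assumed values: mean 128 and scale 1/128, i.e. map [0, 255] to roughly [-1, 1]; the actual numbers depend on the convert settings) looks like:

```python
import numpy as np

# Assumed preprocessing parameters; check them against your convert settings.
MEAN = 128.0
SCALE = 0.0078125  # 1/128

def preprocess(img_uint8):
    # Normalize a uint8 image the same way the model was calibrated.
    return (img_uint8.astype(np.float32) - MEAN) * SCALE
```

Feeding raw [0, 255] pixels into a model calibrated for normalized input (or vice versa) is a common cause of wildly wrong predictions.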
Testing was conducted previously and we know the performance difference that vx_tflite has; however, I don't have access to those documents at the moment.
The tflite model I tried to convert is available here.
I'm very thankful for your support, but at this point I'm not in a position to invest more time in using this NPU, and I will focus on the ARM NN delegate to utilize the CPU/GPU.