TFLite delegate or KSNN model conversion

Which system do you use? Android, Ubuntu, OOWOW or others?

Ubuntu

Which version of system do you use? Please provide the version of the system here:

$ uname -a
Linux Khadas 5.15.137 #1.7.4 SMP PREEMPT Wed Apr 23 10:46:01 CST 2025 aarch64 aarch64 aarch64 GNU/Linux

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=24.04
DISTRIB_CODENAME=noble
DISTRIB_DESCRIPTION="Ubuntu 24.04.2 LTS"

Please describe your issue below:

First the two questions, then the background to them.

Questions:

  1. Why is the tflite vx delegate so much slower than the KSNN implementation when running inference on the NPU?
  2. Is the page about model conversion for KSNN outdated? I cannot get it to work by following the steps in Instructions for KSNN conversion tool [Khadas Docs].

Background:

  1. At first I tried using the tflite delegate to do image classification. I thought this was the way to go, since I could use the same models for the ARM NN tflite delegate, but it turns out the delegate for the NPU is so slow that the inference on the GPU is faster than on the NPU.
  2. I then found out that I could use the KSNN Python implementation, and this works much faster. The only hurdle is converting the models I want to use, and this turned out to be harder than I thought. At first the git lfs clone didn’t work due to some usage limitations that only the admin of the source file could change. I then checked out an older commit where the submodules were still used. This didn’t work either, since the contents of what should be the convert script or binary stayed like this:
version https://git-lfs.github.com/spec/v1
oid sha256:54d1b4d207921a64c603b260a92cfb10e262f838c3a1b7c797fe9365e633d7fa
size 356138264

In the end I would really like to use the tflite vx delegate so I can use the same tflite models for the GPU inference, but I hope it can work faster by applying some tweaks.
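The three lines shown above (version, oid, size) are the format of a Git LFS pointer file, meaning the real binary was never downloaded. A minimal sketch for checking whether a checked-out file is still a pointer, assuming pointer files are small and start with the spec line seen above (the helper name `is_lfs_pointer` is made up for illustration):

```python
from pathlib import Path

# First line of every pointer file, as seen in the output above.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer rather than real content."""
    file_path = Path(path)
    # Pointer files are tiny text files; a real converter binary is far larger.
    if file_path.stat().st_size > 1024:
        return False
    with file_path.open("rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Calling `is_lfs_pointer("convert")` on the checked-out script would then show whether the LFS download actually happened.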

Hello @erikh ,

About TFLite VX Delegate, you can ask @Electr1 for help.

The page is old, but it can still be used. Maybe it is misleading because it only has a TensorFlow example. I will contact the relevant staff to update it.

You can try the following command to convert.

$ ./convert --model-name mobilenet_ssd \
>           --platform tflite \
>           --model xxx.tflite \
>           --mean-values '127.5 127.5 127.5 0.007843137' \
>           --quantized-dtype dynamic_fixed_point \
>           --qtype int8 \
>           --source-files ./data/dataset/dataset0.txt \
>           --kboard VIM3 --print-level 0

Remember to modify mean-values for your model. For more information about the convert parameters, you can refer to this doc: ksnn/docs/ksnn_user_usage_v1.4.pdf at master · khadas/ksnn.
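As a rough illustration of what the --mean-values argument above expresses, here is a sketch that assumes the first three numbers are per-channel means and the fourth is a multiplicative scale (0.007843137 is approximately 1 / 127.5). That reading is an assumption and should be checked against the KSNN user guide; the function name is only illustrative.

```python
# Values copied from the --mean-values argument above.
MEAN = (127.5, 127.5, 127.5)
SCALE = 0.007843137  # approximately 1 / 127.5


def normalize_pixel(rgb):
    """Apply (value - mean) * scale to one (R, G, B) pixel with 0..255 values."""
    return tuple((value - mean) * SCALE for value, mean in zip(rgb, MEAN))


print(normalize_pixel((0, 127.5, 255)))  # roughly (-1.0, 0.0, 1.0)
```

Under that assumption, these values map the 0..255 pixel range onto roughly -1..1, so a model trained with a different input normalization would need different numbers.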


If git clone fails, you can try downloading the package instead.

If KSNN still has problems, please ask me for help and provide your model.

Hi @erikh

The TFLite delegate will have a slight performance drop relative to KSNN because of the way certain operations are executed on the system. KSNN compiles the model ahead of time to pick the correct hardware resource (NPU or CPU), as certain operations can be memory bound and run faster on the CPU. The TFLite delegate, however, executes all operations on the NPU’s compute unit as long as the layer is supported.
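When comparing the two backends, it may help to measure steady-state latency separately from any one-time setup cost. The sketch below is a generic timing helper, not part of KSNN or TFLite; `run_once` is a hypothetical zero-argument callable that would wrap a single inference call for whichever backend is being measured.

```python
import statistics
import time


def benchmark(run_once, warmup=3, repeats=20):
    """Time a zero-argument callable, discarding the first `warmup` calls."""
    for _ in range(warmup):
        run_once()  # warm-up runs are executed but not timed
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return {
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }
```

Running the same helper against the NPU and GPU paths with identical inputs would give numbers that can be compared directly.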

Could you share some details of the model you were running with the TFLite delegate?

Cheers

@Louis-Cheng-Liu Thank you for your response. Downloading the tarball instead of cloning the repo fixed the issue with the convert script being empty. I can now convert tflite models, but I don’t know whether I am doing something wrong or whether the input model should be different. I tried converting a non-quantized tflite model, but when using this converted model with KSNN, the predictions were way off:

Top 5 NPU predictions:
1. window screen: 7.9375
2. shower curtain: 6.8125
3. window shade: 6.8125
4. binder, ring-binder: 5.6875
5. fire screen, fireguard: 5.5625

where it should be:

Top 5 NPU predictions:
1. goldfish, Carassius auratus: 0.9985
2. hamper: 0.0002
3. king snake, kingsnake: 0.0001
4. television, television system: 0.0001
5. fox squirrel, eastern fox squirrel, Sciurus niger: 0.0000

Should I use a quantized model as input for the convert script? Is that what the quantized type parameter is about?
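The two lists above also differ in scale: the first looks like raw scores, while the second looks like probabilities that sum to about 1. One possible reading, which is an assumption here, is that only the reference output went through a softmax. A minimal sketch of that step, which makes the numbers comparable but does not change the ranking:

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


# The five scores from the first list above; a real run would pass all classes.
top5_logits = [7.9375, 6.8125, 6.8125, 5.6875, 5.5625]
print([round(p, 4) for p in softmax(top5_logits)])
```

Since softmax preserves the order of the scores, the wrong top-1 class remains a real discrepancy either way.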

@Electr1 Thank you for your response. I used the mobilenet_v1_uint8 model that is provided in the vx_tflite repository. Do you want to see a comparison between the vx delegate that utilizes the NPU and the ARM NN that utilizes the GPU?

Hello @erikh ,

Could you provide your model and the command used for conversion? From your information, I am not sure what is wrong with it.

Hello @Louis-Cheng-Liu,

I obtained the tflite model using this script:

import torchvision
import torch
import ai_edge_torch

# Initialize and convert mobilenet_v3_large model
mobilenet = torchvision.models.mobilenet_v3_large(
    torchvision.models.MobileNet_V3_Large_Weights.IMAGENET1K_V1).eval()

# Use the convert method from the AI Edge Torch library to convert the PyTorch model.
sample_input = (torch.randn(1, 3, 224, 224),)
edge_model = ai_edge_torch.convert(mobilenet.eval(), sample_input)

# Export and save the converted model in the .tflite format for future use.
edge_model.export('./mobilenet.tflite')

I then called the convert script with multiple combinations of arguments:

./convert \
--model-name mobilenet_v3_aa \
--platform tflite \
--model ./models/mobilenet.tflite \
--mean-values '127.5 127.5 127.5 0.007843137' \
--quantized-dtype asymmetric_affine \
--source-files ./data/dataset/dataset0.txt \
--kboard VIM3 --print-level 0

./convert \
--model-name mobilenet_v3_int8 \
--platform tflite \
--model ./models/mobilenet.tflite \
--mean-values '127.5 127.5 127.5 0.007843137' \
--quantized-dtype dynamic_fixed_point \
--qtype int8 \
--source-files ./data/dataset/dataset0.txt \
--kboard VIM3 --print-level 0

./convert \
--model-name mobilenet_v3_int16 \
--platform tflite \
--model ./models/mobilenet.tflite \
--mean-values '127.5 127.5 127.5 0.007843137' \
--quantized-dtype dynamic_fixed_point \
--qtype int16 \
--source-files ./data/dataset/dataset0.txt \
--kboard VIM3 --print-level 0

Why isn’t it possible to convert to a non-quantized model?
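One thing that may be worth checking: torchvision documents its pretrained ImageNet classification weights as expecting inputs scaled to 0..1 and then normalized with mean (0.485, 0.456, 0.406) and std (0.229, 0.224, 0.225), whereas the commands above use 127.5 and roughly 1 / 127.5. Whether this explains the wrong predictions is not confirmed anywhere in this thread. The sketch below only computes the equivalent values on the 0..255 scale; using one averaged scale is an approximation, because a single scale cannot represent three per-channel standard deviations.

```python
# ImageNet statistics documented for torchvision's pretrained classification weights.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

# Equivalent per-channel means on the 0..255 pixel scale.
mean_255 = [round(m * 255, 3) for m in IMAGENET_MEAN]

# One shared scale can only approximate three per-channel standard deviations.
average_std_255 = sum(s * 255 for s in IMAGENET_STD) / len(IMAGENET_STD)
approx_scale = 1.0 / average_std_255

print("means on 0..255 scale:", mean_255)
print("approximate single scale:", round(approx_scale, 6))
```

If the converter accepts these numbers in --mean-values, the same values would also have to be used in the preprocessing code that feeds the NPU.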

@Louis-Cheng-Liu If it helps your diagnosis, here are the top 5 predictions of the 3 converted models:

$ ./npu_run mobilenet_v3_aa
Creating NPU Classifier...
 |---+ KSNN Version: v1.4 +---| 
Start init neural network...
Done. inference :  0.01725459098815918
NPU inference time: 0.023158788681030273

Top 5 NPU predictions:
1. window screen: 7.3075
2. window shade: 7.1584
3. fire screen, fireguard: 6.4624
4. shower curtain: 5.6670
5. cellular telephone, cellular phone, cellphone, cell, mobile phone: 4.8717


$ ./npu_run mobilenet_v3_int8
Creating NPU Classifier...
 |---+ KSNN Version: v1.4 +---| 
Start init neural network...
Done. inference :  0.017688989639282227
NPU inference time: 0.024018526077270508

Top 5 NPU predictions:
1. window screen: 7.9375
2. shower curtain: 6.8125
3. window shade: 6.8125
4. binder, ring-binder: 5.6875
5. fire screen, fireguard: 5.5625


$ ./npu_run mobilenet_v3_int16
Creating NPU Classifier...
 |---+ KSNN Version: v1.4 +---| 
Start init neural network...
Done. inference :  0.025995731353759766
NPU inference time: 0.03178119659423828

Top 5 NPU predictions:
1. window screen: 7.9998
2. fire screen, fireguard: 6.1021
3. shower curtain: 5.9553
4. lighter, light, igniter, ignitor: 5.4478
5. binder, ring-binder: 5.3276

Hello @erikh ,

How many images have you used for quantization (the source-files parameter)? I suggest using about 200-500 images.

Non-quantized models are not supported at the moment.

Could you provide the original code of ./npu_run? Have you normalized the input before nn_inference?
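A small sketch for building the list file passed to --source-files, assuming it is a plain text file with one image path per line (that format is an assumption and should be checked against the KSNN user guide). The function name and the suffix set are illustrative.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def write_dataset_list(image_dir, list_path, limit=500):
    """Write up to `limit` image paths from image_dir to list_path, one per line."""
    images = sorted(
        p for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )[:limit]
    Path(list_path).write_text("".join(f"{p}\n" for p in images))
    return len(images)
```

The returned count makes it easy to confirm that the list falls in the suggested range of about 200-500 images.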

Thank you for your input @Louis-Cheng-Liu. I didn’t know I had to provide a quantized dataset with the --source-files option. I retried the conversion with a set of 500 quantized images, but the results got worse, I think:

$ ./npu_run mobilenet_v3_int8
Creating NPU Classifier...
 |---+ KSNN Version: v1.4 +---| 
Start init neural network...
Done. inference :  0.01970076560974121
NPU inference time: 0.025630950927734375

Top 5 NPU predictions:
1. window screen: 12.6250
2. window shade: 7.8750
3. shower curtain: 7.3750
4. fire screen, fireguard: 6.0000
5. binder, ring-binder: 5.3750
Free RAM: 1.33 GB

It’s a bit hard to share all the code behind ./npu_run, but the core is this:

    def _init_neural_network(self):
        self._ksnn = KSNN('VIM3')
        print(f' |---+ KSNN Version: {self._ksnn.get_nn_version()} +---| ')
        print('Start init neural network...')
        self._ksnn.nn_init(library=self._lib, model=self._model, level=0)

    def _preprocess_image(self, image_path):
        cv_img = []
        orig_img = cv.imread(image_path, cv.IMREAD_COLOR)
        img = cv.resize(orig_img, (224, 224)).astype(np.float32)
        img[:, :, 0] = img[:, :, 0] - self._MEAN[0]
        img[:, :, 1] = img[:, :, 1] - self._MEAN[1]
        img[:, :, 2] = img[:, :, 2] - self._MEAN[2]
        img = img / self._VAR[0]

        img = cv.cvtColor(img, cv.COLOR_BGR2RGB)
        cv_img.append(img)

        return cv_img

    def classify(self, image_path):
        """Classify an image and return probabilities and top-5 indices."""
        # Preprocess the image
        image = self._preprocess_image(image_path)

        # Run inference
        start = time.time()
        outputs = self._ksnn.nn_inference(
            image,
            platform='TFLITE',
            reorder='0 1 2',
            output_tensor=3,
            output_format=output_format.OUT_FORMAT_FLOAT32
        )
        end = time.time()
        print('Done. inference : ', end - start)

        # Flatten output if needed
        probabilities = outputs[0].reshape(-1)  # Flatten the output
        if len(probabilities) == 1001:
            probabilities = probabilities[1:]  # Remove background class

        # Get top-5 indices
        top5_indices = np.argsort(probabilities)[-5:][::-1]

        return probabilities, top5_indices

Hello @erikh ,

Is self._VAR[0] 128? If yes, there is nothing wrong.

Could you provide your model? If not, you can refer to Section 4.2 of the doc aml_npu_sdk/docs/en/NN Tool FAQ (0.5).pdf. It may help you find the issue.

Testing was conducted previously, and we know the performance difference that vx_tflite has; however, I don’t have access to those documents at the moment :slightly_smiling_face:

Hello @Louis-Cheng-Liu,
Indeed, self._VAR[0] == 128

The tflite model I tried to convert is available here.

I’m very thankful for your support, but at this point I’m not in a position to invest more time in the use of this NPU, and I will focus on using the ARM NN delegate to utilize the CPU/GPU.