NPU Utilization reporting gaps (spikes and 0%) on new VIM4 during YOLOv8 inference

Hi everyone,

I am currently profiling the new VIM4 using YOLOv8s via the KSNN API.

While monitoring NPU utilization through the /sys/class/adla/adla0/device/debug/utilization node, I’ve encountered an inconsistent data pattern. Even during continuous inference, the utilization values fluctuate like this: 34% -> 0% -> 0% -> 0% -> 34% ...

It appears as though the NPU only reports activity in short bursts followed by several “0%” cycles, even though the workload should be steady.

I have two main questions:

  1. Why are there so many 0% values? Is this a sampling/aliasing issue caused by the driver’s 300ms dpm_period, or is the NPU hardware actually entering an idle/sleep state between frames due to software stack overhead?

  2. How can I improve/saturate NPU utilization? My FPS for yolov8s is around 26, but the NPU seems to be idling frequently. Are there specific driver tweaks, batching methods, or multi-threading strategies for the new VIM4 to keep the NPU more consistently active?

For context, I have checked /proc/interrupts and the interrupt counts are increasing steadily, which confirms the hardware is processing tasks. However, the utilization reporting remains fragmented.
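To help rule out sampling effects, here is a small sketch that polls the utilization node faster than the driver's ~300 ms dpm_period and reports min/avg/max. This is an assumption-laden helper, not part of KSNN: it assumes the node prints a bare percentage such as `34` or `34%`, and `read_fn` is a hypothetical hook so the reader can substitute their own reading logic.

```python
import time

def sample_utilization(node="/sys/class/adla/adla0/device/debug/utilization",
                       interval=0.05, duration=3.0, read_fn=None):
    """Poll the NPU utilization node faster than the driver's ~300 ms
    dpm_period and report (min, avg, max). Frequent 0% samples alongside a
    healthy max suggest aliasing between the reporting window and the
    inference bursts rather than a truly idle NPU.

    Assumption: the node prints a percentage like '34' or '34%'.
    `read_fn` is a test hook that bypasses the sysfs read.
    """
    if read_fn is None:
        def read_fn():
            with open(node) as f:
                return int(f.read().strip().rstrip("%"))
    samples = []
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        samples.append(read_fn())   # one sample per interval
        time.sleep(interval)
    return min(samples), sum(samples) / len(samples), max(samples)
```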

Any advice from the community or the Khadas team would be very helpful!

@manjookim Can you share your code? Not all of the code runs on the NPU: the scalar pre-processing and post-processing operations are performed on the CPU, and that time in between inference runs is reported as NPU idle time.

Hello, @Electr1

import numpy as np
import os
import argparse
import json
import cv2 as cv
from ksnn.api import KSNN
import time
import sys

OBJ_THRESH = 0.001 
NMS_THRESH = 0.45
mean = [0, 0, 0]
var = [255]
NUM_CLS = 80
num_images = 100

constant_matrix = np.array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]]).T

output_dir = "detection_results_vis"
os.makedirs(output_dir, exist_ok=True)

def sigmoid(x): return 1 / (1 + np.exp(-x))

def softmax(x, axis=0):
    x = np.exp(x)
    return x / x.sum(axis=axis, keepdims=True)

def letterbox(img, new_shape=(640, 640), color=(114, 114, 114)):
    shape = img.shape[:2] # [height, width]
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    
    new_unpad = (int(round(shape[1] * r)), int(round(shape[0] * r)))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]
    
    dw /= 2
    dh /= 2
    
    if shape[::-1] != new_unpad:
        img = cv.resize(img, new_unpad, interpolation=cv.INTER_LINEAR)
    
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    
    img = cv.copyMakeBorder(img, top, bottom, left, right, cv.BORDER_CONSTANT, value=color)
    return img, r, (left, top)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--library", required=True)
    parser.add_argument("--model", required=True)
    parser.add_argument("--dataset", required=True)
    args = parser.parse_args()

    yolov8 = KSNN('VIM4')
    yolov8.nn_init(library=args.library, model=args.model, level=0)

    results_json = []
    img_list = [f for f in os.listdir(args.dataset) if f.endswith('.jpg')][:num_images]

    total = 0

    for idx, img_name in enumerate(img_list):
        picture = os.path.join(args.dataset, img_name)
        image_id = int(img_name.split('.')[0])

        orig_img = cv.imread(picture, cv.IMREAD_COLOR)
        h, w = orig_img.shape[:2]
        img_pad, ratio, (pad_left, pad_top) = letterbox(orig_img, (640, 640))

        img_rgb = cv.cvtColor(img_pad, cv.COLOR_BGR2RGB)

        start = time.time()
        
        data = yolov8.nn_inference(img_rgb, input_shape=(640, 640, 3), input_type="RGB", 
                                   output_shape=[(40, 40, 144), (80, 80, 144), (20, 20, 144)],  #DET
                                   #output_shape=[(8400,116), (160,160,32)], #SEG
                                   #output_shape=[(8400,56)], #Pose
                                   output_type="FLOAT")
        
        end = time.time()

        total += (end - start)
    
    print(f"Avg Inference time per image : {(total / len(img_list))*1000:.2f} ms")
    print(f"FPS : {len(img_list)/total:.2f}")

    yolov8.nn_destory_network()
    
    sys.exit(0)

This is the inference code I’ve been running on my new VIM4.

The model I am using is YOLOv8s, which has been converted for the RGB input format.

@manjookim All the lines highlighted in the following code block are part of the pre-processing. They run on the CPU, and during that time the NPU is inactive.

for idx, img_name in enumerate(img_list):
+       picture = os.path.join(args.dataset, img_name)
+       image_id = int(img_name.split('.')[0])

+       orig_img = cv.imread(picture, cv.IMREAD_COLOR)
+       h, w = orig_img.shape[:2]
+       img_pad, ratio, (pad_left, pad_top) = letterbox(orig_img, (640, 640))
+
+       img_rgb = cv.cvtColor(img_pad, cv.COLOR_BGR2RGB)

        start = time.time()
        
        data = yolov8.nn_inference(img_rgb, input_shape=(640, 640, 3), input_type="RGB", 
                                   output_shape=[(40, 40, 144), (80, 80, 144), (20, 20, 144)],  #DET
                                   #output_shape=[(8400,116), (160,160,32)], #SEG
                                   #output_shape=[(8400,56)], #Pose
                                   output_type="FLOAT")
        
        end = time.time()

        total += (end - start)
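To quantify how much of each loop iteration the NPU spends idle, you could time both stages separately. This is just a sketch: `preprocess` and `infer` are placeholders standing in for the cv2 letterbox/cvtColor steps and the KSNN nn_inference call from the code above.

```python
import time

def timed_loop(items, preprocess, infer):
    """Accumulate wall time per stage to see what fraction of each loop
    iteration is CPU preprocessing (NPU idle) versus inference.
    `preprocess` and `infer` are stand-ins for the cv2 steps and the
    KSNN nn_inference call."""
    pre_t = inf_t = 0.0
    for item in items:
        t0 = time.perf_counter()
        frame = preprocess(item)      # CPU-only work: imread/letterbox/cvtColor
        t1 = time.perf_counter()
        infer(frame)                  # NPU work: nn_inference
        t2 = time.perf_counter()
        pre_t += t1 - t0
        inf_t += t2 - t1
    total = pre_t + inf_t
    return pre_t / total, inf_t / total  # fraction of time spent in each stage
```

If the preprocessing fraction is large, that idle time alone can explain the 0% readings.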

However, there is a solution if you want to maximize NPU throughput: pipeline the processing. Use one thread to preprocess images and store them in a buffer/list/queue, and another thread to fetch from the buffer and run inference on the NPU. This ensures the NPU is constantly fed data to work with and will give you a faster overall processing time.
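A minimal sketch of that producer/consumer pipeline, with `preprocess` and `infer` as placeholders for the cv2 steps and the nn_inference call (a sentinel `None` marks the end of the stream, so this assumes `preprocess` never returns `None` itself):

```python
import queue
import threading

def run_pipeline(items, preprocess, infer, depth=4):
    """Producer/consumer pipeline: a worker thread preprocesses frames on the
    CPU while this thread runs inference, so the NPU is not left waiting on
    preprocessing. `preprocess`/`infer` stand in for the letterbox + cvtColor
    steps and the KSNN nn_inference call."""
    buf = queue.Queue(maxsize=depth)   # bounded queue caps memory use

    def producer():
        for item in items:
            buf.put(preprocess(item))  # CPU work overlaps with inference
        buf.put(None)                  # sentinel: no more frames

    t = threading.Thread(target=producer, daemon=True)
    t.start()

    results = []
    while True:
        frame = buf.get()
        if frame is None:
            break
        results.append(infer(frame))   # NPU work
    t.join()
    return results

# Example with dummy stages standing in for the real preprocessing/inference:
# run_pipeline(range(5), preprocess=lambda x: x * 2, infer=lambda x: x + 1)
```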

Also, I believe KSNN supports batch processing, so multiple images could be processed with one call to ksnn.nn_inference, but that needs to be verified.


@Electr1 Thanks for the help! I’ll take your advice and dive deeper into the code to optimize the pipeline further.


This topic was automatically closed 2 days after the last reply. New replies are no longer allowed.