OpenCL support for NPU on Vim3

Hi,

I tried using the NPU on the Vim3 Basic through the OpenCL libraries. After changing the symbolic links, I got the following from clinfo:

$ clinfo
	Number of platforms                               1
	  Platform Name                                   Vivante OpenCL Platform
	  Platform Vendor                                 Vivante Corporation
	  Platform Version                                OpenCL 3.0 V6.4.8.7.415784
	  Platform Profile                                FULL_PROFILE
	  Platform Extensions                             cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_il_program cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
	  Platform Host timer resolution                  0ns
	
	  Platform Name                                   Vivante OpenCL Platform
	Number of devices                                 1
	  Device Name                                     Vivante OpenCL Device VIPNano-QI.7120.0000
	  Device Vendor                                   Vivante Corporation
	  Device Vendor ID                                0x564956
	  Device Version                                  OpenCL 3.0
	  Driver Version                                  OpenCL 3.0 V6.4.8.7.415784
	  Device OpenCL C Version                         OpenCL C 1.2
	  Device Type                                     GPU
	  Device Profile                                  FULL_PROFILE
	  Device Available                                Yes
	  Compiler Available                              Yes
	  Linker Available                                Yes
	  Max compute units                               1
	  Max clock frequency                             800MHz
	  Device Partition                                (core)
	    Max number of sub-devices                     0
	    Supported partition types                     None
	    Supported affinity domains                    (n/a)
	  Max work item dimensions                        3
	  Max work item sizes                             256x256x256
	  Max work group size                             256
	  Preferred work group size multiple              4
	  Max sub-groups per work group                   0
	  Preferred / native vector sizes
	    char                                                 4 / 4
	    short                                                4 / 4
	    int                                                  4 / 4
	    long                                                 4 / 4
	    half                                                 4 / 4        (cl_khr_fp16)
	    float                                                4 / 4
	    double                                               0 / 0        (n/a)
	  Half-precision Floating-point support           (cl_khr_fp16)
	    Denormals                                     No
	    Infinity and NANs                             Yes
	    Round to nearest                              Yes
	    Round to zero                                 Yes
	    Round to infinity                             No
	    IEEE754-2008 fused multiply-add               No
	    Support is emulated in software               No
	  Single-precision Floating-point support         (core)
	    Denormals                                     No
	    Infinity and NANs                             Yes
	    Round to nearest                              Yes
	    Round to zero                                 Yes
	    Round to infinity                             No
	    IEEE754-2008 fused multiply-add               No
	    Support is emulated in software               No
	    Correctly-rounded divide and sqrt operations  No
	  Double-precision Floating-point support         (n/a)
	  Address bits                                    32, Little-Endian
	  Global memory size                              268435456 (256MiB)
	  Error Correction support                        Yes
	  Max memory allocation                           134217728 (128MiB)
	  Unified memory for Host and Device              Yes
	  Shared Virtual Memory (SVM) capabilities        (core)
	    Coarse-grained buffer sharing                 No
	    Fine-grained buffer sharing                   No
	    Fine-grained system sharing                   No
	    Atomics                                       No
	  Minimum alignment for any data type             128 bytes
	  Alignment of base address                       2048 bits (256 bytes)
	  Preferred alignment for atomics
	    SVM                                           0 bytes
	    Global                                        0 bytes
	    Local                                         0 bytes
	  Max size for global variable                    0
	  Preferred total size of global vars             0
	  Global Memory cache type                        Read/Write
	  Global Memory cache size                        16384 (16KiB)
	  Global Memory cache line size                   64 bytes
	  Image support                                   Yes
	    Max number of samplers per kernel             16
	    Max size for 1D images from buffer            65536 pixels
	    Max 1D or 2D image array size                 8192 images
	    Max 2D image size                             8192x8192 pixels
	    Max 3D image size                             8192x8192x8192 pixels
	    Max number of read image args                 128
	    Max number of write image args                8
	    Max number of read/write image args           0
	  Max number of pipe args                         0
	  Max active pipe reservations                    0
	  Max pipe packet size                            0
	  Local memory type                               Global
	  Local memory size                               32768 (32KiB)
	  Max number of constant args                     9
	  Max constant buffer size                        65536 (64KiB)
	  Max size of kernel argument                     1024
	  Queue properties (on host)
	    Out-of-order execution                        Yes
	    Profiling                                     Yes
	  Queue properties (on device)
	    Out-of-order execution                        No
	    Profiling                                     No
	    Preferred size                                0
	    Max size                                      0
	  Max queues on device                            0
	  Max events on device                            0
	  Prefer user sync for interop                    Yes
	  Profiling timer resolution                      1000ns
	  Execution capabilities
	    Run OpenCL kernels                            Yes
	    Run native kernels                            No
	    Sub-group independent forward progress        No
	    IL version                                    SPIR-V_1.5
	  printf() buffer size                            1048576 (1024KiB)
	  Built-in kernels                                (n/a)
	  Device Extensions                               cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_il_program cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics
	
	NULL platform behavior
	  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  No platform
	  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [P0]
	  clCreateContext(NULL, ...) [default]            Success [P0]
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_DEFAULT)  Success (1)
	    Platform Name                                 Vivante OpenCL Platform
	    Device Name                                   Vivante OpenCL Device VIPNano-QI.7120.0000
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
	    Platform Name                                 Vivante OpenCL Platform
	    Device Name                                   Vivante OpenCL Device VIPNano-QI.7120.0000
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
	  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
	    Platform Name                                 Vivante OpenCL Platform
	    Device Name                                   Vivante OpenCL Device VIPNano-QI.7120.0000
	        NOTE:   your OpenCL library only supports OpenCL 2.2,
	                but some installed platforms support OpenCL 3.0.
	                Programs using 3.0 features may crash
	                or behave unexpectedly
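For anyone comparing setups, the fields that matter here can be pulled out of the clinfo text with a few lines of Python. This is only a minimal sketch: it assumes the layout shown above (key and value separated by two or more spaces), and the helper name `parse_clinfo` is hypothetical, not part of any tool.

```python
import re

def parse_clinfo(text):
    """Collect 'key   value' pairs from clinfo text; the first occurrence of a key wins."""
    fields = {}
    for line in text.splitlines():
        # Key and value are assumed to be separated by a run of 2+ whitespace characters.
        parts = re.split(r"\s{2,}", line.strip(), maxsplit=1)
        if len(parts) == 2 and parts[0] not in fields:
            fields[parts[0]] = parts[1]
    return fields

# A few lines copied from the clinfo output above.
sample = """
  Platform Version                                OpenCL 3.0 V6.4.8.7.415784
  Device Name                                     Vivante OpenCL Device VIPNano-QI.7120.0000
  Device Version                                  OpenCL 3.0
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
"""

info = parse_clinfo(sample)
for key in ("Device Name", "Device Type", "Device Version", "Device OpenCL C Version"):
    print(f"{key}: {info.get(key)}")
```

Two things stand out in the listing: the device is reported with type GPU rather than ACCELERATOR, and the OpenCL C version is 1.2 even though the platform advertises OpenCL 3.0, which may matter for kernels that rely on newer language features.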

I could also run some OpenCL applications for matrix manipulation, and they worked fine. However, when I ran the TensorFlow benchmark tool, performance was far worse on the NPU (0.5-1 FPS) than on the GPU (10 FPS). Also, with some models (such as MediaPipe BlazePose), I got the following error, which corresponds to CL_BUILD_PROGRAM_FAILURE:

ERROR: Failed to build program executable - Build program failure(82:0) : error : syntax error at ‘[’
(82:0) : error : syntax error at ‘[’
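The (82:0) prefix looks like a line:column position in the kernel source, though that is an assumption about the compiler's log format rather than something documented here. Under that assumption, a small sketch like the one below could extract the position and print the matching line of the kernel. The kernel source that the benchmark tool passes to the compiler is not shown in this thread, so a placeholder is used, and the helper names are hypothetical.

```python
import re

def error_positions(build_log):
    """Return unique (line, column) pairs, assuming '(line:column) : error' entries."""
    positions = []
    for match in re.finditer(r"\((\d+):(\d+)\)\s*:\s*error", build_log):
        pos = (int(match.group(1)), int(match.group(2)))
        if pos not in positions:
            positions.append(pos)
    return positions

def source_line(kernel_source, line_no):
    """Return the 1-based line of the kernel source, or None if out of range."""
    lines = kernel_source.splitlines()
    return lines[line_no - 1] if 1 <= line_no <= len(lines) else None

# The log text from the post (with plain quotes around the bracket).
build_log = (
    "ERROR: Failed to build program executable - Build program failure"
    "(82:0) : error : syntax error at '['\n"
    "(82:0) : error : syntax error at '['\n"
)

# Placeholder only: the real kernel source is not shown in this thread.
kernel_source = "\n".join(f"// placeholder line {n}" for n in range(1, 101))

for line_no, column in error_positions(build_log):
    print(f"line {line_no}, column {column}: {source_line(kernel_source, line_no)}")
```

With the real kernel text in place of the placeholder, this would show which construct the compiler rejects at the reported position.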

The same tool was used for both OpenCL-GPU and OpenCL-NPU; the only difference is the backend, i.e. the OpenCL libraries and the NPU.

  • Could you please clarify whether OpenCL support for the NPU on the Vim3 is intact?
  • Is there some limitation in the OpenCL support for the Vivante NPU that causes this poor performance?

There is no OpenCL support for NPU.

@numbqq Could you please elaborate on the lack of OpenCL support?

  • VeriSilicon seems to offer OpenCL support for its NPUs, as per: Vivante NPU IP
  • I was able to use one of the libOpenCL.so libraries to get clinfo output and to run CL kernel programs on the NPU

So, is it that the Khadas Vim3 does not have OpenCL support, or is some support lacking from VeriSilicon? Is there any plan to support OpenCL on the NPU in the near future?