Error in operation-wise usage times

Nevermind. The shell had restarted and that was causing this error. I exported the environment variables again. It’s working fine now

I had one doubt though. Do I have to reload the NPU module (galcore) every time I set the environment variables?

@johndoe In the step reference I gave you, there is no need to reload Galcore
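To make that concrete, here is a minimal sketch of re-exporting the profiling variables in a fresh shell. The paths are placeholders (assumptions, not the real install locations), and galcore stays loaded throughout, presumably because these variables only configure the user-space stack.

```shell
# Re-export the NPU profiling variables after a shell restart.
# The SDK/driver paths below are placeholders, not the real install locations.
export VIVANTE_SDK_DIR=/path/to/vcmdtools
export LD_LIBRARY_PATH=/path/to/drivers
export VIV_VX_DEBUG_LEVEL=1
export VIV_VX_PROFILE=1

# No rmmod/insmod of galcore is needed; just run the binary again.
echo "VIV_VX_PROFILE=$VIV_VX_PROFILE"
```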


@Frank
Heavier models (ResNet-50, 101, 152) are now hitting a segmentation fault

On this point, the “NN Tool FAQ” doc states:

If the model is large, you need to carefully set the environment variables to obtain
the results of each layer

What tweaks do I need to make to the environment variables to generate outputs successfully in such cases?

@johndoe If your model is too large, this is not recommended. If you know how to cut the model, you can cut it into several small models and test again


So this method works only for small models?

Are there any particular steps for cutting a model and running this same analysis on them?

@johndoe I am not sure whether this can only be run on a small model, but this kind of operation is certainly very resource-intensive and may run out of resources.

I have no specific solution


So I guess this same thing would be better supported on VIM4, since everything else is constant except the hardware bottlenecks, right?

@johndoe Information about VIM4 is still confidential, you can consult @Gouwa


Hi @Frank
One small thing. Is it possible to get TOTAL_READ_BANDWIDTH, TOTAL_WRITE_BANDWIDTH, AXI_READ_BANDWIDTH, AXI_WRITE_BANDWIDTH, DDR_READ_BANDWIDTH, DDR_WRITE_BANDWIDTH, GPUTOTALCYCLES, GPUIDLECYCLES values for each layer?

I intend to understand the utilisation of the NPU (TP, NN) for various operations (convolution, pooling, etc.)

There is no such interface, and statistics are not realistic

I set VIV_VX_PROFILE=1 and was able to get these values on a per-layer basis. But there are certain inconsistencies:

  • The AXI read/write bandwidth is always 0
  • The VPC elapsed time keeps growing over time
  • GPU idle cycles / GPU total cycles is high (0.99, which hints at very low utilisation)
  • The vxProcessGraph time is significantly higher than the inference time reported for the .nb file
  • Assuming the outputs from inference on the .nb file cover the entire graph, the layer-wise counter values should add up to those totals, but they don't

I’m attaching a link to the generated logs
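As a quick sanity check on that last point, the per-layer execution times in a profiler log can be summed and compared against the vxProcessGraph total. This is only a sketch: the two sample lines below are fabricated stand-ins for a real VIV_VX_PROFILE log, and the field positions are assumed to match the log format shown later in this thread.

```shell
# Sum the per-layer "execution time: ... us" lines from a profiler log.
# The sample log below is fabricated for illustration.
cat > /tmp/profile.log <<'EOF'
execution time:             20613 us
execution time:              1200 us
EOF

awk '/execution time:/ { sum += $(NF-1) }
     END { printf "sum of layer times: %d us\n", sum }' /tmp/profile.log
```

If the summed layer times fall well short of the vxProcessGraph total, that gap is itself a data point for the discrepancy described above.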

@johndoe When you run like this, you are not only timing the graph but also counting bandwidth, which puts a lot of pressure on the running of the model. I don’t think you can get all the data correctly. Counting the bandwidth of each layer is neither practical nor accurate; many layers will be reused

Ohh… got it! Then I suppose the returned values aren’t completely valid, since running an inference with layer-wise stats takes 3-4x longer than a normal inference without any dump generation. That overhead is probably reflected in the time and bandwidth values

@johndoe I am not sure about that. Maybe I will try it next week


Thanks a lot @Frank !

Hi @Frank

When I run inference using the .nb files with VIV_VX_DEBUG_LEVEL and VIV_VX_PROFILE set to 1, I get a line that holds the execution time (just before the total-bandwidth lines)

D [setup_node:441]Setup node id[0] uid[0] op[NBG]
D [print_tensor:146]in(0) : id[   1] vtl[0] const[0] shape[ 3, 299, 299, 1   ] fmt[u8 ] qnt[ASM zp=128, scale=0.007812]
D [print_tensor:146]out(0): id[   0] vtl[0] const[0] shape[ 1001, 1          ] fmt[f16] qnt[NONE]
D [optimize_node:385]Backward optimize neural network
D [optimize_node:392]Forward optimize neural network
I [compute_node:327]Create vx node
D [compute_node:350]Instance node[0] "NBG" ...
Create Neural Network: 43ms or 43330us
Verify...
generate command buffer, total device count=1, core count per-device: 1, 
current device id=0, AXI SRAM base address=0xff000000
---------------------------Begin VerifyTiling -------------------------
AXI-SRAM = 1048576 Bytes VIP-SRAM = 522240 Bytes SWTILING_PHASE_FEATURES[1, 1, 0]
  0 NBG [(   0    0    0 0,        0, 0x(nil)(0x(nil), 0x(nil)) ->    0    0    0 0,        0, 0x(nil)(0x(nil), 0x(nil))) k(0 0    0,        0) pad(0 0) pool(0 0, 0 0)]

 id IN [ x  y  w   h ]   OUT  [ x  y  w  h ] (tx, ty, kpc) (ic, kc, kc/ks, ks/eks, kernel_type)
   0 NBG DD 0x(nil) [   0    0        0        0] -> DD 0x(nil) [   0    0        0        0] (  0,   0,   0) (       0,        0, 0.000000%, 0.000000%, NONE)

PreLoadWeightBiases = 1048576  100.000000%
---------------------------End VerifyTiling -------------------------
Verify Graph: 2ms or 2050us
Start run graph [1] times...
D [_check_swapped_tensors:93]Check swapped tensors
layer_id: 0 layer name:network_binary_graph operation[0]:unkown operation type target:unkown operation target.
uid: 0
abs_op_id: 0
execution time:             20613 us
[     1] TOTAL_READ_BANDWIDTH  (MByte): 67.685927
[     2] TOTAL_WRITE_BANDWIDTH (MByte): 18.234354
[     3] AXI_READ_BANDWIDTH  (MByte): 30.711409
[     4] AXI_WRITE_BANDWIDTH (MByte): 15.228935
[     5] DDR_READ_BANDWIDTH (MByte): 36.974518
[     6] DDR_WRITE_BANDWIDTH (MByte): 3.005419
[     7] GPUTOTALCYCLES: 16536584
[     8] GPUIDLECYCLES: 296902
VPC_ELAPSETIME: 20950
*********
Run the 1 time: 21.00ms or 21542.00us
vxProcessGraph execution time:
Total   21.00ms or 21567.00us
Average 21.57ms or 21567.00us
 --- Top5 ---
208: 0.716797
209: 0.041107
223: 0.016205
185: 0.006828
268: 0.005978
Exit VX Thread: 0x811b81b0
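Incidentally, the idle/total ratio in this particular run is quite low, which would indicate high utilisation rather than the 0.99 seen in the per-layer dumps. A quick check on the counters above (treating utilisation as 1 - idle/total is my assumption of how these counters relate):

```shell
# Compute NPU utilization from the GPUTOTALCYCLES / GPUIDLECYCLES counters
# printed in the log above.
total=16536584   # GPUTOTALCYCLES
idle=296902      # GPUIDLECYCLES
awk -v t="$total" -v i="$idle" \
    'BEGIN { printf "NPU utilization: %.1f%%\n", (1 - i / t) * 100 }'
# prints "NPU utilization: 98.2%"
```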

But upon running the same thing using the created executable, I get

[getVXCKernelInfo(60)] Failed to open library libNNVXCBinary.so.

And when I set these as my environment variables:

export VIVANTE_SDK_DIR=/home/khadas/Just_for_get_op_time/data/vcmdtools
export LD_LIBRARY_PATH=/home/khadas/Just_for_get_op_time/data/drivers_64_exportdata
export VIV_VX_DEBUG_LEVEL=1
export VIV_VX_PROFILE=1
export CNN_PERF=0
export NN_LAYER_DUMP=0

I don’t get the execution time


PreLoadWeightBiases = 66304  6.323242%
---------------------------End VerifyTiling -------------------------
Verify Graph: 4162ms or 4162929us
Start run graph [1] times...
[     1] TOTAL_READ_BANDWIDTH  (MByte): 6.413972
[     2] TOTAL_WRITE_BANDWIDTH (MByte): 3.550582
[     3] AXI_READ_BANDWIDTH  (MByte): 4.967385
[     4] AXI_WRITE_BANDWIDTH (MByte): 3.549606
[     5] DDR_READ_BANDWIDTH (MByte): 1.446588
[     6] DDR_WRITE_BANDWIDTH (MByte): 0.000977
[     7] GPUTOTALCYCLES: 1954799
[     8] GPUIDLECYCLES: 496150
VPC_ELAPSETIME: 2670
*********
Run the 1 time: 3.00ms or 3097.00us
vxProcessGraph execution time:
Total   3.00ms or 3114.00us
Average 3.11ms or 3114.00us
 --- Top5 ---
719: 15.523582
 17: 14.951283
611: 14.808209
971: 14.450522
898: 14.235911
Exit VX Thread: 0x964071b0

Post your execution command

./inceptionv3 inception_v3.nb ../dog_299x299.jpg

@johndoe

Those steps use online compilation. If you compare them with my steps, you will see it is not possible to use .nb files

Yes. When I use these environment variables and try running inference via the .export.data file, I don’t get the execution time

khadas@Khadas:~/aml_npu_app/detect_library/squeezenet1.1/bin_r_cv4$ ./detect_squeezenet1.1 ../squeezenet1.1.export.data ~/ksnn/examples/onnx/data/goldfish_224x224.jpg