Operation times usage profiling reports excessively high time values

I have carried out profiling (operation times usage) for yolov3 and yolov3-tiny with the latest SDK and the latest VIM3 image, as recommended by @Frank here.
Excerpt from the yolov3 profiling log [1]:

....
execution time:            550277 us
Run the 1 time: 42611.00ms or 42611240.00us
vxProcessGraph execution time:
Total   42611.00ms or 42611256.00us
Average 42611.26ms or 42611256.00us
 --- Top5 ---
 12: 71494644084506624.000000
168: 71494644084506624.000000
182: 71494644084506624.000000
183: 71494644084506624.000000
184: 71494644084506624.000000
turned powerManagement on for CNN_PERF=1

Excerpt from the yolov3-tiny profiling log [2]:

....
execution time:            546441 us
Run the 1 time: 16074.00ms or 16074536.00us
vxProcessGraph execution time:
Total   16074.00ms or 16074552.00us
Average 16074.55ms or 16074552.00us
 --- Top5 ---
15305: 7.750000
15292: 6.250000
3668: 5.750000
3669: 5.500000
3655: 4.500000

We really wonder why the execution times are so long (i.e. about 16 s for yolov3-tiny and 42.6 s for standard yolov3, per the logs above). Either the profiling itself is extending the computation, or the networks are much slower than expected, which does not make much sense because the black-box network achieves much higher frame rates (up to 25 fps), so we should see execution times below 40 ms / 0.04 s.

Do you think this is normal? Could you please look into this and advise whether there is anything recommended to improve these times?
@Frank @numbqq
Regards,
Sajjad

For comparison, running the same models through the minimal generated code (without profiling) gives the following timings:

============================================
yolov3 minimal execution
============================================
khadas@Khadas:~/hussain/novelsense_git/khadas/nbg/sdk_generated_nbg_compilable/yolov3_nbg_unify_416_hussainA/bin_r$ ./yolotiny ../yolov3_88.nb ~/hussain/img/416/person_416.jpg
Create Neural Network: 2869ms or 2869746us
Verify...
Verify Graph: 7ms or 7829us
Start run graph [1] times...
Run the 1 time: 94.00ms or 94545.00us
vxProcessGraph execution time:
Total   94.00ms or 94600.00us
Average 94.60ms or 94600.00us
 --- Top5 ---
15305: 10.500000
15306: 8.250000
15292: 7.750000
3668: 7.000000
15142: 7.000000

And

============================================
yolov3-tiny minimal execution
============================================
khadas@Khadas:~/hussain/novelsense_git/khadas/nbg/sdk_generated_nbg_compilable/yolov3_nbg_unify_416_hussainA/bin_r$ cd ../../yolotiny_nbg_unify_416_hussainA/bin_r/
khadas@Khadas:~/hussain/novelsense_git/khadas/nbg/sdk_generated_nbg_compilable/yolotiny_nbg_unify_416_hussainA/bin_r$ chmod +x ./yolotiny
khadas@Khadas:~/hussain/novelsense_git/khadas/nbg/sdk_generated_nbg_compilable/yolotiny_nbg_unify_416_hussainA/bin_r$ ./yolotiny ../yolotiny_88.nb ~/hussain/img/416/person_416.jpg
Create Neural Network: 501ms or 501026us
Verify...
Verify Graph: 0ms or 704us
Start run graph [1] times...
Run the 1 time: 4.00ms or 4655.00us
vxProcessGraph execution time:
Total   4.00ms or 4692.00us
Average 4.69ms or 4692.00us
 --- Top5 ---
15305: 7.750000
15292: 6.250000
3668: 5.750000
3669: 5.500000
3655: 4.500000

Here the execution is much faster than when it is run through profiling.
One thing that differs, according to the Khadas documentation, is that:

  1. the minimal generated NBG code is executed using .nb files,
  2. while profiling uses export.data files.

Please advise, @Frank.
Regards,
Sajjad

@enggsajjad In the latest version of the SDK I have removed support for this tool. The results it produces are abnormal and cannot be used as an indicator. Maybe you can add power off from the code to judge the running time.


Is it something inside the code? It would be really helpful if you could guide me on how to add power off from the code to judge the running time.
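Just to make sure I understand: is it something like the following? This is only a minimal sketch of what I am assuming you mean, timing the vxProcessGraph() call directly with gettimeofday(); the graph handle and its creation/verification come from the generated case code and are omitted here.

#include <stdio.h>
#include <sys/time.h>
#include <VX/vx.h>

/* Minimal sketch (my assumption): time only the inference call itself,
   independently of the profiling tool. `graph` is the vx_graph that the
   generated case code has already created and verified. */
static void run_and_time(vx_graph graph)
{
    struct timeval start, end;
    vx_status status;

    gettimeofday(&start, NULL);
    status = vxProcessGraph(graph);   /* the actual inference */
    gettimeofday(&end, NULL);

    if (status != VX_SUCCESS) {
        printf("vxProcessGraph failed: %d\n", (int)status);
        return;
    }

    double elapsed_us = (end.tv_sec - start.tv_sec) * 1e6
                      + (end.tv_usec - start.tv_usec);
    printf("vxProcessGraph took %.3f ms\n", elapsed_us / 1000.0);
}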
Regards,
Sajjad

@enggsajjad

You can follow this doc; it should give you some useful info.


Following the above document, I set VIV_VX_PROFILE=1 VIV_VX_DEBUG_LEVEL=1, which produces the following output (only the last portion is shown):

....
***********dump layer :34***************
layer_id: 21 layer name:ConvolutionLayer operation[1]:VXNNE_OPERATOR_CONVOLUTION target:VXNNE_OPERATION_TARGET_SH.
uid: 45
abs_op_id: 34
shader kernel name: gemm_dfp_I8toI8
execution time:           3747830 us
[   273] TOTAL_READ_BANDWIDTH  (MByte): 13.016649
[   274] TOTAL_WRITE_BANDWIDTH (MByte): 0.177062
[   275] AXI_READ_BANDWIDTH  (MByte): 0.000000
[   276] AXI_WRITE_BANDWIDTH (MByte): 0.000000
[   277] DDR_READ_BANDWIDTH (MByte): 13.016649
[   278] DDR_WRITE_BANDWIDTH (MByte): 0.177062
[   279] GPUTOTALCYCLES: 2998308511
[   280] GPUIDLECYCLES: 2981068030
VPC_ELAPSETIME: 3748538
*********
Run the 1 time: 53516.00ms or 53516784.00us
vxProcessGraph execution time:
Total   53516.00ms or 53516868.00us
Average 53516.87ms or 53516868.00us
 --- Top5 ---
 12: 71494644084506624.000000
168: 71494644084506624.000000
182: 71494644084506624.000000
183: 71494644084506624.000000
184: 71494644084506624.000000
khadas@Khadas:~/hussain/op_test_yolov3_minimal_using_export.data1/bin_r_cv4$

So, VPC_ELAPSETIME: 3748538 is the execution time (3748.538ms)?

@enggsajjad This result looks incorrect. My judgement is that there is a problem with the model conversion.

Here it is using the same model, and the detection results are also correct. Do you still think that the model is wrongly converted?
Regards,
Sajjad

@enggsajjad Normally this process does not take so long; maybe there is something wrong with the code.

So you mean something is wrong with your code? Meanwhile, we will try to convert and profile with an older/stable version of the SDK.

@enggsajjad There’s no problem with the SDK, I think it’s your conversion code or your processing that’s wrong. I don’t know what changes you made.

Sorry, but I have not made any changes; I just followed the steps mentioned here:

@enggsajjad I haven’t reproduced the problem here, and you don’t have this problem when you convert other models, so you may still need to find the reason yourself.

Hi @Frank,

I just wanted to join the discussion. My team and I are also looking into profiling on Khadas in order to optimize our networks for the architecture using execution-time profiling.

I am seeing behavior similar to that described by @enggsajjad, but with a very simple network (mnist).

  • The execution time without profiling is 2-4 ms.
  • The execution time with profiling is about 5x longer, clocking in at around 8-10 ms.

I was wondering whether the different times are related to the conversion options we comment out to make the profiling work, i.e. removing the following lines from 2_export_case_code.sh:

--optimize VIPNANOQI_PID0X88
--viv-sdk ${ACUITY_PATH}vcmdtools
--pack-nbg-unify

I also tried including the optimize option, which yields a working output.
However, that did not change the longer runtime for profiling.

Now, a longer execution time makes sense when a lot of additional (profiling) information is shuffled around the bus etc. However, since we don't understand the relationship between the profiling time and the true time, we cannot derive any meaningful conclusions from it.
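For context, our comparison boils down to a loop like the one below, run once with and once without VIV_VX_PROFILE=1 set in the environment. This is only a minimal sketch under our own assumptions: graph is our already-created and verified mnist graph, and the run count is an arbitrary choice of ours.

#include <stdio.h>
#include <time.h>
#include <VX/vx.h>

/* Sketch of our measurement: average several runs of the same graph so that
   a single slow iteration does not dominate. `graph` is our already-verified
   mnist graph; the number of runs is arbitrary. */
static double average_run_ms(vx_graph graph, int runs)
{
    struct timespec t0, t1;
    double total_ms = 0.0;

    for (int i = 0; i < runs; ++i) {
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (vxProcessGraph(graph) != VX_SUCCESS) {
            printf("run %d failed\n", i);
            return -1.0;
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        total_ms += (t1.tv_sec - t0.tv_sec) * 1e3
                  + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    }
    return total_ms / runs;
}

We then compare the two averages to estimate the overhead factor introduced by profiling.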

We would be super thankful if you could shed some light on this. I can also provide the mnist code if that is helpful for you.

(I am using kernel 4.9.241 from Dec 17, 2021 and the associated firmware ISO.)

Hi @Frank, is this time difference happening because these parameters are not used when generating the code for profiling? Please advise!
@numbqq

@enggsajjad These parameters are for the firmware and do not need to be modified; for VIM3 it is 88.