I have carried out profiling (per-operation time usage) for yolov3 and yolov3-tiny with the recent SDK and the recent VIM3 image, as recommended by @Frankhere. Excerpt from the yolov3 profiling log[1]:
....
execution time: 550277 us
Run the 1 time: 42611.00ms or 42611240.00us
vxProcessGraph execution time:
Total 42611.00ms or 42611256.00us
Average 42611.26ms or 42611256.00us
--- Top5 ---
12: 71494644084506624.000000
168: 71494644084506624.000000
182: 71494644084506624.000000
183: 71494644084506624.000000
184: 71494644084506624.000000
turned powerManagement on for CNN_PERF=1
....
execution time: 546441 us
Run the 1 time: 16074.00ms or 16074536.00us
vxProcessGraph execution time:
Total 16074.00ms or 16074552.00us
Average 16074.55ms or 16074552.00us
--- Top5 ---
15305: 7.750000
15292: 6.250000
3668: 5.750000
3669: 5.500000
3655: 4.500000
We really wonder why the execution times are so long (i.e. about 16 s for yolov3-tiny and about 42.6 s for standard yolov3, going by the logs above). Either the profiling is extending the computation, or the nets are much slower than expected, which doesn't make much sense, as the black-box network reaches much higher frame rates (up to 25 fps); at that rate we should see execution times of around 40 ms (0.04 s) per frame.
Do you think this is normal? Can you please look at this and advise whether there is anything recommended to improve this time? @Frank @numbqq
Regards,
Sajjad
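To put the numbers from the two log excerpts side by side with the 25 fps expectation, here is a minimal sketch that pulls the "Run the N time" value out of a log line and compares it with the per-frame budget. The helper names (`parse_run_ms`, `frame_budget_ms`) are made up for illustration and are not part of the SDK; the regular expression assumes the log format shown above.

```python
import re

# Lines copied verbatim from the two log excerpts above.
LOG_LINES = [
    "Run the 1 time: 42611.00ms or 42611240.00us",
    "Run the 1 time: 16074.00ms or 16074536.00us",
]

# Matches e.g. "Run the 1 time: 42611.00ms or 42611240.00us".
RUN_RE = re.compile(r"Run the (\d+) time: ([\d.]+)ms or ([\d.]+)us")


def parse_run_ms(line):
    """Return the run time in milliseconds, or None if the line does not match."""
    match = RUN_RE.search(line)
    return float(match.group(2)) if match else None


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    return 1000.0 / fps


budget = frame_budget_ms(25)  # 25 fps is the frame rate quoted in the post
for line in LOG_LINES:
    run_ms = parse_run_ms(line)
    print(f"{run_ms:.0f} ms per run = {run_ms / budget:.0f}x the {budget:.0f} ms budget")
```

Both profiled runs come out several hundred times over the 40 ms budget, which is the gap being asked about.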
@enggsajjad In the latest version of the SDK, I have removed support for this tool. Its results are abnormal and cannot be used as an indicator. Maybe you can add power off from the code to judge the running time.
Is it something inside the code? It would be really helpful if you could guide me on how to add power off from the code to judge the running time.
Regards,
Sajjad
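One generic way to judge the running time from code, independent of the profiling tool, is to take wall-clock timestamps around the inference call and average over several runs. The sketch below only illustrates that idea: `run_inference` is a hypothetical stand-in for whatever call performs one forward pass in your application, not an SDK function, and the sketch does not touch the power setting mentioned above.

```python
import time


def time_calls(fn, warmup=1, runs=10):
    """Call fn() `warmup` times untimed, then `runs` times timed.

    Returns (average_ms, per_run_ms).
    """
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time setup cost
    per_run_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        per_run_ms.append((time.perf_counter() - start) * 1000.0)
    return sum(per_run_ms) / len(per_run_ms), per_run_ms


def run_inference():
    # Hypothetical placeholder for one forward pass; replace with the real call.
    time.sleep(0.005)


avg_ms, samples = time_calls(run_inference, warmup=1, runs=5)
print(f"average {avg_ms:.2f} ms over {len(samples)} runs")
```

Excluding the first run from the average matters when the first call includes one-time setup work, which would otherwise inflate the result.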
@enggsajjad There’s no problem with the SDK, I think it’s your conversion code or your processing that’s wrong. I don’t know what changes you made.
@enggsajjad I haven’t reproduced the problem here, and you don’t have this problem when you convert other models, so you may still need to find the reason yourself.
I just wanted to join in the discussion. My team and I are also looking into profiling on Khadas, so that we can optimize our networks for the architecture using execution-time profiling.
I am seeing a similar behavior as described by @enggsajjad but with a very simple network (mnist).
The execution time without profiling is 2-4 ms.
The execution time with profiling is up to about 5x longer, clocking in at around 8-10 ms.
I was wondering whether the different times are related to the conversion options we are commenting out to make the profiling work, i.e. removing the following lines from 2_export_case_code.sh
I also tried including the optimize option which yields a working output.
However, that did not change the longer runtime for profiling.
Now, I think a longer execution time makes sense when a lot of additional (profiling) information is shuffled around the bus etc. However, since we don't understand the relationship between the profiling time and the true time, we cannot derive any meaningful conclusion from it.
We would be super thankful if you could shed some light on this. I can also provide the mnist code if that is helpful for you.
(I am using kernel 4.9.241 from Dec 17,21 and the associated firmware iso).
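For reference, the slowdown implied by the mnist figures above can be bounded with simple arithmetic. This is only a sketch using the ranges quoted in this post (2-4 ms without profiling, 8-10 ms with profiling); `overhead_range` is a made-up helper name.

```python
def overhead_range(base_ms, profiled_ms):
    """Return (min_factor, max_factor) of profiled time over base time.

    Both arguments are (low, high) ranges in milliseconds.
    """
    base_lo, base_hi = base_ms
    prof_lo, prof_hi = profiled_ms
    # Smallest factor: fastest profiled run against slowest base run.
    # Largest factor: slowest profiled run against fastest base run.
    return prof_lo / base_hi, prof_hi / base_lo


lo, hi = overhead_range((2.0, 4.0), (8.0, 10.0))
print(f"profiling slowdown between {lo:.1f}x and {hi:.1f}x")
# prints "profiling slowdown between 2.0x and 5.0x"
```

The factor here is far smaller than the gap seen in the yolov3 logs earlier in the thread, so the mnist case alone does not show whether the overhead is a fixed multiplier.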