VIM3 NPU convolution core count

The AMLNN Convolution Acceleration Tips document says that making the convolution output channel count an integer multiple of the number of convolution cores helps improve inference performance. How many convolution cores are there inside the NPU of the A311D chip? In other words, for the A311D, what should the convolution output channel count be set to in order to maximize NPU utilization?
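Whatever the actual core count turns out to be, the tip amounts to padding each layer's output channels up to the next multiple of that number. A minimal sketch (the core count of 8 below is a hypothetical placeholder, not the A311D's real value):

```python
def round_up_channels(channels: int, num_cores: int) -> int:
    """Round an output-channel count up to the next multiple of the
    NPU's convolution core count, so no core sits idle on the last tile."""
    return ((channels + num_cores - 1) // num_cores) * num_cores

# Hypothetical example: with 8 convolution cores, a layer with
# 60 output channels would be padded to 64.
print(round_up_channels(60, 8))  # → 64
```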



However, this manual does not state the number of convolution cores in the NPU; page 99 only gives 768 MACs per cycle (int8). The problem I'm running into is that the A311D NPU is rated at 5 TOPS, but when running my own network (excluding ARM-side pre- and post-processing), the NPU's effective throughput is only about 0.5 TOPS. What I'd like to know is how to design a network architecture that maximizes utilization of the NPU's compute units.

MACs are the convolution cores. Convolution is basically multiply-and-add, repeated many times.
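To make that concrete, here is how the MAC count of a single conv layer works out (a generic counting formula, not taken from the AMLNN document):

```python
def conv2d_macs(h_out: int, w_out: int, c_in: int, c_out: int,
                k_h: int, k_w: int) -> int:
    """MACs for one Conv2D layer: each output element needs
    k_h * k_w * c_in multiply-accumulates."""
    return h_out * w_out * c_out * k_h * k_w * c_in

# Example: a 3x3 conv, 224x224 output, 64 -> 64 channels
macs = conv2d_macs(224, 224, 64, 64, 3, 3)
print(macs)  # → 1849688064, i.e. ~1.85 GMAC (~3.7 GOP counting mul+add)
```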

By the way, how did you calculate the performance…? By 0.5 TOPS, do you mean your model requires 0.5 TOPS, or did you calculate it from inference timings?

Batching inputs is one way to increase throughput, but I am not sure input batching is supported on the A311D.

When I deploy my custom CNN to the A311D, the inference time is 16 ms. The total computation of my CNN is 8 GOP, so (1000 ms / 16 ms) × 8 GOP = 500 GOPS = 0.5 TOPS. So the achieved NPU performance is 0.5 TOPS. To my surprise, there is such a big gap between achieved performance (0.5 TOPS) and peak performance (5 TOPS).
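The calculation above, written out as a quick sanity check:

```python
# Achieved throughput = total ops per inference / measured latency
total_ops = 8e9       # 8 GOP for one inference (from the post above)
latency_s = 16e-3     # 16 ms measured inference time

achieved_tops = total_ops / latency_s / 1e12
print(achieved_tops)  # → 0.5  (i.e. 0.5 TOPS, ~10% of the rated 5 TOPS)
```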

I want to know some tricks for designing my network architecture to fully utilise the NPU's compute resources.


I am not sure about this… do note that memory bandwidth to the NPU also affects performance. The NPU in the A311D has only 1 MB of SRAM; any model larger than this must be fetched from normal system memory… that will limit performance…

This performance difference can be seen when comparing the A311D NPU with the Google Coral… The Google Coral has 8 MB of SRAM, so running MobilenetSSDV2 on the Coral (120+ fps) is significantly faster than on the A311D NPU (75 fps).

Try pruning your model to see if you notice any improvement…

I hope Amlogic/Khadas implements batched inference, which has the potential to improve performance a lot.
