VIM4 USB 3 failure when under load

Which system do you use? Android, Ubuntu, OOWOW or others?

Armbian jammy (latest, 5.15.119)

Which version of system do you use? Khadas official images, self built images, or others?

Armbian official

Please describe your issue below:

[ 8015.143803] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
[ 8015.143846] xhci-hcd xhci-hcd.0.auto: USBSTS: 0x00000000
[ 8015.143862] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
[ 8015.148647] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
[ 8015.153042] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
[ 8015.153059] xhci-hcd xhci-hcd.0.auto: USBSTS: 0x00000001 HCHalted

This happens whenever several USB devices are active at the same time.

I saw that the device tree in the Libre Computer kernel defines some quirks for the USB controller:

usb@ffe09000 {
			status = "okay";
			compatible = "amlogic,meson-g12a-usb-ctrl";
			reg = <0x00 0xffe09000 0x00 0xa0>;
			interrupts = <0x00 0x10 0x04>;
			#address-cells = <0x02>;
			#size-cells = <0x02>;
			ranges;
			clocks = <0x02 0x2f>;
			resets = <0x05 0x22>;
			dr_mode = "otg";
			phys = <0x33 0x34 0x06 0x04>;
			phy-names = "usb2-phy0\0usb2-phy1\0usb3-phy0";
			phandle = <0x137>;

			usb@ff400000 {
				compatible = "amlogic,meson-g12a-usb\0snps,dwc2";
				reg = <0x00 0xff400000 0x00 0x40000>;
				interrupts = <0x00 0x1f 0x04>;
				clocks = <0x02 0x37>;
				clock-names = "otg";
				phys = <0x34>;
				phy-names = "usb2-phy";
				dr_mode = "peripheral";
				g-rx-fifo-size = <0xc0>;
				g-np-tx-fifo-size = <0x80>;
				g-tx-fifo-size = <0x80 0x80 0x10 0x10 0x10>;
				phandle = <0x138>;
			};

			usb@ff500000 {
				compatible = "snps,dwc3";
				reg = <0x00 0xff500000 0x00 0x100000>;
				interrupts = <0x00 0x1e 0x04>;
				dr_mode = "host";
				snps,dis_u2_susphy_quirk;
				snps,quirk-frame-length-adjustment = <0x20>;
				snps,parkmode-disable-ss-quirk;
				phandle = <0x139>;
			};
		};

Do we have these in our kernel device tree?
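One way to answer this on a running board is to inspect the live device tree, which Linux exposes under /sys/firmware/devicetree/base (also reachable through /proc/device-tree) as one directory per node and one file per property. The sketch below is a hypothetical helper (`scan_usb_nodes` is not part of any existing tool) that lists every `usb@...` node with its `compatible` strings and which of the three `snps,` quirk properties quoted above are present. It only reports what the running device tree contains; it says nothing about whether the VIM4's USB driver actually honours those properties.

```python
import os
import sys

# Quirk properties taken from the Libre Computer dwc3 node quoted above.
QUIRKS = (
    "snps,dis_u2_susphy_quirk",
    "snps,quirk-frame-length-adjustment",
    "snps,parkmode-disable-ss-quirk",
)


def read_strings(path):
    """Return a NUL-separated device-tree string property as a list of str."""
    with open(path, "rb") as f:
        raw = f.read()
    return [s.decode("ascii", "replace") for s in raw.split(b"\0") if s]


def scan_usb_nodes(root="/sys/firmware/devicetree/base"):
    """Map each usb@... node (path relative to root) to (compatible, quirks present)."""
    results = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()
        # Node directories are named "usb@<address>"; skip everything else.
        if os.path.basename(dirpath).split("@")[0] != "usb":
            continue
        compat = []
        if "compatible" in filenames:
            compat = read_strings(os.path.join(dirpath, "compatible"))
        # Boolean DT properties show up as empty files, so presence is enough.
        present = [q for q in QUIRKS if q in filenames]
        results[os.path.relpath(dirpath, root)] = (compat, present)
    return results


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "/sys/firmware/devicetree/base"
    for node, (compat, present) in sorted(scan_usb_nodes(root).items()):
        print(f"{node}: compatible={compat} quirks={present or 'none'}")
```

Run it with no arguments on the board; the `compatible` strings it prints would also indicate which USB controller driver the vendor device tree binds to.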

Thank you very much!

The symptom: keep a few USB devices busy, and after about a minute the USB controller dies. All USB devices stop working, even after re-plugging, and the only way to recover is to reboot.

Hello @proffan

Do you use our PD adaptor to supply the power?

Could you provide the steps to reproduce this issue?

And could you also check our Ubuntu image vim4-ubuntu-22.04-gnome-linux-5.15-fenix-1.6-231229? You can install it with OOWOW online.

Hi @numbqq I am using a 38W PD power supply, should be plenty for the load (CPU under 30% and about 500mA of USB current).

To reproduce, connect a hub (any kind of USB hub is fine), plug 3 USB 2.0 devices into the hub (in my case 3 ESP32 Serial JTAG devices; I believe any USB-to-serial converter will do) and read from them simultaneously. After a while the USB devices disconnect with the messages shown above, and the USB controller is dead.
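The reproduction above (several USB serial devices behind one hub, all read at once) can be scripted. This is a minimal sketch using only the Python standard library; it assumes the ESP32 Serial JTAG devices show up as /dev/ttyACM0 to /dev/ttyACM2, which is an assumption, so check dmesg for the actual device names. `stress_read` and `read_port` are hypothetical helper names. Each device is read in its own thread for a fixed duration, and an error raised mid-read (for example EIO when the device drops off the bus) is recorded instead of aborting the whole run.

```python
import os
import sys
import termios
import threading
import time
import tty


def read_port(path, duration_s, results):
    """Read one device for duration_s seconds; store (bytes_read, error) in results."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK)
    except OSError as exc:
        results[path] = (0, repr(exc))
        return
    total, error = 0, None
    try:
        if os.isatty(fd):
            tty.setraw(fd)  # raw mode: deliver bytes as they arrive, no line buffering
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            try:
                chunk = os.read(fd, 4096)
            except BlockingIOError:
                chunk = b""  # non-blocking tty with no data available yet
            if chunk:
                total += len(chunk)
            else:
                time.sleep(0.01)  # avoid spinning while idle
    except (OSError, termios.error) as exc:
        error = repr(exc)  # e.g. EIO after the device is disconnected
    finally:
        os.close(fd)
    results[path] = (total, error)


def stress_read(paths, duration_s):
    """Read all paths concurrently, one thread each; return {path: (bytes, error)}."""
    results = {}
    threads = [threading.Thread(target=read_port, args=(p, duration_s, results))
               for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__" and len(sys.argv) > 2:
    duration = float(sys.argv[1])
    for path, (nbytes, err) in sorted(stress_read(sys.argv[2:], duration).items()):
        print(f"{path}: {nbytes} bytes, error={err}")
```

For example, `python3 stress_read.py 600 /dev/ttyACM0 /dev/ttyACM1 /dev/ttyACM2` reads all three ports for ten minutes; a non-`None` error for every port at roughly the same time would match the "HC died" pattern in dmesg.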

I suspect a quirk issue because of:

which is very similar.

I will need some time to test this, as I am busy with a deadline, but I may be able to test tomorrow morning.

Also, is the USB controller on the A311D2 a DWC2 or DWC3?

I think a hub and two USB disks should do the same job, just a guess…

@ivan.li Please try to reproduce this issue.

It is not the same.


Is there any information on which USB core is used? Thank you very much!

@ivan.li @numbqq I did more experiments. It appears that this only happens when the hub is connected to the USB port closer to the Wi-Fi SOM.

About one time in three, I see this sequence in dmesg:

[ 1654.014271] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1656.128334] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1662.466353] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1662.714908] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1667.373785] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1669.467200] usb 1-1.2-port3: disabled by hub (EMI?), re-enabling...
[ 1669.470359] usb 1-1.2.3: USB disconnect, device number 16
[ 1670.703238] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1670.822865] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1673.065601] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1673.543246] [dhd] dhd_process_pkt_reorder_info: *Warning, new+flush, out=1, pending=0
[ 1674.477111] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
[ 1674.477129] xhci-hcd xhci-hcd.0.auto: USBSTS: 0x00000000
[ 1674.477135] xhci-hcd xhci-hcd.0.auto: xHCI host controller not responding, assume dead
[ 1674.478879] xhci-hcd xhci-hcd.0.auto: HC died; cleaning up
[ 1674.480224] xhci-hcd xhci-hcd.0.auto: xHCI host not responding to stop endpoint command.
[ 1674.480228] xhci-hcd xhci-hcd.0.auto: USBSTS: 0x00000001 HCHalted

This suggests some interference between USB and Wi-Fi. However, it does not explain why the xHCI controller dies and the ports stay disabled until reboot.
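To check whether each controller death is preceded by a hub disabling a port, as in the log above, the dmesg timestamps can be correlated. A minimal sketch follows; `find_events` and `seconds_from_hub_disable_to_death` are hypothetical helpers, and they match only the two message fragments seen in this thread ("disabled by hub" and "HC died"). For the log above the gap is about 5 seconds (1674.478879 - 1669.467200).

```python
import re
import sys

# dmesg lines look like "[ 1669.467200] usb 1-1.2-port3: disabled by hub (EMI?), ..."
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def find_events(lines):
    """Return (timestamp, kind, message) for hub-disable and xHCI-death lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        if "disabled by hub" in msg:
            events.append((ts, "hub_disable", msg))
        elif "HC died" in msg:
            events.append((ts, "hc_died", msg))
    return events


def seconds_from_hub_disable_to_death(events):
    """Seconds from the latest hub-disable to the first 'HC died' after it, or None."""
    last_disable = None
    for ts, kind, _ in events:
        if kind == "hub_disable":
            last_disable = ts
        elif kind == "hc_died" and last_disable is not None:
            return ts - last_disable
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        events = find_events(f)
    for ts, kind, msg in events:
        print(f"{ts:12.6f} {kind:12} {msg}")
    gap = seconds_from_hub_disable_to_death(events)
    print("no hub-disable before HC death" if gap is None
          else f"hub disable -> HC died: {gap:.3f} s")
```

Usage: `dmesg > dmesg.txt` and then `python3 xhci_events.py dmesg.txt`.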

@proffan
I have been unable to reproduce this issue on my end. I tested on the USB 3.0 port. What is the difference between our setups?

Hi @ivan.li, my mistake: it’s the USB port on the far side from the Wi-Fi chip that has the problem (the 1300 one).

Hi~ @proffan
Can you try this firmware?
https://dl.khadas.com/products/vim4/firmware/ubuntu/emmc/vim4-ubuntu-22.04-gnome-linux-5.15-fenix-1.6-231229-emmc.img.xz

Yes, I can test in 3 days. Did you reproduce the problem?

Actually, today I saw the same problem on the USB port on the Wi-Fi side as well. This glitch seems to be very random.

Hello @proffan

Please use our official image to check this issue, and we also suggest using our official PD adaptor. Since we can’t reproduce this issue on our side, we need to keep the hardware and software setup the same.

Thank you for the help!

I think I found the cause of this issue. The immediate cause is a bad USB cable, which makes the USB hub disable and re-enumerate the USB devices:
[ 1669.467200] usb 1-1.2-port3: disabled by hub (EMI?), re-enabling...

Note that the USB hub itself works perfectly fine and the cable to the VIM4 has no problem. So there is still something wrong with the VIM4’s xHCI.

I will try to find a way to reproduce without a bad USB cable in the following days.

So another easy way to reproduce it is to repeatedly wiggle USB devices connected to a hub, so that they accumulate a high error count and get disabled by the hub. Anyway, my conclusion for now is that USB on the VIM4 is very experimental and will probably never get mainline support, because the USB IP is probably from Corigine rather than the usual (and battle-tested) dwc3 that people are used to.

Unless Amlogic decides to make an upstreaming effort, we will be stuck with the vendor kernel.

The NPU is also fairly experimental and is most likely from VeriSilicon (GitHub: VeriSilicon/TIM-VX, Verisilicon Tensor Interface Module). Judging by their SDK, the software quality is definitely not great.

I will be keeping my VIM4 for now, but in the short term I don’t think it is production-ready.

Please correct me if I am wrong.

But it is not a proper way to reproduce this issue.

VIM4 already has an upstreaming plan for this year.

No, the NPU IP is from Amlogic not VeriSilicon, what’s the issues you have with the NPU?

Technically you are correct.
However, in real-life testing this issue does occur and needs to be addressed. It’s much better to find issues in the lab.

As @foxsquirrel said, you are technically correct. But it makes me nervous about using the VIM4 in production, as the xHCI does not come back up. I tested a few other platforms, including an RPi4 and an Intel laptop. Under the same conditions, they just disconnect the misbehaving USB device and all other USB devices keep working. On the VIM4, the xHCI dies and all USB ports stop working until reset.

This is very nice to hear! I really hope upstreaming will help improve the VIM4.

It’s almost obvious. Look at TIM-VX/src/tim/vx/type_utils.cc and TIM-VX/include/tim/vx/ops/nbg.h (VeriSilicon/TIM-VX on GitHub, commit 8ca1382474d431f9032aa408b4abd532ccb95584).

The nbg prefix is self-revealing.

Apparently Amlogic is licensing VS’s NPU core, but is distributing the SDK as a closed-source blob rather than as open source. The official VS SDK looks much, much better, so why can’t we just get support for the official SDK?

Thank you for your response.

Yes, you are right. We will try to reproduce this issue on our side.

How do you know Amlogic is licensing VS’s NPU core? Anyway, you are right that it is not open source.