[Khadas WiP] VIM4 NVMe IO Errors

Today I tried something different: I wanted to make an image of the NVMe. The system was under little load, but after about 30 minutes of constant reading from the NVMe the IO errors started. The NVMe (controller and actual memory chip) never went over 45 °C; I was monitoring it with an IR (FLIR) camera.
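
For anyone who wants to reproduce this without a full imaging tool, here is a minimal sketch of a sustained sequential read that reports how many bytes were read before the first error. The function name `read_until_error` and the 4 MiB chunk size are illustrative assumptions, not anything from the original test.

```python
def read_until_error(path, chunk_size=4 * 1024 * 1024):
    """Read `path` sequentially from start to end.

    Returns (bytes_read, error): error is None if the end was reached,
    otherwise the OSError raised by the failing open or read.
    """
    total = 0
    try:
        # buffering=0 gives an unbuffered raw file object, so each read()
        # maps to one read request of up to chunk_size bytes.
        with open(path, "rb", buffering=0) as dev:
            while True:
                chunk = dev.read(chunk_size)
                if not chunk:  # empty result means end of device/file
                    return total, None
                total += len(chunk)
    except OSError as exc:
        return total, exc
```

Run as root against the whole device, for example `read_until_error("/dev/nvme0n1")` (adjust the device node to match your system). The byte count at failure gives a rough figure to compare between drives.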

I will keep posting any new info as I get it. I am planning to order a different NVMe device to see if it is an issue with this model NVMe.

Don’t know if you saw this but you are not alone

https://forum.khadas.com/t/970evo-1tb-on-vim4-new-m2x/15510/65

Hello @RIGeek @technodevotee @JeremiahCornelius

We need to collect more information about the NVMe SSD models that don’t work.

As far as I know, the following models have issues on your side.

  • Samsung 970 EVO 1TB
  • WD 2TB WD Green SN350
  • Netac N930E Plus
  • Sabrent Rocket Nano NVMe PCIe M.2 2242 SSD

Have you tested any other models that have issues?

Here are the test results from my side:

  • Kingston A2000 - Works
  • Samsung 980 250GB - Doesn’t work
  • WD 250GB WD Green SN550 - Doesn’t work

For clarification, this is the “Sabrent Rocket Nano NVMe PCIe M.2 2242 SSD”

Best,
— Jeremiah

I have searched around and do not have any surplus NVMe drives, though I have lots of surplus SATA and SAS drives. I am still experimenting to see what might be the cause. I’ve ruled out heat and system load completely now. The issue happens after extended high IO to or from the NVMe. It seems the NVMe’s command queue might be overflowing, but I have not been able to prove this. Maybe over this weekend I can put more time into it.
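
As a starting point for testing the queue theory, here is a minimal sketch that reads two queue-related values Linux exposes through sysfs: the block layer's `nr_requests` for the device and the `nvme` driver's `io_queue_depth` module parameter. The function name and the default device name `nvme0n1` are assumptions for illustration, and a vendor kernel may not expose both files, so missing values come back as `None`. Reading these numbers does not prove or disprove the theory; it only shows what the limits currently are.

```python
from pathlib import Path


def read_queue_settings(device="nvme0n1", sysfs_root="/sys"):
    """Return queue-related settings as a dict; None where a file is absent."""
    root = Path(sysfs_root)
    candidates = {
        # Request-queue depth for this block device.
        "nr_requests": root / "block" / device / "queue" / "nr_requests",
        # Per-queue depth requested by the nvme PCI driver.
        "io_queue_depth": root / "module" / "nvme" / "parameters" / "io_queue_depth",
    }
    settings = {}
    for name, path in candidates.items():
        try:
            settings[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not a plain integer.
            settings[name] = None
    return settings
```

Comparing the output on a working setup (say, the same drive in a PC) against the VIM4 might show whether the limits differ at all.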

That sounds feasible to me given my experience.

When I tried making an image to my Netac formatted as NTFS, it was pretty slow (~35 MB/sec) but managed about 12GB. When I tried the same thing with it formatted as ext4, it was fast (~125 MB/sec) but failed after only a few GB.

So it seems the system handles NTFS quite slowly, and not just on NVMe: SD cards formatted as NTFS are also much slower than those formatted as ext4.

For whatever reason, that bottleneck seems to allow more data to be written before it craps out entirely.

Not that what actually gets written is any good, as I discovered when all my dashcam videos got corrupted.

My NVMe is partitioned as LVM and formatted as ext4. I wanted ZFS but I would have had to rebuild the kernel.

Which NVMe drives work? @numbqq, only the Kingston A2000?

Hello @RIGeek @technodevotee @Jart25

We have reproduced this issue on our side and we are working on it now.

Whoowee! This is great.

Great news. Commenting just to follow the thread, as my 970 Pro has the same IO buffer errors mentioned by others.

I mentioned above that I had added thermal pads to my NVMe device. While this did not help the issue, I still wanted to share it, as NVMe devices do produce a decent amount of heat under load. The small controller chip, closest to the socket, is typically where the majority of the heat is produced. The M2X board has a lot of copper in it, so it should work as a decent heat spreader.

I can corroborate this with a 1TB Sabrent Rocket M.2 NVMe. I have a thermal pad between the SSD and the M2X expansion board, and a sizable heatsink thermally adhered to the drive itself. Operating temperature averages 47 °C, never rising above 50 °C. The M2X becomes noticeably warm to the touch without ever becoming hot, validating your observation that it is a good heat dissipator.

Regardless, the NVMe data corruption, read-only lockups, and occasional system freezes still occur when external cooling drops the operating temperature to 40 °C.

— Jeremiah

I have a couple of aluminum heatsinks ready to adhere to the NVMe storage and controller chips, but since fitting the thermal pads mine has not gone over 50 °C. After the IO errors are addressed, I might need to add more cooling.

Hello
I have the same problem with my Crucial P5 Plus 500 GB internal SSD (CT500P5PSSD8).

Any news on this? I’m stuck using the eMMC and unable to install Ubuntu on my NVMe…

I’ve got the Kingston A2000 that is supposed to work… but I let a copy run for a while last night associated with a PhotoPrism self-hosted app and in the middle of the night checked on it to find the file system was mounted read-only…
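
A quick way to confirm which filesystems have been flipped to read-only is to look at the options column in `/proc/mounts`. Here is a minimal sketch, assuming the standard six-field layout of that file; the function name `read_only_mounts` is made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return (device, mount_point) pairs whose mount options include 'ro'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, _fstype, options = fields[:4]
        # Match the exact option "ro", so "errors=remount-ro" on a
        # read-write mount is not mistaken for a read-only mount.
        if "ro" in options.split(","):
            found.append((device, mount_point))
    return found
```

On a live system this would be called as `read_only_mounts(open("/proc/mounts").read())`. Note that some pseudo-filesystems are legitimately read-only, so only the NVMe partitions are of interest here.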

Only now did I have the time to look into why:

root@Khadas:/var/log# dmesg | grep nvme
[ 0.655449] nvme nvme0: pci function 0000:01:00.0
[ 0.655642] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 0.763881] nvme nvme0: missing or invalid SUBNQN field.
[ 0.868009] nvme nvme0: allocated 64 MiB host memory buffer.
[ 0.932090] nvme nvme0: 8/0/0 default/read/poll queues
[ 0.938808] nvme0n1: p1 p2
[ 5.198172] Adding 25106772k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:25106772k SS
[ 5.885765] EXT4-fs (nvme0n1p1): error loading journal
[ 118.571705] EXT4-fs (nvme0n1p1): error loading journal
[ 357.220837] EXT4-fs (nvme0n1p1): warning: mounting unchecked fs, running e2fsck is recommended
[ 357.238469] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: errors=remount-ro

I tried manually running fsck on it and it claimed a bunch of blocks were garbage. So I’m thinking there’s still something fishy with the VIM4 drivers, specifically in the NVMe handling. I suspect all NVMe drives are susceptible to this issue; some just show it more easily than others…
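
To make logs from different drives easier to compare, here is a minimal sketch that picks out the error and warning lines for one device from saved `dmesg` text. The keyword list (`error`, `warning`, `timeout`) and the function name are assumptions for illustration; adjust them to whatever messages actually show up.

```python
import re

# "[ 5.885765] message" -> timestamp and message text
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Whole-word match, so the mount option "errors=remount-ro" is not flagged.
KEYWORD_RE = re.compile(r"\b(error|warning|timeout)\b", re.IGNORECASE)


def flag_suspicious(dmesg_text, device="nvme0n1"):
    """Return (timestamp, message) for lines naming `device` with a keyword."""
    flagged = []
    for line in dmesg_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a timestamped kernel log line
        msg = match.group("msg")
        if device in msg and KEYWORD_RE.search(msg):
            flagged.append((float(match.group("ts")), msg))
    return flagged
```

Fed the log above, this should return the two "error loading journal" lines and the "mounting unchecked fs" warning, while skipping the normal probe messages.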

It clearly did not work when they started selling the product; I am wondering if it will ever work.

I had my suspicions that the problem couldn’t be limited to certain brands/models. Unfortunately, @ps23Rick seems to have confirmed them.

Does anyone know if NVMe is supported and works faultlessly under Android?

If it is supported and works faultlessly, that would rule out the hardware as being the problem, which is something at least.

As someone with a long software development background, my perspective is that Android will exhibit the same behavior, because the underlying drivers probably need some tweaks.

I suppose there could be a race condition of some sort but I’ve not looked at the code and have zero experience with Linux kernel code regardless…

But if the hardware is fine, and I believe that’s the case, I’d hope that in due time the code adjustments will be made… but time will tell…
