[Khadas WiP] VIM4 NVMe IO Errors

Great news. Commenting just to follow the thread, as my 970 Pro has the same I/O buffer errors mentioned by others.

2 Likes

I mentioned above that I had added thermal pads to my NVMe device. While this did not help the issue, I still wanted to share this, as NVMe devices do produce a decent amount of heat under load. The small controller chip closest to the socket is typically where the majority of the heat is produced. The M2X board has a lot of copper in it, so it should work as a decent heat spreader.

1 Like

I can corroborate this, with a 1TB Sabrent Rocket M.2 NVMe. I have a thermal pad between the SSD and the M2X expansion board, and a sizable heatsink thermally adhered to the drive itself. Operating temperature averages 47 °C, never rising above 50 °C. The M2X becomes noticeably warm to the touch without ever becoming hot, validating your observation that it is a good heat dissipator.

Regardless, the NVMe data corruption, read-only lockups, and occasional system freezes still occur even when external cooling drops the operating temperature to 40 °C.

— Jeremiah

1 Like

I have a couple of aluminum heatsinks ready to adhere to the NVMe storage and controller chips, but since adding the thermal pads mine has not gone over 50 °C. After the I/O errors are addressed, I might need to add more cooling.
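For comparison with the temperatures quoted here: nvme id-ctrl reports the drive's warning threshold (wctemp) in kelvin, so the wctemp : 358 seen in outputs further down the thread is roughly 85 °C. A minimal sketch of that conversion, assuming an integer kelvin value and the rounded 273 offset; the function name is made up for illustration:

```shell
# kelvin_to_celsius is a hypothetical helper name, not part of nvme-cli.
# It converts an integer kelvin value (as printed in the wctemp field)
# to whole degrees Celsius using the rounded 273 offset.
kelvin_to_celsius() {
    echo $(( $1 - 273 ))
}

# Example (uncomment to run):
# kelvin_to_celsius 358
```

Against a threshold in that range, the 40 to 50 °C readings reported above leave a wide margin.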

1 Like

Hello
I have the same problem with my Crucial P5 Plus 500 GB internal SSD (CT500P5PSSD8).

Any news on this? I’m stuck using the eMMC, unable to install Ubuntu on my NVMe drive…

1 Like

I’ve got the Kingston A2000 that is supposed to work… but last night I let a copy job run for a self-hosted PhotoPrism app, and when I checked on it in the middle of the night I found the file system was mounted read-only…

Only now did I have the time to look into why:

root@Khadas:/var/log# dmesg | grep nvme
[ 0.655449] nvme nvme0: pci function 0000:01:00.0
[ 0.655642] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 0.763881] nvme nvme0: missing or invalid SUBNQN field.
[ 0.868009] nvme nvme0: allocated 64 MiB host memory buffer.
[ 0.932090] nvme nvme0: 8/0/0 default/read/poll queues
[ 0.938808] nvme0n1: p1 p2
[ 5.198172] Adding 25106772k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:25106772k SS
[ 5.885765] EXT4-fs (nvme0n1p1): error loading journal
[ 118.571705] EXT4-fs (nvme0n1p1): error loading journal
[ 357.220837] EXT4-fs (nvme0n1p1): warning: mounting unchecked fs, running e2fsck is recommended
[ 357.238469] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: errors=remount-ro

I tried manually running fsck on it and it claimed a bunch of blocks were garbage. So I’m thinking there’s still something fishy with the VIM4-specific drivers that handle NVMe. I suspect all NVMe drives are susceptible to this issue, and some just show it more easily than others…
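A quick way to spot the read-only remount described above, without waiting for an application to fail, is to look for the ro option in /proc/mounts. This is only a sketch: the helper name is hypothetical, and it assumes the standard /proc/mounts layout (device, mount point, type, comma-separated options).

```shell
# list_readonly_mounts is a hypothetical helper. It reads text in
# /proc/mounts format on stdin and prints "device mountpoint" for every
# entry whose option list contains the exact option "ro".
# (errors=remount-ro alone does not count; that is only the error policy.)
list_readonly_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2; break }
    }'
}

# On a live system (uncomment to run):
# list_readonly_mounts < /proc/mounts | grep nvme
```

If the NVMe partition shows up in that list after having been mounted read-write, the kernel has most likely hit a file system error and applied errors=remount-ro, as in the dmesg output above.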

2 Likes

It clearly did not work when they started selling the product; I’m wondering if it will ever work.

1 Like

I had my suspicions that the problem couldn’t be limited to certain brands/models. Unfortunately, @ps23Rick seems to have confirmed them.

Does anyone know if NVMe is supported and works faultlessly under Android?

If it is supported and works faultlessly, that would rule out the hardware as being the problem, which is something at least.

1 Like

As someone with a long software development background, my expectation is that Android will exhibit the same behavior, as the underlying drivers probably need some tweaks.

I suppose there could be a race condition of some sort but I’ve not looked at the code and have zero experience with Linux kernel code regardless…

But if the hardware is fine, and I believe that’s the case, I’d hope that in due time the code adjustments will be made… but time will tell…

1 Like

PCIe and NVMe drivers are part of the kernel. PCIe and NVMe are protocols supporting various optional states (like power management for example).

As such comparing Android with Linux is pointless without taking into account the nature of technology.

  • do kernel versions differ between Android and Linux? (Honest question. I don’t use anything Amlogic because of the crippled IO on these SoCs.)
  • what about PCIe basics like power management? Is there a /sys/module/pcie_aspm/parameters/policy and, if so, what does it contain (the active value is shown in brackets)? If this file isn’t there, what does find /sys -iname "*aspm*" report?
1 Like

I don’t use Android other than on my phone and am never likely to, so I don’t know anything about it.

Surely the kernel is different between them?

What about Debian?

Has anyone managed to get that running and have NVMe working properly on a VIM4?

edit:

it seems possible - https://forum.khadas.com/t/building-debian-with-fenix/16791/15

unfortunately, it looks like it has the same problem - https://forum.khadas.com/t/building-debian-with-fenix/16791/14

1 Like

Usually Linux on SoCs from the Android world starts with the very same kernel the Android images are assembled with. Also, PCIe-relevant kernel settings that fit Android usually do not fit Linux. Power management is one example.

IIRC BSP kernel for the VIM3 (Amlogic G12B family) was initially a 4.9, no idea what it’s now and whether it’s an updated Amlogic BSP kernel or some mainline variant (BSP = the SoC vendor’s ‘board support package’).

Silly me confused the devices. VIM4 is Amlogic T7 family and there’s only one 5.4.125 BSP kernel that is used everywhere. It’s an Android BSP kernel and the chances that settings fit for Linux use are close to 0%.

Debian, Ubuntu or Android doesn’t really matter since the board maker (or 3rd parties like Armbian) and not experts from Debian or Canonical assemble kernel and userland and usually don’t touch hardware settings they’re not familiar with.

As such I would recommend that anyone affected by this issue check ASPM (Active State Power Management) first.

Search for the respective sysfs node and print the defaults: find /sys -iname "*aspm*". There should be something like /sys/module/pcie_aspm/parameters/policy and the active option is printed in brackets, e.g. [powersupersave].
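The bracket convention can be read mechanically. A small sketch, assuming the policy file has the usual one-line format with the active entry in square brackets; the helper name is hypothetical:

```shell
# active_aspm_policy is a hypothetical helper. It reads the policy line on
# stdin and prints only the entry enclosed in square brackets.
active_aspm_policy() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# On a live system (uncomment to run):
# active_aspm_policy < /sys/module/pcie_aspm/parameters/policy
```

On mainline kernels, writing one of the listed names to the same file as root switches the policy, but the write can be refused; whether Amlogic's BSP kernel permits it is not confirmed anywhere in this thread.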

3 Likes

mine says [default] performance powersave powersupersave.

1 Like

That’s a reasonable default, since at least in my experiments I can trigger data corruption pretty easily with powersupersave. Aside from ASPM (generic PCIe power management) there’s another thing called APST (Autonomous Power State Transition) that allows the NVMe controller inside the SSD to switch between power states on its own.
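On mainline kernels, APST is governed by the nvme_core module parameter default_ps_max_latency_us, where a value of 0 turns APST off. The sketch below only interprets that value; it assumes the Amlogic BSP kernel exposes the same parameter, which has not been verified here, and the helper name is hypothetical.

```shell
# describe_apst_latency is a hypothetical helper. Its argument is the value
# of nvme_core's default_ps_max_latency_us parameter (in microseconds).
describe_apst_latency() {
    if [ "$1" -eq 0 ]; then
        echo "APST disabled (max latency 0 us)"
    else
        echo "APST allowed for states with latency up to $1 us"
    fi
}

# On a live system, if the parameter exists (uncomment to run):
# describe_apst_latency "$(cat /sys/module/nvme_core/parameters/default_ps_max_latency_us)"
```

Booting with nvme_core.default_ps_max_latency_us=0 on the kernel command line is the usual way to rule APST out on mainline; treat that as something to try, not a confirmed fix for the VIM4.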

It could be interesting to know whether the affected SSDs support more than one power state. And ofc checking error logs could help as well. This needs an apt install nvme-cli and then

  • nvme error-log /dev/nvme0 | grep ^status_field | grep -v SUCCESS to show errors
  • nvme id-ctrl /dev/nvme0 | grep -A1 -E "^ps|^apsta" for APST details

Power states are different with different SSDs. On consumer drives this can look like this:

apsta     : 0x1
wctemp    : 358
--
ps    0 : mp:7.80W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.40W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0100W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

This drive supports 5 different power states and can go as low as 0.0100W, but it is then in a ‘non-operational’ AKA sleep state. APST is supported: apsta : 0x1.

A data center SSD like this Samsung MZQL21T9HCJR-00A07 shows 2 power states while not being capable of APST (which ofc is BS and needs a special parameter to be usable with APST):

apsta     : 0
wctemp    : 353
--
ps    0 : mp:25.00W operational enlat:70 exlat:70 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:4.00W active_power:14.00W
ps    1 : mp:8.00W operational enlat:70 exlat:70 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:4.00W active_power:8.00W

And something in between (Transcend TS500GMTE240S) shows this:

apsta     : 0x1
wctemp    : 358
--
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

(APST enabled but only 1 power state doesn’t make that much sense either).

Anyway: maybe those SSDs unaffected only support 1 power state and those affected more?
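To make that comparison easier across drives, the id-ctrl output can be condensed to three numbers. This is a sketch that assumes the nvme-cli text layout shown in the samples above (lines starting with "apsta" and "ps N :"); the helper name is hypothetical.

```shell
# summarize_power_states is a hypothetical helper. It reads `nvme id-ctrl`
# text on stdin and prints the apsta value, the number of power states, and
# how many of those states are marked non-operational.
summarize_power_states() {
    awk '
        /^apsta[ \t]*:/          { apsta = $NF }
        /^ps[ \t]+[0-9]+[ \t]*:/ { total++; if ($0 ~ /non-operational/) nonop++ }
        END { printf "apsta=%s states=%d non_operational=%d\n", apsta, total, nonop }
    '
}

# On a live system (uncomment to run):
# nvme id-ctrl /dev/nvme0 | summarize_power_states
```

Posting that one line along with the drive model would show quickly whether affected and unaffected drives differ in their power states.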

1 Like

The first command (nvme error-log with the grep filter) shows nothing at all. Without the filter, all the log contains is 63 entries like this

.................
error_count	: 0
sqid		: 0
cmdid		: 0
status_field	: 0(SUCCESS: The command completed successfully)
phase_tag	: 0
parm_err_loc	: 0
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

The second command (nvme id-ctrl) gives me

apsta     : 0x1
wctemp    : 356
--
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

1 Like

OK, just one power state. We have already ruled out ASPM (the policy is at default), and most probably APST isn’t the issue either. Nick reported that the Kingston A2000 would work, and this specific SSD was affected by an APST bug that was fixed last year and backported to 5.4.97, so the respective quirk is also in Amlogic’s 5.4.125.

As for the NVMe error log, it should be checked immediately after an issue or corruption has occurred. AFAIK it gets cleared once the SSD is power cycled.
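Since the log may not survive a power cycle, it helps to count the failed entries and save a copy before rebooting. A minimal sketch that mirrors the grep filter given earlier in the thread; the helper name is hypothetical, and it assumes the status_field line format shown in the outputs here.

```shell
# count_failed_entries is a hypothetical helper. It reads `nvme error-log`
# text on stdin and prints how many status_field lines do not report SUCCESS.
count_failed_entries() {
    awk '/^status_field/ && !/SUCCESS/ { n++ } END { print n + 0 }'
}

# On a live system, right after an error (uncomment to run):
# nvme error-log /dev/nvme0 | count_failed_entries
# nvme error-log /dev/nvme0 > "nvme-error-$(date +%Y%m%d-%H%M%S).log"
```

Write the saved copy to the eMMC or an SD card, not to the NVMe drive that just failed.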

1 Like

I’ll have to have a look at the log next time I get an error. thx for that

Interestingly, since the NVMe drive was plugged in and I hadn’t tried it in a while, I did a dd of the eMMC to it and it completed, although it only managed 35 MB/s. I’m going to try writing it to an SD card and see if it is any good.

BTW, the drive is a Netac N930 Pro that is formatted as NTFS because it normally lives in a USB3 caddy so I can use it on my Windows notebook.

update: didn’t work, system took ages to boot, had a default desktop and wouldn’t run properly with the SD card after restoring to it

1 Like

Unfortunately, @ps23Rick has even had problems with a Kingston A2000 https://forum.khadas.com/t/khadas-wip-vim4-nvme-io-errors/16572/23

1 Like

So that brings Nick’s count of working SSDs down to zero and all SSDs tested fail. Well, that makes searching for SSD differences (as tried above with APST) 100% pointless.

The kernel you VIM4 users all are using regardless of Android, Ubuntu, Debian, Arch/Manjaro is Amlogic’s most recent forward ported version (from 4.9, before from at least 3.14, 3.10 and whatnot) which is a 5.4.125. There’s nothing else and there won’t be anything else anytime soon or at all.

One huge problem with these vendor kernels is that the SoC vendor’s employees have been forward-porting the code base forever and likely simply skipped patches here and there when merge conflicts occurred.

Nobody knows how much code and which areas this affects unless somebody takes the time and effort to rebase this Amlogic kernel on a clean 5.4.125 LTS. This is not an Amlogic problem but one of ‘ARM SoCs originating from the Android world’ in general, and as such applies to e.g. Allwinner or Rockchip as well; see the Radxa Forum thread ‘The radxa bsp kernel patches : from 5.10.67 to 5.10.123’ (ROCK 5 Series).

Well, 26 days ago Nick said they identified the issue and are working on it…

1 Like

I won’t hold my breath then.

It is a shame because it was the only reason I bought the thing.

1 Like