[Khadas WiP] VIM4 NVMe IO Errors

Hello
I have the same problem with my Crucial P5 Plus 500 GB internal SSD (CT500P5PSSD8).

Any news on this? I’m stuck using the eMMC and unable to install Ubuntu on my NVMe drive…

I’ve got the Kingston A2000 that is supposed to work, but last night I let a copy run for a while for a self-hosted PhotoPrism app, and when I checked on it in the middle of the night the file system had been remounted read-only…

Only now did I have the time to look into why:

root@Khadas:/var/log# dmesg | grep nvme
[ 0.655449] nvme nvme0: pci function 0000:01:00.0
[ 0.655642] nvme 0000:01:00.0: enabling device (0000 -> 0002)
[ 0.763881] nvme nvme0: missing or invalid SUBNQN field.
[ 0.868009] nvme nvme0: allocated 64 MiB host memory buffer.
[ 0.932090] nvme nvme0: 8/0/0 default/read/poll queues
[ 0.938808] nvme0n1: p1 p2
[ 5.198172] Adding 25106772k swap on /dev/nvme0n1p2. Priority:-2 extents:1 across:25106772k SS
[ 5.885765] EXT4-fs (nvme0n1p1): error loading journal
[ 118.571705] EXT4-fs (nvme0n1p1): error loading journal
[ 357.220837] EXT4-fs (nvme0n1p1): warning: mounting unchecked fs, running e2fsck is recommended
[ 357.238469] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: errors=remount-ro

I tried manually running fsck on it and it claimed a bunch of blocks were garbage. So I think there’s still something fishy with the VIM4 drivers’ NVMe handling. I suspect all NVMe drives are susceptible to this issue and some just show it more readily than others…
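
For anyone who wants to catch the read-only remount sooner than by accident, here is a minimal sketch. It assumes the usual /proc/mounts layout (device, mount point, type, options); the function name ro_mounts is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "device mountpoint" for every filesystem whose
# mount options (4th field of /proc/mounts) contain the plain "ro" flag.
# An optional argument names a different file in the same format.
ro_mounts() {
    awk '{
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $1, $2; break }
    }' "${1:-/proc/mounts}"
}

# Usage: show only NVMe partitions that are currently read-only
#   ro_mounts | grep '^/dev/nvme'
```

Note that errors=remount-ro in the options is only the policy, not the state; the sketch matches the standalone ro flag that appears once ext4 has actually given up.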

It clearly did not work when they started selling the product. I’m wondering if it will ever work.

I had my suspicions that the problem couldn’t be limited to certain brands/models. Unfortunately, @ps23Rick seems to have confirmed them.

Does anyone know if NVMe is supported and works faultlessly under Android?

If it is supported and works faultlessly, that would rule out the hardware as being the problem, which is something at least.

As someone with a long software development background, my perspective is that Android will exhibit the same behavior, as the underlying drivers probably need some tweaks.

I suppose there could be a race condition of some sort but I’ve not looked at the code and have zero experience with Linux kernel code regardless…

But if the hardware is fine, and I believe that’s the case, I’d hope that in due time the code adjustments will be made… but time will tell…

PCIe and NVMe drivers are part of the kernel. PCIe and NVMe are protocols supporting various optional states (like power management for example).

As such, comparing Android with Linux is pointless without taking into account the nature of the technology.

  • do kernel versions differ between Android and Linux? (Honest question. I don’t use anything Amlogic because of the crippled IO on these SoCs.)
  • what about PCIe basics like power management? Is there a /sys/module/pcie_aspm/parameters/policy, and if so what does it read (the active value is shown in brackets)? If this file isn’t there, what does find /sys -iname "*aspm*" report?

I don’t use Android other than on my phone and am never likely to, so I don’t know anything about it.

Surely the kernel is different between them?

What about Debian?

Has anyone managed to get that running and have NVMe working properly on a VIM4?

edit:

it seems possible - https://forum.khadas.com/t/building-debian-with-fenix/16791/15

unfortunately, it looks like it has the same problem - https://forum.khadas.com/t/building-debian-with-fenix/16791/14

Usually Linux on SoCs from the Android world starts with the very same kernel the Android images are assembled with. Also, PCIe-relevant kernel settings that suit Android usually do not suit Linux. Power management is one example.

IIRC BSP kernel for the VIM3 (Amlogic G12B family) was initially a 4.9, no idea what it’s now and whether it’s an updated Amlogic BSP kernel or some mainline variant (BSP = the SoC vendor’s ‘board support package’).

Silly me confused the devices. VIM4 is Amlogic T7 family and there’s only one 5.4.125 BSP kernel that is used everywhere. It’s an Android BSP kernel and the chances that settings fit for Linux use are close to 0%.

Debian, Ubuntu or Android doesn’t really matter, since it is the board maker (or third parties like Armbian), not experts from Debian or Canonical, who assembles kernel and userland, and they usually don’t touch hardware settings they’re not familiar with.

As such I would recommend that anyone affected by this issue check ASPM (Active State Power Management) first.

Search for the respective sysfs node and print the defaults: find /sys -iname "*aspm*". There should be something like /sys/module/pcie_aspm/parameters/policy and the active option is printed in brackets, e.g. [powersupersave].
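
The check can be scripted roughly like this. It is only a sketch: it assumes the policy node looks like the one-line format described above, and the helper name active_aspm_policy is invented here.

```shell
#!/bin/sh
# Hypothetical helper: extract the bracketed (active) entry from an ASPM
# policy line such as "default performance [powersave] powersupersave".
active_aspm_policy() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Read the sysfs node named in the post, if this kernel exposes it.
policy_file=/sys/module/pcie_aspm/parameters/policy
if [ -r "$policy_file" ]; then
    echo "active ASPM policy: $(active_aspm_policy "$(cat "$policy_file")")"
else
    echo "no $policy_file here; try: find /sys -iname '*aspm*'"
fi
```

If the node is missing, the kernel was most likely built without ASPM support, which is useful to know in itself.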

mine says [default] performance powersave powersupersave.

That’s a reasonable default, since at least in my experiments I can trigger data corruption pretty easily with powersupersave. Aside from ASPM (generic PCIe power management) there’s another thing called APST (Autonomous Power State Transition) that allows the NVMe controller inside the SSD to switch between its power states on its own.

It would be interesting to know whether the affected SSDs support more than one power state. And ofc checking error logs could help as well. That needs an apt install nvme-cli and then

  • nvme error-log /dev/nvme0 | grep ^status_field | grep -v SUCCESS to show errors
  • nvme id-ctrl /dev/nvme0 | grep -A1 -E "^ps|^apsta" for APST details
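
The first command can be wrapped so that it yields a single number that is easy to compare between runs. A minimal sketch, assuming the text output of nvme error-log has one status_field line per entry as the grep above implies; count_nvme_errors is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: count error-log entries whose status_field is not
# SUCCESS. Expects `nvme error-log` text output on stdin.
count_nvme_errors() {
    grep '^status_field' | grep -vc 'SUCCESS'
}

# Usage (needs root and nvme-cli):
#   nvme error-log /dev/nvme0 | count_nvme_errors
```

A result of 0 means every entry reports success, i.e. the same as the filtered command printing nothing.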

Power states are different with different SSDs. On consumer drives this can look like this:

apsta     : 0x1
wctemp    : 358
--
ps    0 : mp:7.80W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:3.40W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0700W non-operational enlat:210 exlat:1200 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0100W non-operational enlat:2000 exlat:8000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

This drive supports 5 different power states and can go as low as 0.0100W, though it is then in a ‘non-operational’ AKA sleep state. APST is supported: apsta : 0x1.

A data center SSD like this Samsung MZQL21T9HCJR-00A07 shows 2 power states while not being capable of APST (which ofc is BS and needs a special parameter to be usable with APST):

apsta     : 0
wctemp    : 353
--
ps    0 : mp:25.00W operational enlat:70 exlat:70 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:4.00W active_power:14.00W
ps    1 : mp:8.00W operational enlat:70 exlat:70 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:4.00W active_power:8.00W

And something in between (Transcend TS500GMTE240S) shows this:

apsta     : 0x1
wctemp    : 358
--
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

(APST enabled but only 1 power state doesn’t make that much sense either).

Anyway: maybe the unaffected SSDs only support 1 power state and the affected ones support more?
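
To make the comparison between drives easier, the number of power states can be counted from the same output. A sketch only, assuming the "ps N :" line format shown in the examples above; count_power_states is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: count the "ps N :" power state lines in
# `nvme id-ctrl` text output read from stdin.
count_power_states() {
    grep -c -E '^ps +[0-9]+ *:'
}

# Usage (needs root and nvme-cli):
#   nvme id-ctrl /dev/nvme0 | count_power_states
```

Posting that number together with the SSD model would be enough to test the idea.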

The error-log command shows nothing at all. All the unfiltered log contains is 63 entries like this

.................
error_count	: 0
sqid		: 0
cmdid		: 0
status_field	: 0(SUCCESS: The command completed successfully)
phase_tag	: 0
parm_err_loc	: 0
lba		: 0
nsid		: 0
vs		: 0
trtype		: The transport type is not indicated or the error is not transport related.
cs		: 0
trtype_spec_info: 0
.................

The id-ctrl command gives me

apsta     : 0x1
wctemp    : 356
--
ps    0 : mp:9.00W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-

Ok, just one power state, so while we have already ruled out ASPM (since the policy is default), APST most probably isn’t the issue either. Nick reported that the Kingston A2000 would work, and this specific SSD is one that was affected by an APST bug that was fixed last year and backported to 5.4.97, so the respective quirk is also in Amlogic’s 5.4.125.

As for the NVMe error log, it should be checked immediately after an issue or corruption has occurred. AFAIK it gets cleared once the SSD is power cycled.
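
Since the log may be gone after a power cycle, it helps to save it together with the kernel messages before rebooting. A rough sketch under a few assumptions: the SSD is /dev/nvme0, nvme-cli is installed, and the output directory lives on the eMMC or an SD card rather than on the NVMe drive that may have gone read-only. The function name snapshot_nvme_logs is made up.

```shell
#!/bin/sh
# Hypothetical helper: save NVMe-related kernel messages and the NVMe error
# log into a timestamped directory, then print that directory's path.
#   $1 = controller device (default /dev/nvme0)
#   $2 = parent directory for the snapshot (default: current directory)
snapshot_nvme_logs() {
    dev=${1:-/dev/nvme0}
    out="${2:-.}/nvme-snapshot-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$out" || return 1
    # dmesg may need root on some systems; keep going if it fails.
    dmesg 2>/dev/null | grep -i -E 'nvme|ext4' > "$out/dmesg.txt"
    if command -v nvme >/dev/null 2>&1; then
        nvme error-log "$dev" > "$out/error-log.txt" 2>&1
    else
        echo "nvme-cli not installed" > "$out/error-log.txt"
    fi
    echo "$out"
}

# Usage (as root, right after an I/O error shows up):
#   snapshot_nvme_logs /dev/nvme0 /root
```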

I’ll have to have a look at the log next time I get an error. thx for that

Interestingly, since the NVMe drive was plugged in and I hadn’t tried it in a while, I did a dd of the eMMC to it and it completed, although it only managed 35 MB/s. I’m going to try writing it to an SD card and see if it is any good.

BTW, the drive is a Netac N930 Pro that is formatted as NTFS because it normally lives in a USB3 caddy so I can use it on my Windows notebook.

update: didn’t work, system took ages to boot, had a default desktop and wouldn’t run properly with the SD card after restoring to it

Unfortunately, @ps23Rick has even had problems with a Kingston A2000 https://forum.khadas.com/t/khadas-wip-vim4-nvme-io-errors/16572/23

So that brings Nick’s count of working SSDs down to zero and all SSDs tested fail. Well, that makes searching for SSD differences (as tried above with APST) 100% pointless.

The kernel all you VIM4 users are using, regardless of Android, Ubuntu, Debian or Arch/Manjaro, is Amlogic’s most recent forward-ported version (from 4.9, and before that from at least 3.14, 3.10 and whatnot), which is a 5.4.125. There’s nothing else and there won’t be anything else anytime soon, if at all.

One huge problem with these vendor kernels is that the SoC vendor’s employees have been forward-porting the code base since forever and likely simply skip patches here and there when merge conflicts occur.

Nobody knows how much code and which areas this affects unless somebody takes the time and effort to rebase this Amlogic kernel on a clean 5.4.125 LTS. This is not an Amlogic problem but one of ‘ARM SoCs originating from the Android world’ in general, and as such applies to e.g. Allwinner or Rockchip as well: The radxa bsp kernel patches : from 5.10.67 to 5.10.123 - ROCK 5 Series - Radxa Forum

Well, 26 days ago Nick said they identified the issue and are working on it…

I won’t hold my breath then.

It is a shame because it was the only reason I bought the thing.

Wow… you’ve all been busy since I last checked in. I’m wondering if the next reasonable step to figure this out would be to do what was suggested above: take the time to rebase on a vanilla kernel source tree, slowly apply the respective bits of SoC-specific code, and test. Lather, rinse, repeat.

Unfortunately it seems like this path would require assistance (perhaps lots) from the SOC vendor who has the intimate knowledge of the innards of the parts and that Khadas is just a middleman that can only try to fix things and not the actual source of the problems so to speak… Interesting … all of this… It’s too bad that Khadas is stuck in the middle. I just wonder how much help the SOC vendor would be willing to provide in an effort to make more future sales…?

Maybe I’m off on these assumptions; I’ll have to go back and re-read these posts. I think it’s sad that Khadas and Nick have to do their best to fix problems that landed in their lap because of not-so-great work by the SoC supplier. (My interpretation)

These are all the issues with SoCs and vendors that Armbian was created to address. I’ve seen Armbian and Igor, Balbes, etc. criticized for their zealous adherence to the mainline kernel. This seems misplaced. Mainline kernel support is CORE to the mission of Armbian’s existence.

The position that Khadas straddles, as a producer of amazing SOC hardware, in this case a practical NVMe SSD HW interface, perfectly illustrates the value of pursuing that mission.

I believe that much of the first bootstrap of Fenix images is inspired by the Armbian build and install system. These contributions help everyone.

I look forward to a future date where these issues have been solved for the Amlogic A311D2 with Armbian, and we can wring the full potential from the Khadas efforts.

— Jeremiah

That would be Amlogic. Why would they care about anything that doesn’t result in selling millions of SoCs?

Here are Christian’s insights on what to expect from Amlogic and this latest version of their BSP kernel mess (and why A311D2 isn’t just an A311D with an added 2 but something new and entirely different):

What’s at the heart of VIM4 will be soon selling in the millions as this thing. That’s Amlogic’s market and not a few thousand SBC here or there.

Given the crappy I/O capabilities of this Amazon box, most probably the whole PCIe thing inside this SoC has only been tested with Wi-Fi chipsets and not storage/NVMe. But who knows? At least it should be well understood that Amlogic makes and cares about only TV boxes and the like, and that I/O with Amlogic was always crap and will most probably remain crap for the foreseeable future. Since why would a TV box need decent I/O capabilities?

You have made a number of very valid points there @tkaiser.

I was really keen on getting Debian installed on mine but another thread on here about that subject mentions that BayLibre don’t consider the A311D2 to be a priority and I suspect other OS providers would be of the same opinion.

So, if Amlogic and the OS providers can’t or won’t sort it, where does that leave us?

I guess it rests on Khadas, since they sold the device as having NVMe capability.

I’m sure it will be rather expensive for them to get it sorted but maybe even more so if it can’t be resolved.

I live in hope that the problem is resolved before the product becomes irrelevant.