VIM3 rcu_preempt detected stalls on CPUs/tasks

I am using kernel 5.4 with the latest Android AOSP.
The system sometimes hangs. Sometimes RCU stall info is shown on the console, but sometimes nothing is printed.

[19971.576735] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[19971.577257] rcu: 	2-...0: (0 ticks this GP) idle=d0a/1/0x4000000000000000 softirq=1596658/1596658 fqs=1816 
[19971.586892] 	(detected by 1, t=5254 jiffies, g=4113101, q=4)
[19971.592488] Task dump for CPU 2:
[19971.595680] sugov:2         R  running task        0   175      2 0x0000000a
[19971.602663] Call trace:
[19971.605101]  __switch_to+0x220/0x270
[19971.608630]  __cpufreq_driver_target+0x3e4/0x4cc
[19971.613196]  sugov_work+0x50/0x68
[19971.616474]  kthread_worker_fn+0xf0/0x1b8
[19971.620437]  kthread+0x134/0x150
[19971.623631]  ret_from_fork+0x10/0x18
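
For anyone collecting these reports from several hangs, here is a minimal Python sketch that pulls the fields out of the "(detected by ...)" line above. The function name `parse_stall_line` and the dictionary keys are hypothetical, chosen only for illustration; the field meanings follow the kernel's RCU stall-warning documentation (`t=` is jiffies since the start of the grace period, `g=` is the grace-period sequence number, `q=` is an estimate of queued RCU callbacks).

```python
import re

# Matches the "(detected by ...)" line of an RCU stall report as printed above.
DETECT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(detected by (?P<cpu>\d+), "
    r"t=(?P<jiffies>\d+) jiffies, g=(?P<gp>-?\d+), q=(?P<q>\d+)"
)


def parse_stall_line(line):
    """Return the stall fields as a dict, or None if the line does not match."""
    m = DETECT_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),            # seconds since boot
        "detected_by_cpu": int(m.group("cpu")),       # CPU that noticed the stall
        "jiffies_since_gp_start": int(m.group("jiffies")),
        "grace_period": int(m.group("gp")),
        "queued_callbacks": int(m.group("q")),
    }
```

Feeding each line of a saved console log through `parse_stall_line` and keeping the non-None results gives a quick list of when stalls were detected and by which CPU.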

Has anyone else run into this problem?

Best Regards
xxn
2022/02/08

I’m planning to send this to the kernel: while it is 100% a ‘hack’, it also seems to be needed and to be the only way to solve the stall issues. Amlogic does something similar in the vendor kernel, and various experiments with fewer OPP points removed, boosted voltages, etc. never seemed to work.

Hi @chewitt

Thanks for your reply.
I am not familiar with the kernel. Could you please explain why this may solve the CPU stall issue?
Can you point me to some background material that I can refer to?

Best Regards

https://patchwork.kernel.org/project/linux-amlogic/patch/20220209135535.29547-1-christianshewitt@gmail.com/

^ I’m not able to explain why the stalls happen, but nobody else can explain either - see the comments in the patch submission.


OK, I think you are right, because the issue no longer seems to be reproducible when I set the CPU governor to “performance”.
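
For reference, the governor can be switched at runtime through the standard cpufreq sysfs files. Below is a minimal Python sketch of that; the helper name `set_governor` and its `cpufreq_root` parameter are hypothetical (the parameter exists only so the function can be pointed at a test directory), and writing to the real sysfs path needs root.

```python
from pathlib import Path


def set_governor(governor, cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Set `governor` on every cpufreq policy that offers it.

    Returns the names of the policies that were changed.
    """
    changed = []
    for policy in sorted(Path(cpufreq_root).glob("policy*")):
        # Each policy lists the governors it supports, separated by spaces.
        available = (policy / "scaling_available_governors").read_text().split()
        if governor not in available:
            continue  # skip policies that do not offer this governor
        (policy / "scaling_governor").write_text(governor)
        changed.append(policy.name)
    return changed
```

Calling `set_governor("performance")` as root should pin every policy to its highest frequency until the next reboot or governor change, which makes it a cheap way to check whether the stalls depend on frequency scaling.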

But I would like to know how you found the relationship between the stalls and CPU frequency so quickly. :grinning:

Best Regards

I’ve had a version of that patch in my kernel patchset for 18+ months so for me it’s an easily spotted problem with a known workaround/solution.

I mean: when you met this issue for the first time, how did you know that removing OPPs would be a workaround?

I just want to learn how you analyzed the problem.

Thanks

I noticed the stalls, while a retro-gaming derivative of the distro I work on (LibreELEC) wasn’t seeing them despite using the same kernel sources. One of the regular differences is that the gaming folks force the performance governor on CPU/GPU, so I tried that change and saw that the stalls stopped. That prompted some device-tree research in the vendor kernel, where I noticed Amlogic had deleted the 100/250MHz nodes, and then I saw another distro using the vendor kernel deleting the 500/667MHz nodes. Testing with those removed from the upstream kernel gave the same (no stalls) result. One of the upstream kernel maintainers suggested CPU OPP voltage tweaks (another thing the Amlogic sources fiddle with) but those never resolved the issue. So the working change was found through a typical combination of luck, educated guesswork, research, and a lot of trial-and-error testing. I’d love to have a more impressive story about analysis and diagnostics, but I’m not a coding developer so I have to fall back on old-school engineering method: test incremental changes and observe the system to see if you can identify differences.
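
For anyone who wants to repeat this kind of incremental test without rebuilding the device tree, a rough runtime approximation is to raise the minimum scaling frequency so the governor never selects the low OPPs. This is a sketch under assumptions, not the device-tree change itself: the helper name `raise_min_freq`, the `floor_khz` value and the `cpufreq_root` parameter are hypothetical, it relies on the driver exposing `scaling_available_frequencies`, and whether it avoids the stalls the same way has not been verified here.

```python
from pathlib import Path


def raise_min_freq(floor_khz, cpufreq_root="/sys/devices/system/cpu/cpufreq"):
    """Set scaling_min_freq to the lowest available frequency >= floor_khz.

    Returns a dict mapping policy name to the new minimum (in kHz).
    """
    result = {}
    for policy in sorted(Path(cpufreq_root).glob("policy*")):
        # Available frequencies are listed in kHz, separated by spaces.
        text = (policy / "scaling_available_frequencies").read_text()
        freqs = sorted(int(f) for f in text.split())
        candidates = [f for f in freqs if f >= floor_khz]
        if not candidates:
            continue  # no frequency at or above the floor; leave policy alone
        new_min = candidates[0]
        (policy / "scaling_min_freq").write_text(str(new_min))
        result[policy.name] = new_min
    return result
```

Running `raise_min_freq(1000000)` as root would keep each policy at or above 1GHz, which skips the 100/250/500/667MHz points mentioned above. Comparing stall frequency before and after is one more incremental test in the same spirit.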
