That would be amazing. It would add lots of extra value to the VIM boards for me, and I suspect for a lot of others who like to play with these things! Getting better performance from the boards can only be a good thing.
I know but the usage of those two scripts and the different results show clearly 3 different types of problems that need to be fixed:
- 4.9 kernel and scheduler: Demanding tasks do not end up on the faster cores (cpu 0-3) but for whatever reasons are sent to the slower ones (cpu 4-7). This needs to be fixed in the kernel (maybe @narmstrong has an idea how?) since one of the results is that especially single threaded real world tasks that need performance end up being limited to 1000 MHz which is clearly something you do not want to have on a device advertised as being capable of 1500 MHz, right?
- The kernel has no control over cpufreq clockspeeds. When we want 1512 MHz all we get in reality is 1416 MHz. This does not affect performance that much since the difference is below 10% (about 6%), but it’s still annoying to buy something advertised as being 1.5 GHz capable and then get 1.4 GHz in reality while the kernel and all the usual tools report bogus numbers (1512 while it’s 1416 in reality).
- The real problem is that the bl30.bin thing seems to make some weird decisions depending on CPU affinity. Even when we tell the cpufreq driver to always use maximum clockspeeds (be it 1512 or 1416 on the faster cores is irrelevant this time) this is not what’s happening without fixed CPU affinity. So when we’re not using taskset to pin tasks to specific CPU cores or clusters the firmware on the M3 decides on its own to do fancy things with real clockspeeds instead of using those the cpufreq driver demands. No idea why it’s that way but your results without taskset clearly show totally weird numbers both below 1000 MHz and above. The purpose of the cpufreq driver framework is to control this behaviour and not just to give some hints that some proprietary firmware running somewhere else is free to ignore (totally trashing performance as a side effect).
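To make the second and third points easier to check on your own board, here is a minimal shell sketch. It only prints what the cpufreq driver reports through sysfs (which, per the above, may not be the real clockspeed) and works out how far 1416 MHz falls short of 1512 MHz. The function names and the CPU_SYSFS override are my own assumptions for illustration, not part of any script used in this thread.

```shell
#!/bin/sh
# Sketch only: function names and the CPU_SYSFS override are illustrative.
# Standard cpufreq sysfs location; override it to try the code on a fake tree.
CPU_SYSFS="${CPU_SYSFS:-/sys/devices/system/cpu}"

# Print the frequency the cpufreq driver REPORTS for each core, in MHz.
# This is the kernel's view, not a measurement of the real clock.
report_cpufreq() {
    for dir in "$CPU_SYSFS"/cpu[0-9]*; do
        f="$dir/cpufreq/scaling_cur_freq"
        [ -r "$f" ] || continue
        khz=$(cat "$f")
        echo "${dir##*/}: $((khz / 1000)) MHz (reported)"
    done
}

# Shortfall of the real clock versus the requested one, in tenths of a percent.
shortfall_permille() {
    echo $(( ($1 - $2) * 1000 / $1 ))
}

# Usage:
#   report_cpufreq
#   shortfall_permille 1512 1416    # 63, i.e. about 6.3%
#   taskset -c 0-3 some_benchmark   # pin a (hypothetical) benchmark to cpu 0-3
```

Comparing the reported numbers with and without taskset pinning against a real measurement is what exposes the mismatch described above; this sketch covers only the reported side.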
IMO the only real fix would be a new firmware comparable to the situation with Hardkernel and S905 that fixes the following issues
- stop reporting bogus/faked values back to the cpufreq driver
- do what the cpufreq driver wants. If the driver demands 1512 MHz then set 1512 MHz, if the driver demands 100 MHz then do this as well (the user for whatever reasons might want to save energy – allow him to do this)
- stop the big.LITTLE emulation and treat all A53 cores equally. It makes no sense to artificially differentiate between ‘fast’ and ‘slow’ cores if they’re all the same
And not related to the blob situation: the SMP/HMP scheduling needs a fix in your kernel since on S912 cpu 0-3 are always the cores where the work should end up first as long as the firmware plays big.LITTLE emulation.
Because the kernel and the system can be assembled from different sources and with different compilers. Amlogic Buildroot uses the compiler 4.9 and your configuration set and the kernel source. I, for example, use other customization options and include patches and configuration options that allow the same kernel to run on the entire s9xxx line. This can already significantly change the behavior of the entire system.
Sure, but if I understood @numbqq correctly all he did differently was using two different scripts (that only differ wrt using taskset or not and adjusting clockspeeds on the ‘little’ cluster too) on exactly the same system running the same userland and kernel. How to explain the differences and especially the totally weird real clockspeeds with his first test?
Seriously: Is nobody here concerned that specifying clockspeeds via the cpufreq framework results in totally bogus real clockspeeds based on $something?
I don’t know how Hardkernel got their custom binary, but I think we can try to ask Amlogic for it.
In this link there is a summary of how the community got to know about Amlogic’s blob reporting false speeds, and you can read between the lines about Hardkernel’s reaction.
about the bl30.bin blob, this tree here seems to have some similarities with some of the strings found in the blob:
also there’s some documentation here
amlogic’s bl30 definitely seems forked off this source (or some other common ancestor), so maybe it can be used as a basis for reverse engineering…
Really interesting. I wonder if the linux-meson community is aware of these advances towards reverse engineering the bl30.bin. Maybe @narmstrong knows about that.
the offsets in the bl30 ELF generator seem off though, as it appears the beginning of bl30.bin is an interrupt vectors table and not assembler code.
as we see here and here, and with this raw objdump command:
dd if=bl30.bin bs=440 skip=1 of=bl30-text.bin
arm-none-eabi-objdump -b binary -marm --prefix-addresses -EL -M force-thumb -D -C bl30-text.bin
we see that the output closely matches init.S from chromiumOS’s EC code:
Disassembly of section .data:
0x00000000 mov.w r0, #0
0x00000004 msr CONTROL, r0
0x00000008 isb sy
0x0000000c ldr r1, [pc, #60] ; (0x0000004c)
0x0000000e ldr r2, [pc, #64] ; (0x00000050)
0x00000010 str r1, [r2, #0]
0x00000012 mov.w r0, #0
0x00000016 ldr r1, [pc, #28] ; (0x00000034)
0x00000018 ldr r2, [pc, #28] ; (0x00000038)
0x0000001a cmp r1, r2
0x0000001c it lt
0x0000001e strlt.w r0, [r1], #4
0x00000022 blt.n 0x0000001a
0x00000024 ldr r0, [pc, #44] ; (0x00000054)
0x00000026 mov sp, r0
0x00000028 bl 0x00005738
0x0000002c b.n 0x0000002c
0x0000002e b.w 0x00000158
0x00000032 nop
so with a little bit of adaptation, it can be made exploitable as an ELF file (maybe by taking chromium-ec’s linker file as well)
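Before adapting it, the 440-byte header that the dd command skips can be eyeballed as 32-bit words. This is a rough sketch: the function name is made up, the 440-byte default is taken from the dd invocation above, and od prints words in the host's byte order, so it assumes a little-endian machine. On a Cortex-M core the first two vector table words would normally be the initial stack pointer and the reset handler address, which is one way to sanity-check the guess that this region is a vector table.

```shell
#!/bin/sh
# Sketch only: dump_header is an illustrative helper, not an existing tool.
# Dump the first N bytes (default 440, matching the dd skip above) of a blob
# as 32-bit hex words with decimal offsets. Assumes a little-endian host.
dump_header() {
    od -A d -t x4 -v -N "${2:-440}" "$1"
}

# Usage:
#   dump_header bl30.bin        # the whole 440-byte header
#   dump_header bl30.bin 8      # just the first two words
```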
not yet, I can’t get my hands on that c2_freq_patch_0902.zip file, it seems to have been removed from the odroid forum, and the bl30.bin blob that was pushed to hardkernel’s u-boot git tree seems to have more changes than the few bytes @cyrozap talked about, so I can’t easily find them in the S912 binary…
Did you try these versions?:
update BL30 BL31
I want to avoid versions with changes unrelated to the max freq settings.
also, it appears the binary varies a lot depending on who at amlogic compiled it, probably due to different gcc versions, making comparison more difficult.
if nothing comes out of the hardkernel thread, I’ll resort to full-scale reverse engineering using radare2 and/or retdec and/or snowman decompiler… (good occasion to learn how to properly use these powerful tools)
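Before going that far, a plain byte-level comparison gives a first idea of how much two blob versions differ. A minimal sketch using cmp; the function name and the file names in the usage comment are placeholders I made up.

```shell
#!/bin/sh
# Sketch only: function and file names are placeholders.
# Count the byte positions at which two files differ (over their common length).
count_diff_bytes() {
    cmp -l "$1" "$2" 2>/dev/null | wc -l | tr -d ' '
}

# Usage:
#   count_diff_bytes bl30-a.bin bl30-b.bin
#   cmp -l bl30-a.bin bl30-b.bin | head   # byte number, then both values in octal
```

A handful of differing bytes would point to a small patch, while a very large count more likely means a different build, in which case a raw byte diff will not say much.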
There is a fresh version of the ZIP files in the Odroid forums: https://forum.odroid.com/viewtopic.php?f=141&t=23044&p=223198#p223198
Also, there are some interesting remarks from the same author, in the linux-amlogic mailing lists: https://lists.infradead.org/pipermail/linux-amlogic/2017-May/003823.html
excellent! thanks a lot
4.9 kernel and scheduler: Demanding tasks do not end up on the faster cores (cpu 0-3) but for whatever reasons are sent to the slower ones (cpu 4-7). This needs to be fixed in the kernel (maybe @narmstrong has an idea how?) since one of the results is that especially single threaded real world tasks that need performance end up being limited to 1000 MHz which is clearly something you do not want to have on a device advertised as being capable of 1500 MHz, right?
Yup, does @Khadas have any updates on this? This is gnucash starting on my VIM2: allocated to CPU5, running at 100% CPU but at 1000 MHz!
@numbqq Definitely, this issue must be looked into. Right now, since the kernel allocates threads to the slow cores instead of the fast ones, it turns out that the performance of the VIM2 is worse than that of the VIM1, and we paid twice as much for it.
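Until the scheduler is fixed, a user-space workaround is to check which core a task sits on and move it by hand. A minimal sketch, assuming a Linux /proc; the helper name is made up, and the taskset line is only a workaround, not a fix for the kernel issue.

```shell
#!/bin/sh
# Sketch only: cpu_from_stat is an illustrative helper.
# Print the CPU a task last ran on, read from a /proc/<pid>/stat style file.
# The CPU number is field 39; the command name (field 2) may contain spaces,
# so everything up to the last ") " is stripped before counting fields.
cpu_from_stat() {
    stat=$(cat "$1") || return 1
    rest=${stat##*') '}      # now starts at field 3 (state)
    set -- $rest
    shift 36                 # field 39 is the 37th word of "rest"
    echo "$1"
}

# Usage (PID is a placeholder):
#   cpu_from_stat /proc/PID/stat     # e.g. 5 for a task on cpu 5
#   taskset -cp 0-3 PID              # move the task to cpu 0-3
```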