S912 limited to 1200 MHz with multithreaded loads

@numbqq However, I think you guys should ask Amlogic for binaries with unlocked DVFS. You don’t care much about those things when you are just using the device as a TV Box, but if you want to get serious about stuff like DIY, server, cluster, etc., you need to be able to tune the real performance/consumption ratio.

I have read that other companies have gotten that from Amlogic. I don’t think they should treat you differently just because you are a young company. If you need user support, I’m sure there are many of us who would be willing to write emails to Amlogic complaining about their untrustworthy policy.

Only Hardkernel for their ODROID-C2. Neither FriendlyELEC (NanoPi K2 with S905) nor Libre Computer (‘Le Potato’) has blobs they are allowed to share. Read here what Da Xue (Libre Computer) writes: Some basic benchmarks for Le Potato? - Le Potato - Armbian Community Forums

Well, if they did it once, then they are obliged to do it with others. They have no right to make a distinction between “first-class” and “second class” companies.

I think so too. But Hardkernel’s issue was a 500 MHz shortfall from the promised speed: 1.5 GHz instead of 2 GHz. I fear they don’t care about a 100 MHz shortfall.
I would love to see them care a lot more. SBC users depend a lot on what the SoC manufacturer releases for CPU control, hardware acceleration, … Too bad not enough people use SBCs. Otherwise they wouldn’t dare withhold important information.

Hi tkaiser,

I used a different script that you provided.

#!/bin/bash
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
	for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
		echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
		echo -e "$o cores, $(( $i / 1000)) MHz: \c"
		sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
	done
done
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 2>&1 | grep -E "percentile|min:|max:|avg:"

Result is:

1 cores, 100 MHz:     execution time (avg/stddev):   58.1148/0.00
1 cores, 250 MHz:     execution time (avg/stddev):   47.8097/0.00
1 cores, 500 MHz:     execution time (avg/stddev):   63.7481/0.00
1 cores, 667 MHz:     execution time (avg/stddev):   53.2392/0.00
1 cores, 1000 MHz:     execution time (avg/stddev):   36.7519/0.00
1 cores, 1200 MHz:     execution time (avg/stddev):   30.6434/0.00
1 cores, 1512 MHz:     execution time (avg/stddev):   25.8836/0.00
4 cores, 100 MHz:     execution time (avg/stddev):   12.0569/0.02
4 cores, 250 MHz:     execution time (avg/stddev):   14.3230/0.00
4 cores, 500 MHz:     execution time (avg/stddev):   12.1902/0.00
4 cores, 667 MHz:     execution time (avg/stddev):   11.0352/0.00
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1944/0.00
4 cores, 1200 MHz:     execution time (avg/stddev):   8.0781/0.00
4 cores, 1512 MHz:     execution time (avg/stddev):   6.9720/0.00
8 cores, 100 MHz:     execution time (avg/stddev):   11.7022/0.02
8 cores, 250 MHz:     execution time (avg/stddev):   9.7152/0.01
8 cores, 500 MHz:     execution time (avg/stddev):   7.3731/0.01
8 cores, 667 MHz:     execution time (avg/stddev):   6.5240/0.01
8 cores, 1000 MHz:     execution time (avg/stddev):   5.3011/0.01
8 cores, 1200 MHz:     execution time (avg/stddev):   4.8013/0.02
8 cores, 1512 MHz:     execution time (avg/stddev):   4.3739/0.02
         min:                                  2.58ms
         avg:                                  3.39ms
         max:                                 30.63ms
         approx.  95 percentile:               3.68ms

And the other script:

#!/bin/bash
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
	for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
		echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
		echo $i >/sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq 2>/dev/null
		case $o in
			1)
				TasksetParm="-c 0"
				;;
			4)
				TasksetParm="-c 0-3"
				;;
			*)
				TasksetParm="-c 0-7"
				;;
		esac
		echo -e "$o cores, $(( $i / 1000)) MHz: \c"
		taskset ${TasksetParm} sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
		cat /sys/devices/virtual/thermal/thermal_zone0/temp
	done
done

Result is:

1 cores, 100 MHz:     execution time (avg/stddev):   382.9829/0.00
43000
1 cores, 250 MHz:     execution time (avg/stddev):   148.9977/0.00
43000
1 cores, 500 MHz:     execution time (avg/stddev):   73.8164/0.00
43000
1 cores, 667 MHz:     execution time (avg/stddev):   55.2353/0.00
43000
1 cores, 1000 MHz:     execution time (avg/stddev):   36.7397/0.00
44000
1 cores, 1200 MHz:     execution time (avg/stddev):   30.5951/0.00
44000
1 cores, 1512 MHz:     execution time (avg/stddev):   25.9128/0.00
45000
4 cores, 100 MHz:     execution time (avg/stddev):   94.4586/0.01
43000
4 cores, 250 MHz:     execution time (avg/stddev):   37.1176/0.01
44000
4 cores, 500 MHz:     execution time (avg/stddev):   18.4188/0.00
45000
4 cores, 667 MHz:     execution time (avg/stddev):   13.7993/0.00
45000
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1685/0.00
46000
4 cores, 1200 MHz:     execution time (avg/stddev):   7.6367/0.00
46000
4 cores, 1512 MHz:     execution time (avg/stddev):   6.4686/0.00
47000
8 cores, 100 MHz:     execution time (avg/stddev):   47.7804/0.01
44000
8 cores, 250 MHz:     execution time (avg/stddev):   18.7053/0.01
45000
8 cores, 500 MHz:     execution time (avg/stddev):   9.2905/0.00
45000
8 cores, 667 MHz:     execution time (avg/stddev):   6.9671/0.00
46000
8 cores, 1000 MHz:     execution time (avg/stddev):   4.6269/0.00
48000
8 cores, 1200 MHz:     execution time (avg/stddev):   4.1788/0.01
49000
8 cores, 1512 MHz:     execution time (avg/stddev):   3.8022/0.00
50000

I don’t know how Hardkernel got their custom binary, but I think we can try to ask Amlogic for it.

That would be amazing - it would add lots of extra value to the VIM boards for me and, I suspect, a lot of others who like to play with these things! Getting better performance from the boards can only be a good thing.

I know but the usage of those two scripts and the different results show clearly 3 different types of problems that need to be fixed:

  1. 4.9 kernel and scheduler: demanding tasks do not end up on the faster cores (cpu 0-3) but, for whatever reason, are sent to the slower ones (cpu 4-7). This needs to be fixed in the kernel (maybe @narmstrong has an idea how?). One consequence is that single-threaded real-world tasks that need performance end up limited to 1000 MHz, which is clearly not what you want on a device advertised as being capable of 1500 MHz, right?
  2. The kernel has no control over cpufreq clockspeeds. When we ask for 1512 MHz, all we get in reality is 1416 MHz. This does not affect performance that much since the difference is below 10%, but it’s still annoying to buy something advertised as 1.5 GHz capable and then get 1.4 GHz in reality while the kernel and all the usual tools report bogus numbers (1512 while it’s really 1416).
  3. The real problem is that the bl30.bin thing seems to make some weird decisions depending on CPU affinity. Even when we tell the cpufreq driver to always use maximum clockspeeds (whether that is 1512 or 1416 on the faster cores is irrelevant this time), this is not what happens without fixed CPU affinity. So when we’re not using taskset to pin tasks to specific CPU cores or clusters, the firmware on the M3 decides on its own to do fancy things with real clockspeeds instead of using those the cpufreq driver demands. No idea why it’s that way, but your results without taskset clearly show totally weird numbers both below 1000 MHz and above. The purpose of the cpufreq driver framework is to control this behaviour, not just to give hints that some proprietary firmware running somewhere else is free to ignore (totally trashing performance as a side effect).
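
One rough way to put numbers on points 2 and 3: if we assume sysbench's execution time scales inversely with the real clockspeed, any run can be converted into an implied clockspeed relative to a reference run. The sketch below uses the pinned single-thread 1000 MHz result (36.7397 s) as that reference. Both the linear-scaling assumption and the choice of reference are mine, and `effective_mhz` is just a made-up helper name, so treat the output as an estimate and not a measurement.

```bash
#!/bin/bash
# Estimate the clockspeed implied by a sysbench execution time.
# Assumption: execution time is inversely proportional to the real clockspeed.
#   $1 = reference clockspeed in MHz (assumed to be real)
#   $2 = execution time in seconds at that reference clockspeed
#   $3 = execution time in seconds of the run to evaluate
effective_mhz() {
    local ref_mhz=$1 ref_secs=$2 secs=$3
    awk -v m="$ref_mhz" -v r="$ref_secs" -v s="$secs" \
        'BEGIN { printf "%.0f\n", m * r / s }'
}

# Single-thread numbers copied from the results posted above
effective_mhz 1000 36.7397 25.9128    # pinned, 1512 MHz requested
effective_mhz 1000 36.7397 382.9829   # pinned, 100 MHz requested
effective_mhz 1000 36.7397 58.1148    # not pinned, 100 MHz requested
```

With those inputs the pinned 1512 MHz run comes out at roughly 1418 MHz and the pinned 100 MHz run at roughly 96 MHz, while the unpinned run that requested 100 MHz comes out at roughly 632 MHz.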

IMO the only real fix would be a new firmware, comparable to the situation with Hardkernel and S905, that fixes the following issues:

  • stop reporting bogus/faked values back to the cpufreq driver
  • do what the cpufreq driver wants. If the driver demands 1512 MHz then set 1512 MHz, if the driver demands 100 MHz then do this as well (the user for whatever reasons might want to save energy – allow him to do this)
  • stop the big.LITTLE emulation and treat all A53 in an equal way. It makes no sense to artificially differentiate between ‘fast’ and ‘slow’ cores if they’re all the same

And not related to the blob situation: the SMP/HMP scheduling needs a fix in your kernel since on S912 cpu 0-3 are always the cores where the work should end up first as long as the firmware plays big.LITTLE emulation.

Because the kernel and the system can be assembled from different sources and with different compilers. Amlogic Buildroot uses the compiler 4.9 and your configuration set and the kernel source. I, for example, use other customization options and include patches and configuration options that allow the same kernel to run on the entire s9xxx line. This can already significantly change the behavior of the entire system.

Sure, but if I understood @numbqq correctly, all he did differently was use two different scripts (that differ only in whether they use taskset and whether they adjust clockspeeds on the ‘little’ cluster too) on exactly the same system running the same userland and kernel. How do you explain the differences, and especially the totally weird real clockspeeds in his first test?

Seriously: Is nobody here concerned that specifying clockspeeds via the cpufreq framework results in totally bogus real clockspeeds based on $something?

In this link there is a summary of how the community got to know about Amlogic blobs reporting false speeds, and you can read between the lines about Hardkernel’s reaction.

about the bl30.bin blob, this tree here seems to have some similarities with some of the strings found in the blob:

also there’s some documentation here

amlogic’s bl30 definitely seems forked off this source (or some other common ancestor), so maybe it can be used as a basis for reverse engineering…

and now this

Really interesting. I wonder if the linux-meson community is aware of these advances towards reverse engineering the bl30.bin. Maybe @narmstrong knows about that.

the offsets in the bl30 ELF generator seem off though, as it appears the beginning of bl30.bin is an interrupt vectors table and not assembler code.

as we see here and here, and with this raw objdump command:

dd if=bl30.bin bs=440 skip=1 of=bl30-text.bin
arm-none-eabi-objdump -b binary -marm --prefix-addresses -EL -M force-thumb -D -C bl30-text.bin

we see that the output closely matches init.S from chromiumOS’s EC code:

Disassembly of section .data:
0x00000000 mov.w        r0, #0
0x00000004 msr  CONTROL, r0
0x00000008 isb  sy
0x0000000c ldr  r1, [pc, #60]   ; (0x0000004c)
0x0000000e ldr  r2, [pc, #64]   ; (0x00000050)
0x00000010 str  r1, [r2, #0]
0x00000012 mov.w        r0, #0
0x00000016 ldr  r1, [pc, #28]   ; (0x00000034)
0x00000018 ldr  r2, [pc, #28]   ; (0x00000038)
0x0000001a cmp  r1, r2
0x0000001c it   lt
0x0000001e strlt.w      r0, [r1], #4
0x00000022 blt.n        0x0000001a
0x00000024 ldr  r0, [pc, #44]   ; (0x00000054)
0x00000026 mov  sp, r0
0x00000028 bl   0x00005738
0x0000002c b.n  0x0000002c
0x0000002e b.w  0x00000158
0x00000032 nop

so with a little bit of adaptation, it can be made exploitable as an ELF file (maybe by taking chromium-ec’s linker file as well)
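
to sanity-check the vector-table reading of the first 440 bytes (the part the dd command above skips), those bytes can be dumped as 32-bit words. a minimal sketch, assuming the blob is little-endian (as the -EL in the objdump call suggests) and that this runs on a little-endian host; `dump_header_words` is a name I made up, and the 440-byte default is simply taken from the dd command:

```bash
#!/bin/bash
# Dump the first N bytes of a binary blob as 32-bit hex words.
#   $1 = path to the blob
#   $2 = number of bytes to dump (default 440, taken from the dd command above)
dump_header_words() {
    local blob=$1 bytes=${2:-440}
    # -A d: decimal offsets, -t x4: 4-byte hex words,
    # -v: print repeated rows instead of collapsing them, -N: byte limit
    od -A d -t x4 -v -N "$bytes" "$blob"
}

# Usage (the file name is whatever your copy of the blob is called):
# dump_header_words bl30.bin
```

if it really is a Cortex-M3 vector table, the first word should be the initial stack pointer and the second the reset handler address (with bit 0 set for Thumb), so those two values are the first thing to compare against the disassembly.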

also for reference: ARM’s Cortex-M3 documentation

looks like I wasn’t the first one to think about all this: the author of bl30-elf did too

You finished getting us faster clocks yet? 😛

Great stuff @g4b42

not yet, I can’t get my hands on that c2_freq_patch_0902.zip file, it seems to have been removed from the odroid forum, and the bl30.bin blob that was pushed to hardkernel’s u-boot git tree seems to have more changes than the few bytes @cyrozap talked about, so I can’t easily find them in the S912 binary…

I’m gonna need that one too: c2_1.6MHz_freq_patch.zip