S912 limited to 1200 MHz with multithreaded loads

Assuming the gxm (S912) bl30 is based on the same code as the gxl (S905) one, this commit could explain the 1.4 / 1.5 GHz discrepancy:


I appreciate this is now academic, but I worked out how to run a single version and capture the results. Oh, and 2 threads happened to run on the big cores and 2 on the little ones.

$ sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=4 2>&1
WARNING: the --test option is deprecated. You can pass a script name or path on the command line without any options.
WARNING: --num-threads is deprecated, use --threads instead
sysbench 1.0.8 (using system LuaJIT 2.0.4)

Running the test with following options:
Number of threads: 4
Initializing random number generator from current time


Prime numbers limit: 20000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:    77.99

General statistics:
    total time:                          10.0589s
    total number of events:              785

Latency (ms):
         min:                                 42.93
         avg:                                 51.11
         max:                                 62.86
         95th percentile:                     62.19
         sum:                              40125.05

Threads fairness:
    events (avg/stddev):           196.2500/33.25
    execution time (avg/stddev):   10.0313/0.02

Indeed, but thank you anyway. So the operation mode of sysbench has changed: the standard execution time is now only 10 seconds by default instead of running until all prime numbers are calculated. At least the standard deviation shows that some threads were running on the slow (little) cores and some on the faster (big) ones.
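For reference, a minimal sketch (assuming sysbench 1.0.x option names) of how to get back something closer to the old fixed-workload behaviour, i.e. bound the run by event count instead of by a 10-second time limit:

# sysbench 1.0.x: disable the time limit and run a fixed number of events instead
sysbench cpu --cpu-max-prime=20000 --threads=4 --time=0 --events=10000 run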

This is a clear sign of scheduler madness (the faster cores should of course get the jobs). Then @numbqq's numbers with and without fixed CPU affinity are just strange, the whole S912 design is strange (a silly little.LITTLE design with one of the clusters limited to lower clockspeeds for no reason at all), and the 'firmware' or mailbox interface cheating on us (having to rely on proprietary crap like bl30.bin BLOBs which control the CPU cores instead of the kernel) is the next annoyance.

If Amlogic really capped the real clockspeeds down to 1416 MHz already two years ago, at a time when they advertised their SoCs as being capable of running at 2.0 GHz, this is just an insane joke.

Not interested in anything S912 or Amlogic related any more…

For the main application this chip is designed for, this is entirely academic since it is totally capable of decoding all video types. It's a shame that Amlogic has chosen to destroy its reputation in this way, but so what?
For me it's the lack of product support from Amlogic which sucks.

Shoog

Just got my VIM2 and tried the script. I'm not sure about the real speed, but at least the proportion between clock speeds and number of threads is correct (e.g. 1 thread @ 100 MHz takes ~10× as long as 1 thread @ 1000 MHz; 1 thread @ 1512 MHz takes ~4× as long as 4 threads @ 1512 MHz, etc.)

1 cores, 100 MHz:     execution time (avg/stddev):   382.4676/0.00
Temp: 50000
1 cores, 250 MHz:     execution time (avg/stddev):   148.9245/0.00
Temp: 49000
1 cores, 500 MHz:     execution time (avg/stddev):   73.8128/0.00
Temp: 49000
1 cores, 667 MHz:     execution time (avg/stddev):   55.2327/0.00
Temp: 49000
1 cores, 1000 MHz:     execution time (avg/stddev):   36.7415/0.00
Temp: 50000
1 cores, 1200 MHz:     execution time (avg/stddev):   30.5977/0.00
Temp: 50000
1 cores, 1512 MHz:     execution time (avg/stddev):   25.9122/0.00
Temp: 51000
4 cores, 100 MHz:     execution time (avg/stddev):   95.1046/0.01
Temp: 48000
4 cores, 250 MHz:     execution time (avg/stddev):   37.1072/0.01
Temp: 48000
4 cores, 500 MHz:     execution time (avg/stddev):   18.4622/0.01
Temp: 49000
4 cores, 667 MHz:     execution time (avg/stddev):   13.7924/0.01
Temp: 49000
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1759/0.00
Temp: 51000
4 cores, 1200 MHz:     execution time (avg/stddev):   7.6418/0.00
Temp: 52000
4 cores, 1512 MHz:     execution time (avg/stddev):   6.4735/0.00
Temp: 54000
8 cores, 100 MHz:     execution time (avg/stddev):   48.0306/0.01
Temp: 48000
8 cores, 250 MHz:     execution time (avg/stddev):   18.7397/0.01
Temp: 49000
8 cores, 500 MHz:     execution time (avg/stddev):   9.2862/0.00
Temp: 50000
8 cores, 667 MHz:     execution time (avg/stddev):   6.9622/0.00
Temp: 52000
8 cores, 1000 MHz:     execution time (avg/stddev):   4.6392/0.00
Temp: 54000
8 cores, 1200 MHz:     execution time (avg/stddev):   4.1788/0.01
Temp: 56000
8 cores, 1512 MHz:     execution time (avg/stddev):   3.8117/0.01
Temp: 58000

Using default Khadas dual boot img (VIM2_DualOS_Nougat_Ubuntu-16.04_V171028)
$ uname -a
Linux Khadas 4.9.40 #2 SMP PREEMPT Wed Sep 20 10:03:20 CST 2017 aarch64 aarch64 aarch64 GNU/Linux

Yep, same numbers as @numbqq generated above, confirming that you're running with 1416 MHz maximum.
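To make the arithmetic behind that statement explicit (a back-of-the-envelope check, assuming the 1000 MHz step is rendered accurately and execution time scales inversely with clockspeed):

# single-threaded times from above: 36.74 s at 1000 MHz, 25.91 s at nominal 1512 MHz
awk 'BEGIN { printf "implied real clockspeed: %.0f MHz\n", 1000 * 36.74 / 25.91 }'
# -> implied real clockspeed: 1418 MHz (i.e. ~1416 MHz, not the requested 1512 MHz)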

But I've not the slightest idea why the other test @numbqq made before shows such weird results. It seems cpufreq behaviour on Amlogic platforms is not reproducible (see also Amlogic still cheating with clockspeeds - Page 2 - Amlogic meson - Armbian Community Forums).

@numbqq However, I think you guys should ask Amlogic for binaries with unlocked DVFS. You don't care much about those things when you are just using the device as a TV box, but if you want to get serious about stuff like DIY, servers, clusters, etc., you need to be able to tune the real performance/consumption ratio.

I have read that other companies have gotten that from Amlogic. I don't think they should treat you differently just because you are a young company. If you need user support, I'm sure there are many of us who would be willing to write mails to Amlogic complaining about their untrustworthy policy.

Only Hardkernel for their ODROID-C2. Neither FriendlyELEC (NanoPi K2 with S905) nor Libre Computer ('Le Potato') have blobs they are allowed to share. Read here what Da Xue (Libre Computer) writes: Some basic benchmarks for Le Potato? - Le Potato - Armbian Community Forums

Well, if they did it once, then they are obliged to do it for others. They have no right to make a distinction between "first-class" and "second-class" companies.

I think so too. But Hardkernel's issue was about a 500 MHz difference from the promised speed: 1.5 GHz instead of 2 GHz. I fear they don't care about a -100 MHz difference.
I would love to see them care a lot more. SBC users depend a lot on what the SoC manufacturer releases for CPU control, hardware acceleration, … Too bad not enough people use SBCs. Otherwise they wouldn't dare to withhold important information.

Hi tkaiser,

I used the different scripts that you provided.

#!/bin/bash
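# note: this variant only caps the cpu0 cluster's scaling_max_freq inside the loop
# and does not pin sysbench to specific cores (no taskset)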
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
	for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
		echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
		echo -e "$o cores, $(( $i / 1000)) MHz: \c"
		sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
	done
done
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 2>&1 | egrep "percentile|min:|max:|avg:"

Result is:

1 cores, 100 MHz:     execution time (avg/stddev):   58.1148/0.00
1 cores, 250 MHz:     execution time (avg/stddev):   47.8097/0.00
1 cores, 500 MHz:     execution time (avg/stddev):   63.7481/0.00
1 cores, 667 MHz:     execution time (avg/stddev):   53.2392/0.00
1 cores, 1000 MHz:     execution time (avg/stddev):   36.7519/0.00
1 cores, 1200 MHz:     execution time (avg/stddev):   30.6434/0.00
1 cores, 1512 MHz:     execution time (avg/stddev):   25.8836/0.00
4 cores, 100 MHz:     execution time (avg/stddev):   12.0569/0.02
4 cores, 250 MHz:     execution time (avg/stddev):   14.3230/0.00
4 cores, 500 MHz:     execution time (avg/stddev):   12.1902/0.00
4 cores, 667 MHz:     execution time (avg/stddev):   11.0352/0.00
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1944/0.00
4 cores, 1200 MHz:     execution time (avg/stddev):   8.0781/0.00
4 cores, 1512 MHz:     execution time (avg/stddev):   6.9720/0.00
8 cores, 100 MHz:     execution time (avg/stddev):   11.7022/0.02
8 cores, 250 MHz:     execution time (avg/stddev):   9.7152/0.01
8 cores, 500 MHz:     execution time (avg/stddev):   7.3731/0.01
8 cores, 667 MHz:     execution time (avg/stddev):   6.5240/0.01
8 cores, 1000 MHz:     execution time (avg/stddev):   5.3011/0.01
8 cores, 1200 MHz:     execution time (avg/stddev):   4.8013/0.02
8 cores, 1512 MHz:     execution time (avg/stddev):   4.3739/0.02
         min:                                  2.58ms
         avg:                                  3.39ms
         max:                                 30.63ms
         approx.  95 percentile:               3.68ms

And the other script:

#!/bin/bash
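# note: this variant caps both clusters (cpu0 and cpu4) and pins sysbench
# to specific cores with taskset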
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
	for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
		echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
		echo $i >/sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq 2>/dev/null
		case $o in
			1)
				TasksetParm="-c 0"
				;;
			4)
				TasksetParm="-c 0-3"
				;;
			*)
				TasksetParm="-c 0-7"
				;;
		esac
		echo -e "$o cores, $(( $i / 1000)) MHz: \c"
		taskset ${TasksetParm} sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
		cat /sys/devices/virtual/thermal/thermal_zone0/temp
	done
done

Result is:

1 cores, 100 MHz:     execution time (avg/stddev):   382.9829/0.00
43000
1 cores, 250 MHz:     execution time (avg/stddev):   148.9977/0.00
43000
1 cores, 500 MHz:     execution time (avg/stddev):   73.8164/0.00
43000
1 cores, 667 MHz:     execution time (avg/stddev):   55.2353/0.00
43000
1 cores, 1000 MHz:     execution time (avg/stddev):   36.7397/0.00
44000
1 cores, 1200 MHz:     execution time (avg/stddev):   30.5951/0.00
44000
1 cores, 1512 MHz:     execution time (avg/stddev):   25.9128/0.00
45000
4 cores, 100 MHz:     execution time (avg/stddev):   94.4586/0.01
43000
4 cores, 250 MHz:     execution time (avg/stddev):   37.1176/0.01
44000
4 cores, 500 MHz:     execution time (avg/stddev):   18.4188/0.00
45000
4 cores, 667 MHz:     execution time (avg/stddev):   13.7993/0.00
45000
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1685/0.00
46000
4 cores, 1200 MHz:     execution time (avg/stddev):   7.6367/0.00
46000
4 cores, 1512 MHz:     execution time (avg/stddev):   6.4686/0.00
47000
8 cores, 100 MHz:     execution time (avg/stddev):   47.7804/0.01
44000
8 cores, 250 MHz:     execution time (avg/stddev):   18.7053/0.01
45000
8 cores, 500 MHz:     execution time (avg/stddev):   9.2905/0.00
45000
8 cores, 667 MHz:     execution time (avg/stddev):   6.9671/0.00
46000
8 cores, 1000 MHz:     execution time (avg/stddev):   4.6269/0.00
48000
8 cores, 1200 MHz:     execution time (avg/stddev):   4.1788/0.01
49000
8 cores, 1512 MHz:     execution time (avg/stddev):   3.8022/0.00
50000

I don't know how Hardkernel got their custom binary, but I think we can try to ask Amlogic for it.


That would be amazing - it would add lots of extra value to the VIM boards for me, and I suspect for a lot of others who like to play with these things! Getting better performance from the boards can only be a good thing.

I know, but the usage of those two scripts and the different results clearly show three different types of problems that need to be fixed:

  1. 4.9 kernel and scheduler: Demanding tasks do not end up on the faster cores (cpu 0-3) but for whatever reason are sent to the slower ones (cpu 4-7). This needs to be fixed in the kernel (maybe @narmstrong has an idea how?), since one consequence is that especially single-threaded real-world tasks that need performance end up being limited to 1000 MHz, which is clearly something you do not want on a device advertised as capable of 1500 MHz, right? (A quick way to check where threads actually land is sketched after this list.)
  2. The kernel has no control over cpufreq clockspeeds. When we want 1512 MHz, all we get in reality is 1416 MHz. This does not affect performance that much since the difference is below 10%, but it's still annoying to buy something advertised as 1.5 GHz capable and get 1.4 GHz in reality, while the kernel and all the usual tools report bogus numbers (1512 while it's really 1416).
  3. The real problem is that the bl30.bin thing seems to make some weird decisions depending on CPU affinity. Even when we tell the cpufreq driver to always use maximum clockspeeds (whether that's 1512 or 1416 on the faster cores is irrelevant here), this is not what happens without fixed CPU affinity. So when we're not using taskset to pin tasks to specific CPU cores or clusters, the firmware on the M3 decides on its own to do fancy things with the real clockspeeds instead of using those the cpufreq driver demands. No idea why it's that way, but your results without taskset clearly show totally weird numbers both below 1000 MHz and above. The purpose of the cpufreq framework is to control this behaviour, not just to give hints that some proprietary firmware running somewhere else is free to ignore (totally trashing performance as a side effect).
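A minimal sketch of the quick check mentioned in point 1 (assuming a procps ps that can list per-thread info): start a run in the background and look at which cores the sysbench worker threads are currently placed on:

# start a multi-threaded run in the background, then sample thread placement
sysbench cpu --cpu-max-prime=20000 --threads=4 run >/dev/null 2>&1 &
sleep 3
ps -eLo psr,comm | grep sysbench | sort | uniq -c    # psr = core the thread last ran on
wait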

IMO the only real fix would be a new firmware, comparable to the situation with Hardkernel and the S905, that fixes the following issues:

  • stop reporting bogus/faked values back to the cpufreq driver
  • do what the cpufreq driver wants. If the driver demands 1512 MHz then set 1512 MHz; if the driver demands 100 MHz then do that as well (the user might want to save energy for whatever reason – allow them to do this)
  • stop the big.LITTLE emulation and treat all A53 cores equally. It makes no sense to artificially differentiate between 'fast' and 'slow' cores if they're all the same

And not related to the blob situation: the SMP/HMP scheduling needs a fix in your kernel, since on the S912 cpu 0-3 are always the cores where the work should end up first, as long as the firmware plays big.LITTLE emulation.


Because the kernel and the system can be assembled from different sources and with different compilers. The Amlogic Buildroot uses compiler version 4.9, its own configuration set and its own kernel source. I, for example, use other customization options and include patches and configuration options that allow the same kernel to run on the entire s9xxx line. This alone can significantly change the behaviour of the entire system.

Sure, but if I understood @numbqq correctly, all he did differently was use two different scripts (that only differ in using taskset or not and in adjusting clockspeeds on the 'little' cluster too) on exactly the same system, running the same userland and kernel. How to explain the differences, and especially the totally weird real clockspeeds with his first test?

Seriously: Is nobody here concerned that specifying clockspeeds via the cpufreq framework results in totally bogus real clockspeeds based on $something?
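Not that it helps, but for completeness: the only thing the kernel side exposes is what the cpufreq driver believes, read back from the same sysfs nodes the scripts above already use (one policy per cluster, cpu0-3 and cpu4-7), and on these SoCs that readback apparently cannot be trusted:

# what the cpufreq driver claims vs. what was requested, per cluster
grep . /sys/devices/system/cpu/cpu[04]/cpufreq/scaling_cur_freq
grep . /sys/devices/system/cpu/cpu[04]/cpufreq/scaling_max_freq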

In this link there is a summary of how the community got to know about Amlogic BLOBs reporting false speeds, and you can read between the lines about Hardkernel's reaction.

About the bl30.bin blob: this tree here seems to have some similarities with some of the strings found in the blob:

Also, there's some documentation here.

Amlogic's bl30 definitely seems forked off this source (or some other common ancestor), so maybe it can be used as a basis for reverse engineering…
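For anyone who wants to poke at this themselves, a trivial starting point (assuming you have extracted a bl30.bin from a firmware image; the filename is just the usual convention) is to compare its printable strings against identifiers in that public tree:

# dump printable strings from the blob and eyeball them against the public sources
strings -n 8 bl30.bin | sort -u | less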

and now this

Really interesting. I wonder if the linux-meson community is aware of these advances towards reverse engineering the bl30.bin. Maybe @narmstrong knows about that.