I believe it’s a bit different. The S912 is a TV box SoC where big.LITTLE makes no sense at all. We already know that DVFS/cpufreq scaling is controlled by a BLOB and that the values reported to and by the kernel are all bogus.
I would assume (and the tests you did almost half a year ago confirmed this) that the DVFS code simply clocks all CPU cores at 1200 MHz when multithreaded loads run on more than 4 cores, while bogus cpufreq values are reported and both /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq and /sys/devices/system/cpu/cpu4/cpufreq/scaling_cur_freq cheat on us.
Sysbench can also be used to identify this since it provides min, max and average values. If the little cluster were really running at just 1.0 GHz while the big one runs at 1.5 GHz, those four sysbench output lines would reveal it:
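For instance, pinning the same 4-thread run to each cluster in turn and comparing those statistics would make a real clockspeed difference visible. A minimal sketch, assuming the sysbench 0.4.x syntax used later in this thread and that cpu0-3/cpu4-7 map to the two clusters:

# run the identical 4-thread job once per cluster; with genuinely different
# clockspeeds the min/avg/max/percentile values would differ accordingly
taskset -c 0-3 sysbench --test=cpu --cpu-max-prime=20000 --num-threads=4 run 2>&1 | egrep "percentile|min:|max:|avg:"
taskset -c 4-7 sysbench --test=cpu --cpu-max-prime=20000 --num-threads=4 run 2>&1 | egrep "percentile|min:|max:|avg:"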
Given that the S912 clocks the little cores at 1.0 GHz and the big ones at 1.5 GHz, the two following runs should show results below A64 (little) and between S5P6818 and H6 (big):
Observed behaviour: 100% CPU load on a single (allegedly 1.5 GHz) core. I have a VIM2 Max and always run the performance governor.
openssl speed -elapsed -evp aes-128-cbc
You have chosen to measure elapsed time instead of user CPU time.
Doing aes-128-cbc for 3s on 16 size blocks: 33910308 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 64 size blocks: 22548575 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 256 size blocks: 9222714 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 1024 size blocks: 2887950 aes-128-cbc's in 3.00s
Doing aes-128-cbc for 3s on 8192 size blocks: 389737 aes-128-cbc's in 3.00s
OpenSSL 1.0.2g 1 Mar 2016
built on: reproducible build, date unspecified
options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fdebug-prefix-map=/build/openssl-Bwh9JU/openssl-1.0.2g=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
The 'numbers' are in 1000s of bytes per second processed.
type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
aes-128-cbc     180854.98k   481036.27k   787004.93k   985753.60k  1064241.83k
on a little core:
> taskset -c 7 openssl speed -elapsed -evp aes-128-cbc
> You have chosen to measure elapsed time instead of user CPU time.
> Doing aes-128-cbc for 3s on 16 size blocks: 23935181 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 64 size blocks: 15916089 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 256 size blocks: 6510493 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 1024 size blocks: 2038914 aes-128-cbc's in 3.00s
> Doing aes-128-cbc for 3s on 8192 size blocks: 275104 aes-128-cbc's in 3.00s
> OpenSSL 1.0.2g 1 Mar 2016
> built on: reproducible build, date unspecified
> options:bn(64,64) rc4(ptr,char) des(idx,cisc,16,int) aes(partial) blowfish(ptr)
> compiler: cc -I. -I.. -I../include -fPIC -DOPENSSL_PIC -DOPENSSL_THREADS -D_REENTRANT -DDSO_DLFCN -DHAVE_DLFCN_H -DL_ENDIAN -g -O2 -fdebug-prefix-map=/build/openssl-Bwh9JU/openssl-1.0.2g=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wl,-Bsymbolic-functions -Wl,-z,relro -Wa,--noexecstack -Wall -DSHA1_ASM -DSHA256_ASM -DSHA512_ASM
> The 'numbers' are in 1000s of bytes per second processed.
> type              16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
> aes-128-cbc     127654.30k   339543.23k   555562.07k   695949.31k   751217.32k
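For what it’s worth, the big/little throughput ratio at the largest block size works out to 1064241.83 / 751217.32 ≈ 1.42, which is at least in the vicinity of the claimed 1512/1000 ≈ 1.51 clock ratio, so in the single-threaded case the two clusters do appear to run at different clockspeeds.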
Running the script the output is not pretty, but here are 4 consecutive lines (running on cores 0, 1, 2 & 3).
To me (but I am no expert) this looks the same as the single-threaded version. And those seem to be pretty much between the S5P6818/1.6 GHz and RK3328/1.3 GHz numbers you quote. But I also don’t think it supports your initial feeling.
For completeness, results with the script loading all 8 cores. Here I have deleted intermediate lines to leave just 8 consecutive results:
I only interpreted the numbers. Sysbench provides execution time and standard deviation so it’s pretty capable of reporting what’s happening.
For whatever reason no one has tested again with sysbench so far, but at least it’s obvious that the ‘clockspeeds’ that can be set and read back via the cpufreq sysfs nodes are still bogus, at least for the big cluster.
Based on your single threaded tests it looks like this with openssl:
AES encryption, though, is something special since it is handled by its own special engine when ARMv8 Crypto Extensions are available, as on the S912. So I’m still curious what sysbench results look like as an example of full load running directly on the CPU cores.
It’s pretty simple to let this small script run and report results:
#!/bin/bash
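# switch both cpufreq policies to the performance governor, then walk through
# every operating point of policy cpu0 while timing sysbench runs with 1, 4
# and 8 threads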
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
    for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
        echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
        echo -e "$o cores, $(( $i / 1000)) MHz: \c"
        sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
    done
done
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 2>&1 | egrep "percentile|min:|max:|avg:"
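Since the script writes to sysfs it has to run as root, e.g.:

chmod +x s912sysbench.sh
sudo ./s912sysbench.sh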
Or use 7-zip’s benchmark mode (with 7-zip memory performance also plays an important role, so it’s not an ideal tool to draw conclusions about the count of CPU cores and actual clockspeeds. But if 7-zip performance on the big cluster is below RPi 3 numbers then something is seriously wrong):
sudo apt install p7zip
taskset -c 0-3 7zr b -mmt1
taskset -c 0-3 7zr b -mmt4
taskset -c 4-7 7zr b -mmt4
7zr b
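The interesting bit of each run is the total rating at the end; assuming 7-zip’s usual benchmark output format, filtering for it could look like this:

# keep only the final total rating line of the benchmark run
taskset -c 0-3 7zr b -mmt4 | grep 'Tot:'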
Thank you. This was on an Ubuntu Xenial aarch64 OS image?
Based on the numbers, the bl30.bin BLOB you’re using seems to do cpufreq scaling somewhat differently compared to @balbes150’s test last year, and I fear you were running into throttling after reaching or exceeding 80°C at the end of the benchmark?
Anyway: the numbers are still totally bogus. One core at 100 MHz needing just 58 seconds is impossible when the same run at 1000 MHz takes 36.75 seconds (at a real 100 MHz it would have to take roughly 36.75 × 10 = 367 seconds).
Whoa - the script has a defect that I can see on gkrellm: it is setting scaling_max_freq on the big cores but then sysbench is running on the little cores. In fact when it got to 4 copies I could see 3 running on little cores and 1 on a big core!
I am no good at scripting but will try a mod in the next minutes.
Some things have changed since last year when @balbes150 tested, but some have not. Pretty obvious: the cpufreq scaling code running in Linux has only limited influence on what’s happening in reality. The same applies to set and reported clockspeeds.
A task that runs completely inside the CPU cache has to run 10 times slower at 100 MHz compared to 1000 MHz. This is not the case here; we now even see completely weird relationships between the cpufreq set in Linux and the real clockspeed, see e.g. those single-threaded results where the 500 MHz setting performs worse than 250 MHz:
100: execution time (avg/stddev): 58.1148/0.00
250: execution time (avg/stddev): 47.8097/0.00
500: execution time (avg/stddev): 63.7481/0.00
Obviously what’s happening below 1000 MHz is totally weird, since translating between the cpufreq set and the real clockspeeds implied by the benchmark gives this picture:
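(A back-of-the-envelope reconstruction, assuming execution time scales inversely with real clockspeed and taking the 36.75 s run at the 1000 MHz setting as baseline; estimates, not measured values.)

cpufreq set    execution time    implied real clockspeed
   100 MHz          58.1 s       1000 × 36.75/58.11 ≈ 630 MHz
   250 MHz          47.8 s       1000 × 36.75/47.81 ≈ 770 MHz
   500 MHz          63.7 s       1000 × 36.75/63.75 ≈ 575 MHz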
Still weird behaviour below 1000 MHz but at least somewhat predictable. Also interesting/important: Back then he clearly showed that sysbench running on the 4 big cores was twice as slow as when running on all 8 cores:
4 cores, 1000 MHz: execution time (avg/stddev): 9.1695/0.00
4 cores, 1512 MHz: execution time (avg/stddev): 7.5821/0.00
8 cores, 1000 MHz: execution time (avg/stddev): 4.7245/0.01
8 cores, 1512 MHz: execution time (avg/stddev): 3.7980/0.01
This is a clear indication that back then there was no big.LITTLE behaviour implemented under load and all CPU cores were running at 1200 MHz when performing intensive tasks (9.1695 s vs. 4.7245 s is almost exactly a factor of 2, which requires the 4 additional cores to run at the same real clockspeed as the first 4). This has changed now and we see different behaviour, but I fear throttling is also involved since the 8-thread results are considerably worse than before.
#!/bin/bash
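# same walk through the operating points as before, but with sysbench pinned
# via taskset to cores 0-3, the cluster whose scaling_max_freq is actually
# being changed (the final 8-thread run below stays unpinned)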
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 ; do
    for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
        echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
        echo -e "$o cores, $(( $i / 1000)) MHz: \c"
        taskset -c 0-3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
    done
done
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 2>&1 | egrep "percentile|min:|max:|avg:"
This limits runs to 4 cores but pins them to the big cores; apart from that, the last run used all 8. What I observed on gkrellm regarding core frequency and which cores were in use matched my expectations. But the results are (I think?) bizarre!
# ./s912sysbench.sh
1 cores, 100 MHz: execution time (avg/stddev): 10.2136/0.00
1 cores, 250 MHz: execution time (avg/stddev): 10.2282/0.00
1 cores, 500 MHz: execution time (avg/stddev): 10.1051/0.00
1 cores, 667 MHz: execution time (avg/stddev): 10.0186/0.00
1 cores, 1000 MHz: execution time (avg/stddev): 10.0253/0.00
1 cores, 1200 MHz: execution time (avg/stddev): 10.0355/0.00
1 cores, 1512 MHz: execution time (avg/stddev): 10.0126/0.00
4 cores, 100 MHz: execution time (avg/stddev): 10.0963/0.07
4 cores, 250 MHz: execution time (avg/stddev): 10.1236/0.02
4 cores, 500 MHz: execution time (avg/stddev): 10.0290/0.03
4 cores, 667 MHz: execution time (avg/stddev): 10.0589/0.02
4 cores, 1000 MHz: execution time (avg/stddev): 10.0330/0.02
4 cores, 1200 MHz: execution time (avg/stddev): 10.0162/0.01
4 cores, 1512 MHz: execution time (avg/stddev): 10.0291/0.01
min: 42.40
avg: 51.52
max: 90.89
95th percentile: 66.84
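The second script (s912b.sh below) evidently also prints a number after each result line; these look like readings from the SoC’s thermal zone in millidegrees Celsius (so 44000 = 44°C). A minimal sketch of such an addition inside the loop, assuming the usual sysfs path (the exact thermal zone node is an assumption):

# hypothetical addition after each sysbench run: print the SoC temperature
cat /sys/class/thermal/thermal_zone0/temp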
# ./s912b.sh
1 cores, 100 MHz: execution time (avg/stddev): 10.1409/0.00
44000
1 cores, 250 MHz: execution time (avg/stddev): 10.1826/0.00
43000
1 cores, 500 MHz: execution time (avg/stddev): 10.0523/0.00
43000
1 cores, 667 MHz: execution time (avg/stddev): 10.0149/0.00
44000
1 cores, 1000 MHz: execution time (avg/stddev): 10.0222/0.00
44000
1 cores, 1200 MHz: execution time (avg/stddev): 10.0302/0.00
45000
1 cores, 1512 MHz: execution time (avg/stddev): 10.0332/0.00
45000
4 cores, 100 MHz: execution time (avg/stddev): 10.3109/0.23
44000
4 cores, 250 MHz: execution time (avg/stddev): 10.1222/0.03
43000
4 cores, 500 MHz: execution time (avg/stddev): 10.0484/0.04
44000
4 cores, 667 MHz: execution time (avg/stddev): 10.0538/0.03
45000
4 cores, 1000 MHz: execution time (avg/stddev): 10.0345/0.02
46000
4 cores, 1200 MHz: execution time (avg/stddev): 10.0294/0.01
47000
4 cores, 1512 MHz: execution time (avg/stddev): 10.0166/0.01
49000
8 cores, 100 MHz: execution time (avg/stddev): 10.3265/0.15
45000
8 cores, 250 MHz: execution time (avg/stddev): 10.1160/0.08
46000
8 cores, 500 MHz: execution time (avg/stddev): 10.0486/0.03
46000
8 cores, 667 MHz: execution time (avg/stddev): 10.0346/0.03
47000
8 cores, 1000 MHz: execution time (avg/stddev): 10.0288/0.01
49000
8 cores, 1200 MHz: execution time (avg/stddev): 10.0155/0.02
51000
8 cores, 1512 MHz: execution time (avg/stddev): 10.0166/0.01
53000
Observed results on gkrellm were as expected: note the little cores’ max_freq is 1000, so at the end of the run (8 cores), when the script steps the frequency through 1000/1200/1512, gkrellm reports the little cores constant at 1000.
Indeed. 10.1 seconds with an Ubuntu 16.04 aarch64 sysbench binary is what you get with 4 Cortex-A53 running at ~900 MHz (or 2 running at ~1800 MHz, or 8 running at ~450 MHz).
uname -a
Linux VIM2.dukla.net 4.9.40 #2 SMP PREEMPT Wed Sep 20 10:03:20 CST 2017 aarch64 aarch64 aarch64 GNU/Linux
root@VIM2:/home/chris/bin# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 17.10
Release: 17.10
Codename: artful
The environment is far from pristine: last rebooted 6 days ago and running Firefox and a couple of other desktop applications at the same time (as well as a desktop!). But the 10-second constant seems odd to me!
Did a reboot, no change in results (10 s ‘constant’). Did observe that the temperature barely moves, FWIW. Shut down the desktop, same result (although I couldn’t watch gkrellm then). Did put a stopwatch on part of the test: the result lines really do appear every 10 seconds! (I wonder whether the sysbench version shipped with 17.10 simply runs time-limited to ~10 seconds by default, unlike the 0.4.x version in 16.04 which runs a fixed amount of work.)