S912 limited to 1200 MHz with multithreaded loads

the offsets in the bl30 ELF generator seem off though, as it appears the beginning of bl30.bin is an interrupt vector table and not assembly code.
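
For a quick look at that table, a minimal sketch (it just assumes the usual Cortex-M3 layout, little-endian, with the initial stack pointer in word 0 and the reset handler in word 1):

# dump the first 16 presumed vector entries as 32-bit hex words
od -A x -t x4 -N 64 bl30.bin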

as we see here and here, and with these raw dd/objdump commands:

# skip the first 440 bytes (the presumed vector table) and keep only the code
dd if=bl30.bin bs=440 skip=1 of=bl30-text.bin
# disassemble the raw blob as little-endian Thumb code
arm-none-eabi-objdump -b binary -marm --prefix-addresses -EL -M force-thumb -D -C bl30-text.bin

we see that the output closely matches init.S from Chromium OS's EC code:

Disassembly of section .data:
0x00000000 mov.w        r0, #0
0x00000004 msr  CONTROL, r0
0x00000008 isb  sy
0x0000000c ldr  r1, [pc, #60]   ; (0x0000004c)
0x0000000e ldr  r2, [pc, #64]   ; (0x00000050)
0x00000010 str  r1, [r2, #0]
0x00000012 mov.w        r0, #0
0x00000016 ldr  r1, [pc, #28]   ; (0x00000034)
0x00000018 ldr  r2, [pc, #28]   ; (0x00000038)
0x0000001a cmp  r1, r2
0x0000001c it   lt
0x0000001e strlt.w      r0, [r1], #4
0x00000022 blt.n        0x0000001a
0x00000024 ldr  r0, [pc, #44]   ; (0x00000054)
0x00000026 mov  sp, r0
0x00000028 bl   0x00005738
0x0000002c b.n  0x0000002c
0x0000002e b.w  0x00000158
0x00000032 nop

so with a little bit of adaptation it could be turned into a usable ELF file (maybe by reusing chromium-ec's linker script as well)
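
Even without a proper linker script, binutils can already wrap the extracted code in an ELF container; a rough, untested sketch (the section rename and flags here are my own guess, not anything taken from the bl30 ELF generator):

# wrap the raw Thumb code in an ELF, marking the payload as code so objdump -d picks it up
arm-none-eabi-objcopy -I binary -O elf32-littlearm -B arm \
        --rename-section .data=.text,alloc,load,readonly,code \
        bl30-text.bin bl30.elf
arm-none-eabi-objdump -d -M force-thumb bl30.elf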

also for reference: ARM’s Cortex-M3 documentation

looks like I wasn’t the first one to think about all this: the author of bl30-elf did too

You finished getting us faster clocks yet? 😛

Great stuff @g4b42

not yet; I can’t get my hands on that c2_freq_patch_0902.zip file (it seems to have been removed from the ODROID forum), and the bl30.bin blob that was pushed to Hardkernel’s u-boot git tree seems to have more changes than the few bytes @cyrozap talked about, so I can’t easily find them in the S912 binary…

I’m gonna need that one too: c2_1.6MHz_freq_patch.zip

Did you try these versions?:


I want to avoid versions with changes unrelated to the max freq settings.

also, it appears the binary varies a lot depending on who at amlogic compiled it, probably due to different gcc versions, making comparison more difficult.

if nothing comes out of the Hardkernel thread, I’ll resort to full-scale reverse engineering using radare2 and/or RetDec and/or the Snowman decompiler… (good occasion to learn how to properly use these powerful tools)
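
For example, a bare-bones starting point (nothing here verified against the actual blob) to open the extracted code in radare2 as little-endian Thumb:

# open the stripped blob as 16-bit (Thumb) ARM, mapped at address 0
r2 -a arm -b 16 -m 0x0 bl30-text.bin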

There is a fresh version of the ZIP files in the Odroid forums: https://forum.odroid.com/viewtopic.php?f=141&t=23044&p=223198#p223198

Also, there are some interesting remarks from the same author, in the linux-amlogic mailing lists: https://lists.infradead.org/pipermail/linux-amlogic/2017-May/003823.html

excellent! thanks a lot

Yup, does @Khadas have any updates on this? This is gnucash starting on my VIM2: allocated to CPU5, running at 100% CPU but only at 1000 MHz!
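
A quick way to reproduce the observation, i.e. to see which core a process landed on and what each core is clocked at (these are just the standard procps and cpufreq sysfs interfaces, nothing VIM2-specific):

ps -o pid,psr,comm -C gnucash                                  # psr = the CPU the process is currently on
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq      # current frequency of each core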

@numbqq Definitely, this issue must be looked into. Right now, since the kernel allocates threads to the slow cores instead of the fast ones, it turns out that the VIM2 performs worse than the VIM1, and we paid twice as much for it.
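
Until the scheduling is fixed, one possible stop-gap (untested, and assuming cores 0-3 are the 1.5 GHz cluster, as the sysbench results below suggest) is to pin single-threaded jobs to the fast cluster:

taskset -c 0-3 gnucash    # keep the process on the presumed 1.5 GHz cores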

Well, when I ran sysbench on my latest build I got a different result.

root@Khadas:~# uname -a
Linux Khadas 3.14.29 #8 SMP PREEMPT Thu May 24 18:25:14 CST 2018 aarch64 aarch64 aarch64 GNU/Linux
root@Khadas:~# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
root@Khadas:~# sysbench --version
sysbench 0.4.12
root@Khadas:~# 

Test script:

root@Khadas:~# cat test.sh 
#!/bin/bash
echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo performance >/sys/devices/system/cpu/cpu4/cpufreq/scaling_governor
for o in 1 4 8 ; do
	for i in $(cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies) ; do
		echo $i >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
		echo -e "$o cores, $(( $i / 1000)) MHz: \c"
		sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o 2>&1 | grep 'execution time'
	done
done
sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=8 2>&1 | egrep "percentile|min:|max:|avg:"

Result:

root@Khadas:~# ./test.sh 
1 cores, 100 MHz:     execution time (avg/stddev):   374.5317/0.00
1 cores, 250 MHz:     execution time (avg/stddev):   147.9295/0.00
1 cores, 500 MHz:     execution time (avg/stddev):   73.3578/0.00
1 cores, 667 MHz:     execution time (avg/stddev):   55.1783/0.00
1 cores, 1000 MHz:     execution time (avg/stddev):   36.6099/0.00
1 cores, 1200 MHz:     execution time (avg/stddev):   30.5101/0.00
1 cores, 1512 MHz:     execution time (avg/stddev):   25.8465/0.00
4 cores, 100 MHz:     execution time (avg/stddev):   93.0469/0.01
4 cores, 250 MHz:     execution time (avg/stddev):   36.8459/0.01
4 cores, 500 MHz:     execution time (avg/stddev):   18.3660/0.00
4 cores, 667 MHz:     execution time (avg/stddev):   13.7663/0.00
4 cores, 1000 MHz:     execution time (avg/stddev):   9.1601/0.00
4 cores, 1200 MHz:     execution time (avg/stddev):   7.6329/0.00
4 cores, 1512 MHz:     execution time (avg/stddev):   6.4740/0.00
8 cores, 100 MHz:     execution time (avg/stddev):   46.8414/0.01
8 cores, 250 MHz:     execution time (avg/stddev):   26.5825/0.01
8 cores, 500 MHz:     execution time (avg/stddev):   15.3673/0.01
8 cores, 667 MHz:     execution time (avg/stddev):   12.0014/0.01
8 cores, 1000 MHz:     execution time (avg/stddev):   8.3543/0.00
8 cores, 1200 MHz:     execution time (avg/stddev):   7.0645/0.01
8 cores, 1512 MHz:     execution time (avg/stddev):   6.0570/0.01
         min:                                  2.58ms
         avg:                                  4.86ms
         max:                                136.68ms
         approx.  95 percentile:              36.82ms
root@Khadas:~# 

mainline kernel 4.17 is due out soon, with significant driver additions for GXM/S912 support, and even a kvim2 dtb, so it may be worth trying with it…

Do you see the problem? Sysbench runs entirely inside the CPU (caches). Running on 4 cores, the benchmark must finish in a quarter of the time it needs on 1 core. With 8 cores there has to be another halving compared to 4 cores.
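
For instance, taking the 1512 MHz rows above: one core needs 25.85 s, so four cores should need roughly 25.85 / 4 ≈ 6.5 s (observed: 6.47 s), and eight cores roughly 3.2 s, yet the measured 8-core time is 6.06 s, barely better than four cores.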

Now look at the numbers above, and especially at the min/avg/max result variation (or did heavy throttling occur?).

Cpufreq and scheduling are totally broken on the S912, at least with the 4.9 kernel and current boot blobs.

The script has a defect anyway: you are only changing the frequency of cores 0-3, while cores 4-7 run at 1000 MHz throughout. I haven’t bothered trying to sort this out, for multiple reasons, not least that in the early parts of the script (1 core) all the runs on my machine land on core 4 at 1000 MHz rather than on one of cores 0-3 at the set value.
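
A sketch of what the inner loop would need to do instead (untested; it assumes cpu4 exposes its own scaling_max_freq node, as the governor writes at the top of the script suggest, and pins the run so the 1-thread case stays on the cluster being tested):

# cap both clusters to the frequency under test (cpu4 node assumed to exist)
echo $i > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
echo $i > /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq
# pin the benchmark so a 1-thread run cannot drift onto the other cluster
taskset -c 0-3 sysbench --test=cpu --cpu-max-prime=20000 run --num-threads=$o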

The other reason is that I really don’t see much point in running old versions of software unless we are regression-testing something! Obviously sysbench moving the goalposts between v0.4.12 and v1.0.11 doesn’t help. But, as per g4b42, I would rather go from 4.9.40 to 4.17 to try to make progress than regress to 3.14.29.

I am on:

root@VIM2:~# uname -a
Linux VIM2.dukla.net 4.9.40 #2 SMP PREEMPT Wed Sep 20 10:03:20 CST 2017 aarch64 aarch64 aarch64 GNU/Linux
root@VIM2:~# cat /etc/lsb-release 
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04 LTS"
root@VIM2:~# sysbench --version
sysbench 1.0.11

@numbqq @Gouwa Any news on this? It’s been almost two months and, honestly, you don’t seem to care much about this issue. I think that your flagship performing worse than the VIM1 in every use case except those using 8 threads is something to care about.

The truth is I am getting very disappointed with how you are dealing with this. Lots of posts in other threads about new accessories, and no attention to an important design flaw in the base product that has been clearly demonstrated. I don’t want to jump to conclusions, but I really expected more.

Hi huantxo,

We have asked Amlogic about this issue, but they said they don’t limit the CPU frequency in the BL blobs, so we haven’t found a way to resolve this yet, sorry… But we will talk to Amlogic when we have enough evidence.

In what cases does the VIM2 perform worse than the VIM1? Can you kindly provide a scenario?

Thanks.

Well, does this thread not provide “enough evidence”? I assume you have read all the posts, particularly those made by @tkaiser. What further evidence do you need? Those numbers are absolutely clear.

Again, in this thread there are several examples. The closest one is here, just five posts above, where the user shows that the VIM2 is binding single-thread processes to the slow cores and therefore running them at 1.0 GHz (whereas the VIM1 runs them at 1.5 GHz, or so it claims). There are other posts in this thread, and in other threads on this forum, like for example this one.

Of course, the fact that the S912 performs worse than the S905 in certain use cases had already been pointed out almost two years ago, and I thought you were already aware of it.

The fact that you act as if you were unaware of things that have been so clearly exposed and demonstrated in this very forum only confirms my feeling that you do not care about this issue. As I said, it really disappoints me; I had hoped that Khadas would be a company different from those that only aim to sell TV boxes to consumers who don’t know what they are buying. I hope that Khadas cares about the things that matter to makers, developers and advanced users. We have done our best to give you solid data in this forum, and some community members are even investigating how to improve the blobs. I hope the story ends well and we are not let down by the hopes we put in this company.

Mmm, I notice another month has passed.

It is fun to see the same use case on a proper big.LITTLE implementation: gnucash allocated to a big core at max MHz throughout means a 55s load time, as opposed to 128s on the VIM2.