Underwhelming performance of the Khadas Vim2 Max in video rendering with Kdenlive

NicoD · November 4, 2017, 10:15pm

Hello, I recently bought the Khadas Vim2 Max and I’m testing its performance in video rendering with Kdenlive.
I also have an Odroid C2 and an Asus Tinkerboard to compare it with.

On paper the Khadas Vim2 Max should be the best of these three, with its 4 extra cores and the extra GB of RAM.
But in reality it performs the worst of them. I’m having trouble understanding what the problem could be.
I’ve got two heatsinks and a fan on my Vim2, and the temperature never goes higher than 55°C, so there is no throttling.
I always have more than 1 GB of RAM to spare.

For the first 90% of the render all cores are maxed out. The last 10% takes 40 minutes, and during that time the cores are far from maxed out.

It is a video project of 10 minutes in 1080p.
I have a video where I compared the Odroid C2 and the Tinkerboard; I’m now using the same video project on the Vim2.

The results until now:

Khadas Vim2: 1h43m46s
Odroid C2: 1h43m01s (overclocked to 1.75 GHz)
Tinkerboard: 1h12m15s

I’m filming everything again, but I’ll wait to publish until I’ve found out what exactly the problem is.

I use SBCs for video editing when traveling. That’s why my benchmarks are video rendering; that’s what I need them for.

I hope someone can help me with this. I’ll also ask on the kdenlive forums if somebody could give an explanation.
See:
https://forum.kde.org/viewtopic.php?f=265&t=142762

Greetings,

Nico Andy Dekerf (NicoD)

NicoD · November 8, 2017, 10:29pm

Hello all.
I have done some more tests. This time I ran the BMW benchmark render in Blender.
There the Khadas Vim2 Max performs great, as expected. All cores are maxed out continuously and it beats the Odroid and the Tinker Board by a wide margin.
So now I know that the performance of the CPU isn’t the issue.
But I still don’t have a clue what the problem is in Kdenlive.
I will do some more research. If somebody has any idea that may help, please let me know.

All the results until now:

Tinker Board

Power consumption @ max performance : 2.2Amps*5V = 11Watt
Kdenlive 10m 1080p render : 1h12m15s
Blender BMW bench : 2h29m42s

Odroid C2

Power consumption @ max performance : 1.5Amps*5V = 7.5Watt
Kdenlive 10m 1080p render : 1h43m01s
Blender BMW bench : 2h00m38s

Khadas Vim2

Power consumption @ max performance : 1.4Amps*5V = 7Watt
Kdenlive 10m 1080p render : 1h43m46s
Blender BMW bench : 1h18m55s
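
To put the two benchmarks side by side, the times above can be converted to seconds and expressed relative to the Vim2. This is only a small sketch in Python: the figures are copied from the results in this post, and the helper names (`to_seconds`, `relative_to`) are made up for illustration.

```python
import re

def to_seconds(stamp):
    """Convert a duration such as '1h43m46s' to seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)m(\d+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognised duration: {stamp!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds

# Figures copied from the results listed in this post.
RESULTS = {
    "Tinker Board": {"kdenlive": "1h12m15s", "blender": "2h29m42s"},
    "Odroid C2": {"kdenlive": "1h43m01s", "blender": "2h00m38s"},
    "Khadas Vim2": {"kdenlive": "1h43m46s", "blender": "1h18m55s"},
}

def relative_to(baseline, test):
    """Return each board's time divided by the baseline board's time."""
    base = to_seconds(RESULTS[baseline][test])
    return {board: to_seconds(times[test]) / base for board, times in RESULTS.items()}

if __name__ == "__main__":
    for test in ("kdenlive", "blender"):
        for board, ratio in relative_to("Khadas Vim2", test).items():
            print(f"{test:9s} {board:13s} {ratio:.2f}x the Vim2 time")
```

On these numbers the Tinker Board needs about 0.70x the Vim2 time in Kdenlive but about 1.90x in Blender, which is the mismatch this thread is trying to explain.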

balbes150 · November 9, 2017, 6:16am

I have not tested this program, so I can only guess. It is probably not optimized to work with these cores. The S912 uses two clusters of cores: 4 cores for “big” loads (frequency 1500 MHz) and 4 cores for “small” loads (frequency 1000 MHz). They differ in frequency, power consumption and performance. When the program starts, it tries to work with all cores and ends up being paced by the “slow” ones. I.e. threads that run on the “fast” cores must wait for data that is still being processed on the “slow” cores, so the entire system effectively operates at the frequency of the slow cores (up to 1000 MHz).

NicoD · November 9, 2017, 6:10pm

Hello balbes150.

Thank you for your reaction.
I believe you must be close with your explanation.
Maybe Kdenlive doesn’t know how to work with those different cores. But it is strange that it goes ok the first 90%, and then suddenly goes so damn slow.
Or maybe there is a problem with the Ubuntu Mate for the Khadas Vim2. I don’t know.

Today I’ve recorded a video about this problem where you can see how the cpu behaves.

Any ideas for further tests? I’ll try another video editor and see whether it behaves the same.
In Blender the Khadas Vim2 did very well, so it’s not the CPU that’s too slow. There’s enough RAM, and the eMMC is fast enough. It’s got to be software related then…
Thank you, have a nice day.

chavdarb · November 10, 2017, 10:41am

ODroid C2 is Amlogic S905 (Quad core Cortex A53 @ 1.536 GHz) and VIM2 is Amlogic S912 (Quad core Cortex A53 @ 1.536 GHz + Quad core Cortex A53 @ 1.0 GHz).
Judging by the Kdenlive results, it seems to me the extra cores on the VIM2 do not get used properly, so the VIM2 performs much like the C2 but with higher clock-for-clock performance because of the higher memory spec (DDR3 on C2 vs DDR4 on VIM2), which more or less cancels out the C2’s overclock.

Blender shows the expected improvement from using more cores. Why this does not work in this specific program, I have no idea.
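
A rough way to check the clock-for-clock idea is to multiply each render time by the clock speed, which gives a crude count of clock cycles ticked through per core. This Python sketch assumes that only the four faster cores matter and that both boards really run at the stated clocks (1.75 GHz for the overclocked C2, 1.536 GHz for the VIM2). Neither assumption has been verified in this thread.

```python
# Kdenlive render times from the first post, in seconds.
C2_SECONDS = 1 * 3600 + 43 * 60 + 1      # 1h43m01s
VIM2_SECONDS = 1 * 3600 + 43 * 60 + 46   # 1h43m46s

# Assumed clocks in MHz: the C2 overclock and the nominal S912 fast cluster.
C2_MHZ = 1750
VIM2_MHZ = 1536

def cycles_per_core(seconds, mhz):
    """Crude work estimate: millions of clock cycles one core ticks through."""
    return seconds * mhz

c2_cycles = cycles_per_core(C2_SECONDS, C2_MHZ)
vim2_cycles = cycles_per_core(VIM2_SECONDS, VIM2_MHZ)
ratio = c2_cycles / vim2_cycles

print(f"C2:   {c2_cycles} Mcycles")
print(f"VIM2: {vim2_cycles} Mcycles")
print(f"C2 needs {ratio:.2f}x the cycles of the VIM2")
```

Under those assumptions the C2 spends about 13% more cycles on the same render. That would fit the faster-memory explanation, but it says nothing about why the extra four cores do not help.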

NicoD · November 10, 2017, 11:09am

Hi chavdarb. That could be so. My Odroid C2 is overclocked to 1.75 GHz, but my Khadas Vim2 has faster memory. That would explain why they both finish at around the same time.
I have tested the Raspberry Pi with Kdenlive at stock settings, then with the CPU overclocked, and then with the RAM overclocked as well. The faster RAM made the biggest improvement. So 1.5 GHz with DDR4 could be as fast as 1.75 GHz with DDR3.
(Video of the Raspberry overclocking)

But then again on that first 90% it shows that all cores are used to the max, and then the last part slows down a lot. Wouldn’t it show less performance overall if it was only utilising the big cores?

I will try another video editor at the weekend. Maybe that will shed some light on this case. I hope so, because I want to use the Khadas for my video editing and rendering.
Thank you very much for your help.
Have a nice day.

chavdarb · November 10, 2017, 12:57pm

VIM2 is using fast DDR4 memory:
3GB Samsung DDR4 2400 MHz
K4A4G165WEBCRC
http://www.samsung.com/semiconductor/products/dram/consumer-dram/ddr4-component/K4A4G165WE?ia=2420
K4A8G165WBBCRC
http://www.samsung.com/semiconductor/products/dram/consumer-dram/ddr4-component/K4A8G165WB?ia=2420
Speed: RC suffix rating: DDR4-2400 (1200MHz @ CL=17, tRCD=17, tRP=17)

I hope you find the issue.

NicoD · January 29, 2018, 8:17pm

I have now bought an Odroid XU4. Like the Khadas Vim2 it has big.LITTLE core technology, but it doesn’t have the problems that the Khadas Vim2 has.
I again made a video about it.

If anybody knows what the problem with the Khadas Vim2 is, please let me know. I would like to use it on my next bicycle trip for video editing and rendering.
Greetings

tasinofan · January 29, 2018, 9:51pm

Could it be that the render software uses a graphics library with hardware support on the XU4?
Just a guess …

balbes150 · January 30, 2018, 7:20am

IMHO Kdenlive is compiled for armhf (with 32-bit optimizations), so it works better on a 32-bit system (where the CPU architecture and the OS itself are 32-bit and use armhf libraries). On the Odroid XU4 all cores are 32-bit and run at a frequency of > 1.75 GHz (8 cores + 32 bit). So old programs work better on it.

NicoD · January 30, 2018, 9:37am

Thank you balbes150. That’s great info. That must be why the Tinkerboard does so well in Kdenlive: it’s also a 32-bit system. I thought that was the reason, but I was not sure of it. I also didn’t know the XU4 was 32-bit, but it is. I didn’t think of that.
Thank you. Have a nice day.

tkaiser · April 23, 2018, 1:19pm

Sorry, but that’s simply not true.

The Exynos 5422 on these ODROIDs has 4 fast ARM cores and 4 slow ones. The fast ones are Cortex-A15 clocked at 2 GHz. The slow ones are Cortex-A7 clocked at 1.4 GHz. The Tinkerboard has 4 fast cores (A17 at 2 GHz).

The S905 has 4 slow cores and S912 has 8 slow cores (A53 is in a line with A7 – the fast families are A15, A17, A72, A73 and so on. A53 is slow but energy efficient).

People love to only look at clockspeeds but that’s useless. An A15 or A17 running at 2 GHz is a lot faster than an A53 running at the same clockspeed (regardless of 64-bit vs. 32-bit). There’s a reason those boards with fast ARM cores consume a lot more energy than those with slow cores like A53.

Then Blender is a lot about memory performance, see: “My new video about the Rock64 with Armbian” (Rockchip, Armbian Community Forums)

Then something strange happened at the end of the Kdenlive test so the usual reaction to something like this should be throwing away the results and re-testing in active benchmarking mode. No one is doing this since all SBC users are happy to only generate meaningless numbers in ‘passive benchmarking’ mode.

If something strange happens it needs to be diagnosed. Most basic measure when benchmarking anything is switching to performance governor prior to executing any tests and then running iostat 10 in another shell in parallel to the benchmark (to see whether strange things happen). Also it’s important to have an eye on real CPU clockspeeds (affected by throttling or vendor cheating – we should not forget we’re dealing here with an Amlogic SoC and those things cheat on us: http://forum.khadas.com/t/s912-limited-to-1200-mhz-with-multithreaded-loads/)
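
As a rough illustration of keeping an eye on governor and clockspeeds while a render runs, here is a minimal Python sketch that polls the standard Linux cpufreq files under `/sys/devices/system/cpu`. The function names `read_cpufreq` and `watch` are made up for this example. Note that `scaling_cur_freq` is only what the kernel reports, and the thread linked above suggests that may not match the real clock on this SoC. Switching to the performance governor is normally done as root by writing `performance` to each `scaling_governor` file; the sketch only reads the files.

```python
import re
import time
from pathlib import Path

SYSFS_CPU = Path("/sys/devices/system/cpu")

def read_cpufreq(root=SYSFS_CPU):
    """Return {cpu number: (governor, current MHz)} from the cpufreq sysfs files."""
    state = {}
    for cpu_dir in root.glob("cpu[0-9]*"):
        match = re.fullmatch(r"cpu(\d+)", cpu_dir.name)
        freq_dir = cpu_dir / "cpufreq"
        if match is None or not freq_dir.is_dir():
            continue  # not a CPU directory, or no cpufreq support
        governor = (freq_dir / "scaling_governor").read_text().strip()
        khz = int((freq_dir / "scaling_cur_freq").read_text())
        state[int(match.group(1))] = (governor, khz / 1000)
    return dict(sorted(state.items()))

def watch(interval=10.0, samples=6, root=SYSFS_CPU):
    """Print one line per sample so clock changes during a render are visible."""
    for _ in range(samples):
        stamp = time.strftime("%H:%M:%S")
        line = "  ".join(
            f"cpu{cpu}:{governor}@{mhz:.0f}"
            for cpu, (governor, mhz) in read_cpufreq(root).items()
        )
        print(stamp, line)
        time.sleep(interval)

# Example: run watch(interval=10, samples=360) in a second shell
# to cover roughly one hour of rendering.
```

Running this next to `iostat 10` would show whether the slow last 10% coincides with a drop in reported clockspeed or with heavy I/O.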

Also it should be noted that users, for whatever bizarre reasons, trust in DDR4 memory being faster than DDR3 (why? Because 4 is a higher number than 3?!) instead of doing the only reasonable thing: testing (just to realize that the Vim2 does not perform that great here, as @g4b42 discovered).

tkaiser · April 23, 2018, 2:03pm

Where’s the proof for this? Anyone ever tested for this?

When you did the sysbench tests last year your results showed exactly the opposite: Armbian for Amlogic S912 - Page 17 - General Chat - Armbian Community Forums

We were only manipulating the allowed maximum cpufreq of the little cluster (stepping from 100 MHz up to 1512 MHz) but this affected all CPU cores. Also it would be pretty weird if the scheduler sent demanding tasks to the little instead of the big cores. The whole idea that a TV box SoC uses big.LITTLE is already weird, since TV boxes don’t run on battery, and using two clusters of identical (slow and energy efficient) A53 cores makes no sense either.

Is anyone here able to simply test for this:

sudo apt install p7zip
taskset -c 0-3 7zr b
taskset -c 4-7 7zr b

If there are two clusters running on different clockspeeds results must vary a lot.

Gouwa · April 23, 2018, 2:07pm

Hi Tkaiser,
Nice to see you here.

Do you have a VIM device? If not, we can arrange free samples for you.

@numbqq please follow up.

numbqq · April 23, 2018, 2:39pm

Hi tkaiser,

Please check the results:

taskset -c 0-3 7zr b

root@Khadas:~# taskset -c 0-3 7zr b

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

RAM size:    2998 MB,  # CPU hardware threads:   8
RAM usage:   1701 MB,  # Benchmark threads:      8

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    1717   395    423   1670  |    46898   396   1069   4229
23:    1506   393    390   1534  |    46251   397   1066   4231
24:    1447   384    405   1556  |    41583   372   1035   3857
25:    1399   396    402   1597  |    42898   397   1016   4034
----------------------------------------------------------------
Avr:          392    405   1589               390   1047   4088
Tot:          391    726   2838

taskset -c 4-7 7zr b

root@Khadas:~# taskset -c 4-7 7zr b

7-Zip (A) 9.20  Copyright (c) 1999-2010 Igor Pavlov  2010-11-18
p7zip Version 9.20 (locale=en_US.UTF-8,Utf16=on,HugeFiles=on,8 CPUs)

RAM size:    2998 MB,  # CPU hardware threads:   8
RAM usage:   1701 MB,  # Benchmark threads:      8

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:    1377   396    338   1339  |    34667   399    784   3126
23:    1338   399    342   1363  |    33984   398    781   3109
24:    1271   398    343   1367  |    33232   399    772   3082
25:    1196   398    343   1366  |    32227   399    759   3030
----------------------------------------------------------------
Avr:          398    341   1359               399    774   3087
Tot:          398    558   2223

Thanks.

tkaiser · April 23, 2018, 2:46pm

Thank you! So CPUs 0-3 are the ‘big’ ones and 4-7 the ‘little’ (on almost all other big.LITTLE implementations it’s different and the little cluster starts with cpu0). The difference between results is not that large so I’m still concerned about maximum clockspeeds when running stuff on all CPU cores.
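
To make "not that large" concrete, the two "Tot:" ratings can be compared with the nominal clock ratio. This is a rough Python sketch: the two report lines are copied from the results above, the 1512 MHz and 1000 MHz clocks are the nominal figures mentioned earlier in the thread, and the linear-scaling assumption is mine. The helper `total_rating` is a made-up name.

```python
def total_rating(report):
    """Return the overall MIPS rating from the 'Tot:' line of a 7-Zip benchmark."""
    for line in report.splitlines():
        if line.startswith("Tot:"):
            return int(line.split()[-1])
    raise ValueError("no 'Tot:' line found")

BIG = "Tot:          391    726   2838"      # taskset -c 0-3
LITTLE = "Tot:          398    558   2223"   # taskset -c 4-7

measured = total_rating(BIG) / total_rating(LITTLE)

# Assumed nominal clocks from earlier in the thread (MHz).
NOMINAL_BIG_MHZ = 1512
NOMINAL_LITTLE_MHZ = 1000
nominal = NOMINAL_BIG_MHZ / NOMINAL_LITTLE_MHZ

# If the little cluster really runs at 1000 MHz and the score scaled linearly
# with clock speed, this is the clock the big cluster would have to be running at.
implied_big_mhz = NOMINAL_LITTLE_MHZ * measured

print(f"measured ratio {measured:.2f}, nominal clock ratio {nominal:.2f}")
print(f"implied big-cluster clock: about {implied_big_mhz:.0f} MHz")
```

On these figures the measured gap (about 1.28x) is well short of the nominal 1.51x. 7-Zip also depends on memory speed, so the implied clock is a rough indication at best.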

Numbers for the openssl and sysbench tests as outlined here and there would be great too.

NicoD · April 24, 2018, 4:18pm

Great to see you guys looking into it.
I’ve read the “S912 limited to 1200 MHz with multithreaded loads” thread.
All great info. I’ve connected my Vim2 again and I’m going to do some more tests.

I will also try the Kdenlive benchmark again with a swap file or zram. I’ve noticed that all the SBCs that do well in it have a swap file or zram. I don’t know whether that is the cause. I’ll let you know when I know more.
Thank you all.

dukla2000 · April 24, 2018, 8:12pm

My 6p: make sure you are running the performance governor (I found that it had a significant latency-type impact on simple tests like timing dd) and make sure you are loading the fast cores, conceivably with something like
taskset -c 0-3
as a prefix to your task/script.
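
If the task is launched from a script, the prefix can be added programmatically. A minimal Python sketch, assuming `taskset` is installed; the helper names `pinned` and `run_pinned` are hypothetical, and how Kdenlive launches its render job is not covered here, so the example uses the `7zr b` benchmark from earlier in this thread.

```python
import subprocess

def pinned(command, cpus="0-3"):
    """Return the command prefixed with taskset so it is restricted to the given CPUs."""
    return ["taskset", "-c", cpus, *command]

def run_pinned(command, cpus="0-3"):
    """Run the command on the given CPUs and return its exit code."""
    return subprocess.run(pinned(command, cpus), check=False).returncode

# Example with the benchmark used earlier in this thread:
#   run_pinned(["7zr", "b"])           # CPUs 0-3
#   run_pinned(["7zr", "b"], "4-7")    # CPUs 4-7
```

Keep in mind that pinning to CPUs 0-3 leaves the other four cores idle, so this is a diagnostic rather than a fix.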

NicoD · April 24, 2018, 8:45pm

Thank you dukla2000.
I’ve just tried a first benchmark with nothing changed. The result was 1h43m14s, so not far off the original ones.
I’ll now try with the performance governor. This run was with ondemand.

I also tried to create a swap file but was unsuccessful. This is what I did.
khadas@Khadas:~$ sudo fallocate -l 1g /swapfile
khadas@Khadas:~$ ls -lh /swapfile
-rw-r--r-- 1 root root 1.0G Apr 24 19:15 /swapfile
khadas@Khadas:~$ sudo chmod 600 /swapfile
khadas@Khadas:~$ ls -lh /swapfile
-rw------- 1 root root 1.0G Apr 24 19:15 /swapfile
khadas@Khadas:~$ sudo mkswap /swapfile
Setting up swapspace version 1, size = 1024 MiB (1073737728 bytes)
no label, UUID=30fd36df-e71c-45e7-8a04-bde604fb56f6
khadas@Khadas:~$ sudo swapon --show
khadas@Khadas:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           2.9G        378M        2.0G         21M        608M        2.4G
Swap:            0B          0B          0B
khadas@Khadas:~$
Swap file was created, but not activated. Any idea?
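
One thing that stands out in the transcript is that `mkswap` is followed directly by `swapon --show`. As far as I know, `swapon --show` only lists swap areas that are already active; activating the file normally takes a separate `sudo swapon /swapfile`. After that, the kernel's own list in `/proc/swaps` should contain the file. The Python sketch below reads that list; the function names are made up for this example.

```python
from pathlib import Path

def parse_proc_swaps(text):
    """Return [(name, size in KiB)] for each active swap area listed in /proc/swaps."""
    areas = []
    for line in text.splitlines()[1:]:      # first line is the column header
        fields = line.split()
        if len(fields) >= 3:
            areas.append((fields[0], int(fields[2])))
    return areas

def active_swaps(path=Path("/proc/swaps")):
    """Read the kernel's list of active swap areas."""
    return parse_proc_swaps(path.read_text())

# Example: print(active_swaps()) gives [] while no swap is active.
```

An empty list here matches the `Swap: 0B` line that `free -h` shows above.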

Also, here’s a screenshot of the point where it starts going slow. You can see that the RAM usage suddenly jumps, and then it slows down. After a while it goes well again, and at the end it stays like that for 30 minutes.

tkaiser · April 24, 2018, 9:17pm

The S912 is no big.LITTLE design so there are no fast cores. It’s just 8 slow A53 and 4 of them allowed to clock up to 1.4 GHz while the other 4 are artificially slowed down to 1.0 GHz.

Since this is known the scheduler has to take care of this and move all demanding tasks to those cores that are allowed to clock slightly faster (as it’s done on real big.LITTLE designs everywhere). If the scheduler is broken by design and the user has to use taskset limiting processes to the 4 slightly faster clocked cores… what’s the purpose of an octa-core SoC then?