Underwhelming performance Khadas Vim2 Max in video rendering kdenlive


#41

Sorry, no experience with swap files so can’t advise.

Another thought: check journalctl -b around the time when kdenlive “goes nuts”. I really don’t expect anything (it looks like an application issue, because it recurs so reliably at the same point in the run), but if there is something at the system level it should show up there.

ps - another thought: do you have any feeling for the CPU load “profile” on the Tinkerboard or the C2? Do they run at more or less 100% CPU for the entire run? (I know nothing about kdenlive, but am fishing for whether it has phases in a run, and when the VIM2 goes nuts it has reached a phase the VIM is diabolical at. Indeed, from a quick scan back through your posts above I am not clear whether there is just a single go-nuts phase, or several?)


#42

is the situation where I have seen a huge difference: a purely IO-bound operation (dd read) with close to 0% CPU gets penalised like crazy, with about a 50% throughput gain from switching to the performance governor. But NicoD has eliminated this anyway.
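For anyone who wants to rule the governor in or out themselves, here is a minimal sketch that prints the active cpufreq governor per core. It assumes the usual sysfs layout (cpuN/cpufreq/scaling_governor); the function name and its optional directory argument are my own, added only so it can be pointed at a test directory.

```shell
#!/bin/sh
# Print the active cpufreq governor for every core found under a sysfs CPU root.
# The optional first argument (default /sys/devices/system/cpu) is an
# illustrative addition so the function can be run against a test directory.
list_governors() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for gov_file in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$gov_file" ] || continue      # skip cores without cpufreq support
        core="${gov_file#"$cpu_root"/}"     # strip the root prefix
        core="${core%%/*}"                  # keep only "cpuN"
        printf '%s %s\n' "$core" "$(cat "$gov_file")"
    done
}

list_governors
```

Switching is normally done by writing the governor name into the same file as root, for example `echo performance | sudo tee /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor`. Whether that changes anything for this particular render is not something I can promise.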


#43

For exactly this reason Armbian includes a utility that does real monitoring (clockspeeds, load and CPU utilization included).

sudo armbianmonitor -m 60

will update this every minute. Should even be able to cope with the two different A53 clusters on S912 since I added big.LITTLE support a while ago: https://github.com/armbian/build/commit/638e4445deded6817eec8ba1301d13bda00f65e4#commitcomment-22171714


#44

The others don’t show this behaviour.
I’ve tested the Rock64, XU4, C2, Tinker Board, and Raspberry Pi 2B, 3B, 3B+.
Only the Khadas doesn’t perform as you would expect. I must say that the Tinker does extremely well compared to the C2. But the C2 with this project goes over its 1.7GB of available RAM, so it needs to use the swap. The Tinker uses less RAM with Lubuntu, and there’s 2GB available. But the Tinker is no option for me because you can’t cool it enough without a fan.
All results are on my YouTube page.


Thanks for the tips. I’m now reinstalling my C2, so I’ll have to wait for the tests with the Khadas. Thanks, cheers.


#45

I suspect you are correct with the swap (and tkaiser with the tmp): again my knowledge of linux tmp file handling is dismal but Prof Duckduckgo turned up a suggestion that when tmp files overflow the allocated memory they get sent to the swap file (my paraphrasing - hope not too misleading). In your case with no swap file it could get ugly.


#46

That’s why benchmarking always needs to be done in active mode. If results are not as expected, do not publish weird numbers but start to ask why. And the tools exist.

iostat 60
vmstat 60
armbianmonitor -m 60

are a great way to get an idea what’s going on (armbianmonitor is only part of Armbian but after creating one single symlink it should work fairly well on any Ubuntu or Debian):

mkdir -p /etc/armbianmonitor/datasources
ln -s /sys/devices/virtual/thermal/thermal_zone0/temp /etc/armbianmonitor/datasources/soctemp
wget -O /usr/local/bin/armbianmonitor https://raw.githubusercontent.com/armbian/build/master/packages/bsp/common/usr/bin/armbianmonitor
chmod 755 /usr/local/bin/armbianmonitor

If that’s not already sufficient, stuff like strace is needed…


#47

Try to explain these facts.

Specify the exact version of the firmware (system) for each Board on which this software was launched. And it is desirable to show the output of the command “uname -a”.

In order for this to work, support is needed in the kernel itself (the configuration must include the appropriate big.LITTLE options, and the options for using all 8 cores must be enabled).

IMHO This indicates that the code is built using A32-only instructions. Otherwise, Odroid would show better or closer handling.


#48

Huh? You’re talking about the Blender numbers? These are not facts but just some random numbers produced with some software on different hardware in passive benchmarking mode. Passive benchmarking does not generate any insights. Collecting those numbers without meaning is not useful other than for providing numbers that clueless people can base decisions on (benchmarking as a marketing instrument).

What we need is ACTIVE BENCHMARKING to generate insights.

It’s totally irrelevant how software A performs on 3 different SBC when it’s about getting an idea why software B starts to behave weird after some time on the Vim2. And I only wanted to give a hint to @NicoD that your remarks about armhf vs. arm64 or even ‘A32 emulation mode on A64 (via hypervisor)’ are misleading since there is no such thing.

He uses kdenlive Ubuntu Xenial arm64 distro packages on both ODROID-C2 and Vim2 and those require of course on both boards also running a 64-bit kernel since otherwise you can’t execute arm64 packages. There is also no hypervisor involved and of course also no ‘emulation’ whatsoever (if he would be running armhf software that would be a matter of Debian Multiarch which allows to simply add armhf packages to arm64 installs… but we’re not talking about this here).

The only question is why kdenlive stops utilizing all CPU cores after some time. And this needs to be diagnosed if we want some insights. At least it’s not related to 32-bit vs. 64-bit.

No, I was only talking about armbianmonitor's ability to report different clockspeeds when there’s more than one CPU cluster. This does not require anything inside the kernel, only the existence of a cpu4 node. Then armbianmonitor will show cpufreq values for both clusters, but this is of no great use with the S912 since the clockspeed values are faked anyway.

If you were talking about getting the scheduler to behave more sensibly (sending demanding tasks to the faster little cores instead of those limited to 1 GHz) then it might make some sense, but I wonder why these settings are not already in place. Even if it’s not big.LITTLE at all here, the weird big.LITTLE emulation mode that artificially bottlenecks 4 cores to 1 GHz requires preferring CPU cores 0-3 for demanding tasks…
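To see which cores the kernel groups at which maximum clock, something like the following could be used. It is only a sketch: it assumes the standard cpufreq sysfs file cpuinfo_max_freq (values in kHz), the function name and directory argument are made up for illustration, and on the S912 the reported values may not be the real clocks, as noted above.

```shell
#!/bin/sh
# Group core numbers by their advertised maximum frequency (kHz), fastest first.
# Output: one line per frequency, followed by a comma-separated core list.
# The optional directory argument is an illustrative addition for testing.
cluster_map() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for freq_file in "$cpu_root"/cpu[0-9]*/cpufreq/cpuinfo_max_freq; do
        [ -r "$freq_file" ] || continue
        core="${freq_file#"$cpu_root"/cpu}"   # "N/cpufreq/cpuinfo_max_freq"
        core="${core%%/*}"                    # keep only the core number
        printf '%s %s\n' "$(cat "$freq_file")" "$core"
    done | sort -k1,1nr -k2,2n | awk '
        # A new frequency starts a new group; print the previous one first.
        $1 != freq { if (NR > 1) print freq, list; freq = $1; list = $2; next }
        { list = list "," $2 }
        END { if (NR > 0) print freq, list }'
}

cluster_map
```

The core list is in the form `taskset -c` accepts, so a demanding job could be pinned with something like `taskset -c 0-3 kdenlive`. Whether pinning actually helps the render on the Vim2 is untested here.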


#49

Huh?

It’s a well known fact that some software when compiled for ARMv7/32-bit needs magnitudes less memory compared to the same software built for ARMv8/64-bit. This is not related to ‘A32-only instructions’ but the size of pointers (data structures) and so on.

See this extreme example: https://github.com/nodesource/distributions/issues/375#issuecomment-290440706

Performance of the task in question is more or less the same, but the armhf variant running with the same 64-bit kernel needs only 363MB while the same software built for arm64 consumes 663MB. This is one of the reasons it can make a lot of sense to run a 32-bit userland on 64-bit platforms when memory constraints are an issue: while 64-bit/ARMv8 code might be slightly faster, the amount of physical DRAM it needs might also be a lot higher.
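To put the two figures from that example into proportion, here is a small worked calculation. The helper name is invented for illustration; the inputs are just the 363MB and 663MB quoted above.

```shell
#!/bin/sh
# Percentage by which the larger memory figure exceeds the smaller one.
# $1 = smaller footprint (armhf build), $2 = larger footprint (arm64 build).
mem_overhead_pct() {
    awk -v small="$1" -v big="$2" \
        'BEGIN { printf "%.0f\n", (big - small) / small * 100 }'
}

# Figures quoted above: 363 MB (armhf) vs 663 MB (arm64)
echo "arm64 build needs $(mem_overhead_pct 363 663) % more memory"
```

That works out to roughly 83 % more memory for the arm64 build in that one example, which is a single data point and not a general rule.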

This might explain why the same software as armhf variant on the Tinkerboard is fine with 2 GB DRAM while it needs much more memory on the ODROID-C2 with same amount of physical memory.

What to do now? Stop speculations and do active benchmarking. Looking at what really happens. The tools are there, they just need to be used!


#50

Indeed 32-bit uses a bit less memory. But here it’s mostly because the Tinker uses Lubuntu, which is a lot lighter than Ubuntu Mate. And the Odroid C2 has got 300MB fixed as video memory. So 1.7GB compared to 2GB available on the Tinker. (I don’t know how the Tinker’s video memory is allocated.)

Also, indeed the C2 is ARM64 and the Khadas VIM2 too. That’s why I don’t think this is the problem. Can I find a 32-bit OS for the C2? Would be a good test.

For me only the Kdenlive render times matter because that’s what I use them for. Let me know if you want something else tested. My Linux knowledge is too small to know what to use. Indeed Kdenlive isn’t a good test to compare the real performance of boards; I do this because I couldn’t find info about which SBC is best for video rendering.

khadas@Khadas:~$ uname -a
Linux Khadas 4.9.40 #2 SMP PREEMPT Wed Sep 20 10:03:20 CST 2017 aarch64 aarch64 aarch64 GNU/Linux

How do I get this info?

This also doesn’t work. It installs OK, but after a reboot nothing shows that it’s activated.
So swap file doesn’t work and zram doesn’t work. Maybe @Gouwa knows why this is?
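One way to check after a reboot whether any swap (file, partition or zram device) is really active is to read /proc/swaps, which lists sizes in KiB. A minimal sketch, with a function name and optional file argument of my own choosing:

```shell
#!/bin/sh
# Summarise active swap areas from /proc/swaps (sizes there are in KiB).
# The optional file argument is an illustrative addition for testing.
active_swap() {
    swaps_file="${1:-/proc/swaps}"
    [ -r "$swaps_file" ] || { echo "cannot read $swaps_file" >&2; return 1; }
    # Line 1 is the column header; every further line is one swap area.
    awk 'NR > 1 { printf "%s %s %d MB used of %d MB\n", $1, $2, $4 / 1024, $3 / 1024; n++ }
         END { if (!n) print "no swap active" }' "$swaps_file"
}

if [ -r /proc/swaps ]; then active_swap; fi
```

If this prints "no swap active" right after the swap or zram setup was installed, the setup did not take effect at boot. If an entry shows up but "used" stays near zero during a render, the project data is not being swapped out.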

Thank you all. I truly appreciate all your help. Cheers


#51

Can you write up the detailed steps you follow to start rendering, and provide the sample source video file you are using? I want to check all the options with different cores and settings on different ARM platforms.


#52

I use this file on every sbc.


Open Kdenlive, click Open. Select “Archived project” instead of “Kdenlive project” (file type). Go to the place where you’ve downloaded it and double-click it. It will ask where to extract. Choose a folder; it will say “Can’t create this folder”. Don’t worry, it’s because it already exists. Then you’ll probably see “Clip Problems”. Click “Search recursively”, and go to the place where you’ve unpacked the files. Click OK.
The project is now open. Don’t browse through the project before clicking on Render, or else everything will be loaded into RAM.
So click Render. There you choose the number of threads you want to use. I always do the render in H.264/AAC (CBR), bitrate 4000, audio bitrate 160. All these settings should already be OK.

That’s it. Then it’s a long time waiting.
Thank you very much.

p.s.: To check the CPU temp I wrote this program. You’ll probably have something better, but you never know, it might be useful.
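The program itself is not shown here, so the following is not it. It is just a minimal sketch of one way to read the SoC temperature, using the thermal_zone0 path mentioned earlier in the thread. It assumes the kernel reports millidegrees Celsius, which is the usual convention but may differ on some vendor kernels; the function name is made up.

```shell
#!/bin/sh
# Print the SoC temperature in degrees Celsius.
# Assumption: the sysfs file holds millidegrees Celsius (e.g. 48500 = 48.5 C).
# The optional file argument is an illustrative addition for testing.
soc_temp_c() {
    temp_file="${1:-/sys/devices/virtual/thermal/thermal_zone0/temp}"
    [ -r "$temp_file" ] || { echo "cannot read $temp_file" >&2; return 1; }
    awk '{ printf "%.1f\n", $1 / 1000 }' "$temp_file"
}

if [ -r /sys/devices/virtual/thermal/thermal_zone0/temp ]; then soc_temp_c; fi
```

Wrapping the call in `while sleep 5; do soc_temp_c; done` would give a simple running log during a render.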


#53

I’m now trying the same in OpenShot on the Khadas. It seems like it’s the same behaviour.


I’ll later try the same on the C2.


#54

@balbes150
I’m getting closer to the solution. I took away all the effects, and then it all went great. 1h07m.
Now doing a bench with only the dissolve effect and it’s all bad.
So now I’ll see what the difference is with the Odroid C2 without all the effects.

1h07m is what I expected of it. There are a lot of effects in this video, so I could do with a lot less.

Stupid of me that I didn’t think of this earlier. In the middle there’s an effect, and there it went slow. The end is all effect and it went slow. How I didn’t notice this earlier…

Thanks for the help. I’ll let you know how the C2 does. Cheers
Update
The C2 did the project without effects in 1h23m11s.
**So from no difference, it’s now a 16-minute difference. Still not as fast as it should be, then. I’ll check the XU4 without effects. -> Result: XU4 without effects is 40m16s.**
But the effects are the biggest problem clearly. I’ll soon have a new list with times.


#55

Some calculations so you can see the difference. It’s very simplified, and there are a lot of other factors in play here besides clockspeed, such as free RAM (the C2 hasn’t got enough for this project, the others do) and RAM speed. But it does give a view of how badly the Khadas is doing.

kdenlive results, 10m 1080p:
Odroid XU4              : 46m23s    13600 MHz
Odroid C2 1.75 GHz O.C. : 1h43m      7000 MHz
Tinker Board            : 1h12m15s   7200 MHz
Khadas Vim2 Max         : 1h44m     10000 MHz
Rock64                  : 2h13m      5200 MHz

C2 > XU4      MHz 51 %     Time 44.6 %
Tinker > XU4  MHz 53 %     Time 64 %
Khadas > XU4  MHz 73.5 %   Time 44.6 %
Rock64 > XU4  MHz 38.25 %  Time 34.8 %

1h07 for the Khadas is 69.2 %.
And if we take into account that the Khadas’s big cores run at 1.4 GHz, then that’s right on the money.
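For anyone who wants to redo these percentages, here is a small sketch of the arithmetic. It assumes the MHz column is the clock summed over all cores and that the "Time" percentage is the XU4 time divided by the other board's time; the helper names are invented for illustration.

```shell
#!/bin/sh
# Convert hours, minutes, seconds to seconds (pass plain integers, no leading zeros).
to_seconds() { echo $(( $1 * 3600 + $2 * 60 + $3 )); }

# Print a / b as a percentage with one decimal.
pct() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b * 100 }'; }

xu4_time=$(to_seconds 0 46 23)      # 46m23s
khadas_time=$(to_seconds 1 44 0)    # 1h44m
khadas_noeffects=$(to_seconds 1 7 0)  # 1h07m, the run without effects

echo "Khadas MHz vs XU4:              $(pct 10000 13600) %"
echo "Khadas speed vs XU4:            $(pct "$xu4_time" "$khadas_time") %"
echo "Khadas without effects vs XU4:  $(pct "$xu4_time" "$khadas_noeffects") %"
```

This reproduces the 73.5 %, 44.6 % and 69.2 % figures for the Khadas. It inherits all the simplifications of the comparison above, so it is a rough sanity check and nothing more.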


#56

I think the results in the last list are mixed up?

It would be interesting to see the results of the Odroid C2 without overclocking (in 1.5 GHz mode).

By the way, in Odroid C2 and Khadas you can easily install LXDE instead of MATE. Then the results will depend less on the size of the occupied memory.


#57

Tried 2 times with the big project at 1.5 GHz; it crashed every time at 1h50m (with only a few minutes left). Didn’t want to wait that long again. So I simplified it by rendering only 1 minute of the Big Buck Bunny video. Again strange results with the Khadas. It doesn’t seem to like Kdenlive. Here are the results.
1 minute render, 1080p
C2
1.75 GHz : 8m09
1.54 GHz : 9m19
1.30 GHz : 10m29
1 GHz    : 13m18

Khadas
1st time : 9m24
2nd time : 9m38
3rd time : 9m23

The Khadas again didn’t use 100% of its capabilities. And that’s without any effects. I did it 3 times. The 2nd time was even a lot slower. I truly don’t know what’s happening here.
Update
I now tried with another 1080p video file, again 1 minute long and with all the same parameters. Again the same behaviour. 1st result 9m51s, 2nd result 9m16s. It’s all over the place.
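To put a number on "all over the place", the run times can be converted to seconds and summarised. A minimal sketch follows; the function name is invented, and "spread" here is simply (max - min) divided by the mean, a choice made for illustration.

```shell
#!/bin/sh
# Summarise repeated run times given in seconds:
# min, max, mean, and spread = (max - min) / mean as a percentage.
run_spread() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" | awk '
        NR == 1 { min = max = $1 }
        { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { printf "min=%d max=%d mean=%.1f spread=%.1f%%\n",
                     min, max, sum / NR, (max - min) / (sum / NR) * 100 }'
}

# Khadas, Big Buck Bunny clip: 9m24, 9m38, 9m23
run_spread 564 578 563
# Khadas, second 1080p clip: 9m51, 9m16
run_spread 591 556
```

With the times posted above this gives a spread of about 2.6 % for the first clip and about 6.1 % for the second. Running the same on another board would show whether that variation is specific to the Khadas.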


#58

Did I understand correctly that when you tried to run the former task (the one described in the 1st post of this topic) on the Odroid C2 in 1.5 GHz mode (without overclocking), the task failed?

Sorry, I didn’t understand what was meant?


#59

Indeed. It wasn’t a hard crash, but a "Render failed" message. I’ve reinstalled the C2; I hope that didn’t mess up anything. It did complete the long bench fully at 1.75 GHz.

I tried the same test on the Khadas 3 times, and I didn’t get the same result every time. I’m out of ideas. I thought I had found it when removing the effects. But with a short 1-minute 1080p video it also doesn’t work well. I don’t see the logic.

Any idea on how to test with a swap file or zram? Or are there any other Linux distros for the S912?
Thank you.


#60

I reinstalled Ubuntu. This time I could install the swap without any issues.
I tried Kdenlive again. But no difference. Still so damn slow.

I think the problem has got to do with how the cpu is given tasks.
I can’t find any other explanation for this behaviour.

Today I ordered a NanoPC T3+. Also Octa core, but one GB less memory. I hope that will do better for me.

I did this on the Odroid C2. It used about 200MB less memory, so that’s great. But the render wasn’t faster. I then found that nothing of the project was moved to the swap, only the OS things. So that didn’t have any influence on the render time of the Odroid. Again a good thing to know.

Thank you.