View Full Version : Where am I wasting most CPU cycles for least impact in quality?


gamebox
26th April 2016, 16:53
So, it's time for me to accept the fact that the video codec of tomorrow doesn't run very well on yesterday's technology (AMD Phenom II Quad-Core at 2.7GHz). Now I need the best compromise in settings that will make regular encoding to HEVC possible.

When encoding with x264 I used a slightly modified "veryslow" preset with 9 refs and b-frames for 720p (4 for 1080p); ME range was 24 for 480p, 32 for 720p and 48 for 1080p, with Trellis "Always On" and "Auto Variance" AQ. Typical encoding speed was about 1 fps for 720p using only one CPU core. With x265 I went for a somewhat modified "slower" preset (the presets beyond it provided little improvement with significant slowdowns according to tests), but I got a speed of about 0.5 fps for 480p using all 4 cores. So I need your advice about some settings here: which ones should I change for a speedup of about 30-40%?

- wpp is enabled as I decided to sacrifice 1% quality for about 40% speedup (98% CPU vs 60-70%)
- max CU size 64, min 8, max TU size 32, max depth 2 inter, 2 intra
- ME star (would like to keep it), range 57 (this is where I have questions - why is range always the same in presets - isn't 57 unnecessarily large for 480p or even 720p?), merge set to 3
- b-pyramid, weightp, weightb all on
- 6 ref for up to 720p, ref-limit CU 1, depth 0. B-frames set to 9. Wouldn't want to lower refs, and also many encodes I did in x264 used 6-7 B-frames in low motion scenes, so I would like to keep B-frame count high
- AQ-mode 2, strength 1, qg-size 32, cutree off (deliberately so)
- uses switches: rect, amp, limit-modes, rd 6, psy-rd 1, rdoq 2, psy-rdoq 1, signhide, tmvp, b-intra, strong-intra-smoothing, deblock -1 -1, sao

How useful are the "rect" and "amp" I set manually? How much quality would I lose, and how much speed would I gain, in percent, by switching them off? How much speed benefit (if any) does lowering ME range bring, and is a value of 57 really necessary for all resolutions? How useful are b-intra, rd 6 and tmvp? I would like to keep most of "slower" and the motion analysis parameters, plus the number of refs and b-frames, and remove just the ones that waste the most CPU cycles for the least impact on quality. A speedup of about 30-40% is my goal, as I plan to assemble another similar low-cost quad-core system to aid in encoding.

Thanks in advance for any help :)

RainyDog
26th April 2016, 19:00
My 2 cents :-

- Set max CU size to 32 and max TU to 16 for 1080p and below for an all round win. Better quality, faster encoding and smaller filesizes to boot. Max inter and intra depth at 2 is fine.

- Keep ME star as it's almost the same speed as hex but better quality than even umh in my experience. Reduce me-range to 40 for 1080p, 30 for 720p and 20 for SD resolutions. Even those are overkill as a range of 57 is to cover 4k resolutions and 1080p is 25% of the pixels of 4k. So do the maths and a range of 16 per x264 default is probably still suitable for 1080p. Max merge 3 is fine.

- Keep b-pyramid, weightp, weightb all on.

- Refs and b-frames don't have much impact on encoding speed unless you set both to max, really. They do make decoding more tricky though. I mostly only do 1080p film encoding and leave refs on 4 and b-frames on 8 for the occasional content that does use that many b-frames.

- AQ mode 2 I mostly use as well, but I change qg-size to 16. Cutree off is to taste, but at the same CRF, bitrate and filesize shoot up whilst encoding speed drops compared to leaving it on. Personally I leave it on for the CRF ranges I target these days as, like x264's mb-tree, it helps more at lower bitrates.

- Turn both rect and amp off for a big speed saving with no quality impact in my tests. Limit-modes on, rd 4, psy-rd 2, rdoq level 2 and psy-rdoq 10 are what I mostly stick to for 1080p film content. Other than that, I set deblock to -2 -2 and turn sao off as it's a bit aggressive with over-smoothing for my taste.

- Finally, I also turn early-skip on for a solid speed boost with no ill effects as far as I've seen.

Good luck and let us know how you get on.
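
The me-range arithmetic above can be checked with a small sketch. It assumes, as the post does, that the default range of 57 is sized for 4K (3840x2160), and it shows two possible ways of scaling it down: by total pixel count and by frame width. The function names are made up for illustration, and which scaling is the appropriate one is a judgment call, not something the code settles.

```python
# Reference point taken from the post: the default me-range of 57,
# treated here (an assumption from the thread) as sized for 4K.
REF_RANGE = 57
REF_WIDTH, REF_HEIGHT = 3840, 2160


def scale_by_pixels(width: int, height: int) -> float:
    """Scale the range by the ratio of total pixel counts."""
    return REF_RANGE * (width * height) / (REF_WIDTH * REF_HEIGHT)


def scale_by_width(width: int) -> float:
    """Scale the range by the ratio of frame widths (one linear dimension)."""
    return REF_RANGE * width / REF_WIDTH


for name, (w, h) in {"1080p": (1920, 1080), "720p": (1280, 720)}.items():
    print(f"{name}: by pixels {scale_by_pixels(w, h):.2f}, "
          f"by width {scale_by_width(w):.2f}")
```

For 1080p the pixel-count scaling gives 14.25 and the width scaling gives 28.5, so the two approaches differ by a factor of two at that resolution.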

Motenai Yoda
29th April 2016, 20:24
- "Set max CU size to 32 and max TU to 16 for 1080p" doesn't give me any boost in quality or speed, just less bitrate and quality for the same CRF.
- ME range is a one-dimensional thing. 4K has double the dimensions of 1080p, so "doing the maths" it's 28.5 --> 32, not 16, and with x264 32 or 24 is usually used for 1080p.
Also, with hex, merange is almost irrelevant to speed. My results with hex, umh and star at merange 57/24 (CPU is an i7-920; on AVX2 CPUs it will be different):

me    merange  fps   bitrate
hex   57       7.40  1445.72
hex   24       7.42  1464.07
umh   57       5.20  1443.99
umh   24       6.05  1453.87
star  57       5.85  1445.03
star  24       6.55  1456.53
Metrics and visual quality are pretty much the same; speed-wise it seems umh isn't as optimized as star.
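
To put the table above into percentages, here is a rough sketch that computes how much speed and bitrate change when merange drops from 57 to 24 for each search method. The numbers are copied from the table; the dictionary layout and the helper name are just for illustration.

```python
# (method, merange) -> (fps, bitrate), copied from the table above.
results = {
    ("hex", 57): (7.40, 1445.72),
    ("hex", 24): (7.42, 1464.07),
    ("umh", 57): (5.20, 1443.99),
    ("umh", 24): (6.05, 1453.87),
    ("star", 57): (5.85, 1445.03),
    ("star", 24): (6.55, 1456.53),
}


def change_pct(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


for method in ("hex", "umh", "star"):
    fps_57, bitrate_57 = results[(method, 57)]
    fps_24, bitrate_24 = results[(method, 24)]
    print(f"{method}: fps {change_pct(fps_57, fps_24):+.1f}%, "
          f"bitrate {change_pct(bitrate_57, bitrate_24):+.1f}%")
```

On these figures the speed gain from the smaller range is well under 1% for hex and roughly 12-16% for star and umh, while bitrate rises by about 1% or less in every case.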

As for the other things, the only one I don't like much is aq2; I prefer aq1, or aq3 for dark stuff. I also use qg-size 16 and no-sao for film content, but not for anime content, and I think increasing psy-rd instead of psy-rdoq gives better results.

Anyway, I increase qcomp a bit, and use -F 1 and --lookahead-slices 0

RainyDog
30th April 2016, 09:38
Thanks for your input. I too prefer a higher qcomp (usually 0.8) combined with a higher CRF (e.g. 22) over the default qcomp 0.6 with a lower CRF (e.g. 20).

Can you elaborate on why you don't like aq-mode2? From my experience the results aren't wildly different to AQ1 but, as all sources benefit from different AQ strengths, AQ2 seems to do a good job of varying the strength itself.

I keep away from aq-mode 3 though. Wastes bits and skews distribution to my eyes.

JohnLai
30th April 2016, 14:09
@Motenai Yoda: Huh? But in the x264 --tune animation preset, psy-rd is reduced (assuming psy-rd in x265 is more or less similar to x264's).
Why would increasing psy-rd increase encoded anime quality?

x264 --tune animation:
--bframes {+2}
--psy-rd 0.4
--aq-strength 0.6
--ref {Double if >1 else 1}
--deblock 1:1

gamebox
30th April 2016, 15:37
Thanks for help, guys :)

I will first remove --rect and --amp. I will set merange to 24 for 480p, 32 for 720p and 48 for 1080p, max CU size to 32/max TU to 16 for 1080p, and qg-size to 16. Judging by Motenai Yoda's post, I should expect at least about a 15% improvement in speed.
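
The plan above can be written down as a small helper that picks the options by frame height. This is only a sketch: the function name is hypothetical, the flags are the x265 CLI switches already named in this thread, and applying --ctu 32/--max-tu-size 16 to everything up to 1080p (rather than to 1080p only) is an assumption based on the earlier "1080p and below" advice.

```python
from typing import List


def x265_speed_args(height: int) -> List[str]:
    """Return the planned x265 speed-related switches for a given frame height."""
    # merange per resolution, as listed in the post: 24 / 32 / 48.
    if height <= 480:
        merange = 24
    elif height <= 720:
        merange = 32
    else:
        merange = 48

    args = ["--no-rect", "--no-amp",
            "--merange", str(merange),
            "--qg-size", "16"]

    # Assumption: smaller CTU/TU for 1080p and below, per the earlier advice.
    if height <= 1080:
        args += ["--ctu", "32", "--max-tu-size", "16"]
    return args


for h in (480, 720, 1080):
    print(h, " ".join(x265_speed_args(h)))
```

Keeping the switches in one function like this makes it easier to run the same comparison at each resolution when re-testing later.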

What is your opinion on these switches, how much speed benefit and quality loss can I expect:
--early-skip
--fast-intra
--tskip
--tskip-fast
--rd 4
--max-merge 4
--no-lossless
--no-cu-lossless

And when it comes to SAO, I know it is very CPU intensive, but does switching it off increase ringing and blocking, or distort edges and brightly colored objects, etc.? I expected a lot from it as it is a new algorithm in HEVC. How much speedup would --no-sao bring?

What about 10-bit encoding? From what I understand, it should bring a massive decrease in banding and blocking, perhaps even better edge retention, and all that with a bitrate reduction. Why is it not considered mainstream? Will decoder chips not support it, and if not, why? Can I encode regular video to 10-bit primarily for the sake of improved compression, and better edge treatment too, even though I would play it later on a plain Full-HD LED monitor?

birdie
30th April 2016, 19:56
Why would you want to use x265 at all? Given your compromises, x264 sounds like a much better deal, not to mention that x265 will ultimately yield a softer and smoother picture, which is not necessarily good unless you encode anime.

MeteorRain
30th April 2016, 20:59
Also I noticed that you are probably using something like X4 810, which is pretty old and slow.

Upgrade to an X56xx (~$40) and you'll see a 3x speed increase, or to an E5-2670 ($80) for a 5x speed increase. Far better than that tuning, which merely gives you a 15%~20% difference.

Just my $0.01

benwaggoner
1st May 2016, 23:09
@JohnLai: Different codecs can behave quite differently with the same content, although I would think the x264 --tune animation values would be a good ballpark starting point for an x265 --tune animation.

Getting an actual --tune animation preset would be a good thing! HEVC should be a particularly strong codec for hand-drawn animation. I'm curious to see what some of the HEVC v2 features for screen recording can do when applied to cel animation, since they have similar frequency characteristics.

Motenai Yoda
4th May 2016, 15:17
@RainyDog: AQ2 looks to me like it varies a bit too much and is a bit inconsistent between frames (i.e. the same zone can be boosted in one frame and not in the next).

@benwaggoner: Yep indeed, IMHO with vp8/9 aq1 is bad, aq0 good, and aq2 in the middle.

@JohnLai: I was referring to RL content in that part; for anime I usually lower psy-rd to about 1.0 and leave psy-rdoq and rdoq-level at 2.

gamebox
5th May 2016, 10:19
Did some testing after the CPU finally freed up... :)

AQ Strength 2 and Psy-RD 2.0 (Rdoq 1.0) brought appreciable improvements in subjective quality. My starting settings of 1 for both were very conservative. Regarding speedup, things are as follows:
- 0.46 fps was my starting speed with no speedups whatsoever
- 0.55 fps with --limit-modes, --limit-refs 1, --ctu 32, --max-tu-size 16, --merange 24, --qg-size 16
- 1.16 fps all above plus --no-rect and --no-amp
- 1.85 fps all above and --early-skip
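
As a quick check of what those fps figures mean in relative terms, the following sketch computes the gain of each step over the previous one and over the starting speed. The numbers are copied from the list above; the helper name is just for illustration.

```python
# (description, fps) pairs copied from the list above;
# each step includes all of the previous ones.
steps = [
    ("baseline, no speedups", 0.46),
    ("limit-modes, limit-refs 1, ctu 32, max-tu-size 16, merange 24, qg-size 16", 0.55),
    ("plus no-rect and no-amp", 1.16),
    ("plus early-skip", 1.85),
]


def gain_pct(old_fps: float, new_fps: float) -> float:
    """Speed gain of new_fps over old_fps, in percent."""
    return (new_fps / old_fps - 1.0) * 100.0


baseline_fps = steps[0][1]
previous_fps = baseline_fps
for description, fps in steps[1:]:
    print(f"{fps:.2f} fps: {gain_pct(previous_fps, fps):+.1f}% vs previous step, "
          f"{gain_pct(baseline_fps, fps):+.1f}% vs baseline ({description})")
    previous_fps = fps
```

On these numbers the first group of switches gives roughly a 20% gain, while dropping rect/amp more than doubles the speed of the step before it.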

The clip on which I tested was 480p (DVD source) with decent amount of movement/complexity, target bitrate is 1.5 Mbps for the entire movie (it came to about 1.7 in this particular segment during my first test - compression of an entire movie - so I did this short test at 1.7). x264 with CRF 18 required somewhat over 2 Mbps for the entire movie.

Unfortunately, --early-skip brought significant visual distortion, especially around less pronounced edges with a fair amount of frame-to-frame movement, so I won't use it. Switching off --rect and --amp also noticeably reduced visual quality in similar areas, much more than I expected (the low resolution, short duration and fair amount of motion in the clip might have made it more obvious than it otherwise would be). With smaller CTUs I got overall sharper details, but also more ringing throughout the background, probably as pronounced as the improvement where it matters. The ringing is not unbearable, just more obvious and one step closer visually to x264 with deblock -1. SAO also proved useful in reducing ringing around more pronounced edges, even though it does reduce details elsewhere too.

However, improvement in details and less pronounced edges I noticed might be primarily attributed to stronger AQ and RD instead of smaller CTUs, so further testing is required. For the time being, it seems reasonable to stick to a minimal range of speedups, certainly no --early-skip and to leave rect/amp both on at least for SD. As I said, this is an incomplete test on a low resolution clip where any distortion is more obvious, and there are significant areas in the background with rather low complexity where largest CTUs might make a difference and save precious bits for complex parts of the scene. More testing remains to be done, I'll also try using the switches --fast-intra, --tskip, --tskip-fast, --rd 4, --subme 6, perhaps even --aq-mode 3.

gamebox
6th May 2016, 12:10
Small update after some further testing of: --limit-refs 3, --rd 4, --fast-intra, --subme 6, --rd-refine, --no-strong-intra-smoothing, --constrained-intra, --sao-non-deblock, --aq-mode 3.

Switching off rect and amp results in a slight loss of "edge fidelity". The cause might be the SD resolution, but I definitely see it, and it has more influence on visual quality than limit-refs, for instance. I will re-evaluate rect/amp's influence on quality for HD clips, though. Most options bringing appreciable speedup (through significant complexity reduction), quite expectedly, result in obvious quality losses. So rd 4, fast-intra, constrained-intra, and even subme 6 are unacceptable for me at the moment. I tested AQ mode 3 out of curiosity, but the results are worse than AQ 2 (the clip is dominated by bright colors and scenery, so no surprises here). Retention of less pronounced edges/details, especially in zones of strong motion, is my main goal. Strangely, rd-refine didn't bring any improvement; it even looked worse.

What had a positive impact was limit-refs 1 and, unexpectedly, limit-refs 3: very little impact on quality with appreciable speedup (5% for the first and over 30% for the second option). Sao-non-deblock also positively influenced quality while slightly slowing down the process (about 2%), and so did no-strong-intra-smoothing with CTUs of 64, which brought a slight speedup (also about 2%). Can anyone explain to me what sao-non-deblock does? I see it is off by default and I guess there is a reason for it. I also want to ask if no-strong-intra-smoothing influences smaller CTUs as well.

After watching these new test clips I also decided to limit CTUs to 32 again, as many recommended. That does slightly decrease quality - primarily in backgrounds - but positively impacts important details. Encoding with a limited CTU size is also about 20% faster. Again, will re-evaluate for HD as I will aim for rather low bitrates (2Mbps for 720p, 3Mbps for 1080p - I encoded in x264 using 3-4 and 5-6 for those respectively).
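
To make the bitrate targets above concrete, this sketch computes the saving each x265 target represents relative to the x264 bitrates quoted for the same resolution. The figures are the ones in the post; the dictionary layout and helper name are just for illustration.

```python
# resolution -> (x265 target, (low, high) x264 bitrate), all in Mbps,
# as quoted in the post above.
targets = {
    "720p": (2.0, (3.0, 4.0)),
    "1080p": (3.0, (5.0, 6.0)),
}


def saving_pct(x265_mbps: float, x264_mbps: float) -> float:
    """Bitrate saving of the x265 target relative to the x264 bitrate, in percent."""
    return (1.0 - x265_mbps / x264_mbps) * 100.0


for resolution, (x265_mbps, (x264_low, x264_high)) in targets.items():
    print(f"{resolution}: {saving_pct(x265_mbps, x264_low):.0f}% to "
          f"{saving_pct(x265_mbps, x264_high):.0f}% below the x264 bitrate")
```

In other words, the targets amount to cutting bitrate by roughly a third to a half compared with the earlier x264 encodes.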

The system used is based on an Athlon II X3 425 CPU. It is unlocked to a quad-core Phenom II (2.7GHz per core, 2MB L2, 6MB L3). RAM is DDR2 at 1066MHz, dual-channel. @MeteorRain: My upgrade options at the moment are limited without an appreciable budget. An Intel CPU would mean a new motherboard and RAM as well, and so would upgrading to DDR3 with the current CPU. This CPU cost about $25 secondhand; now I'm looking for another cheap quad-core CPU to use in parallel, as I have some spare hardware (AM2+ motherboard, RAM) lying around.

netsky123
2nd July 2016, 05:58
Here are my settings for HQ anime encodes:
x265-10b --y4m --preset slower --crf 17.0 --ctu 32 --max-tu-size 16 --tu-intra-depth 3 --tu-inter-depth 3 --rdpenalty 2
--me 3 --subme 5 --merange 25 --b-intra --no-rect --no-amp --ref 5 --weightb --keyint 720 --min-keyint 1 --bframes 10
--aq-mode 1 --aq-strength 1.1 --rd 5 --psy-rd 1.2 --psy-rdoq 14.0 --rdoq-level 1 --no-sao --no-open-gop --rc-lookahead 80
--scenecut 40 --max-merge 4 --qcomp 0.78 --no-strong-intra-smoothing --deblock -2:-2 --input-depth 10 -o output.265 -

For 1080p 10-bit encoding, I can get about 2 fps on my Intel QBC1 processor.

me=3 is pretty fast now; I used to use me=2 because me=3 was too slow for me. The default merange=57 is just insane: it has a great impact on speed but isn't helpful for quality.

SAO works like a strong blur filter; you're going to lose a lot of detail. DON'T use it.

benwaggoner
6th July 2016, 00:39
What build are you using? The new --rskip and other changes in the last month have probably changed things significantly for optimum anime encoding. I'd think using --qg-size 16 would also be helpful.