View Full Version : x265 HEVC Encoder


LigH
7th February 2019, 09:08
So if there has been a v3.0 release for 2 weeks, why do I still receive an RC version ... did I not update to the "tip"? Usually I do...

Ah, there was no back merge. Stable and default branch are still distinct.

jlpsvk
7th February 2019, 10:28
So if there has been a v3.0 release for 2 weeks, why do I still receive an RC version ... did I not update to the "tip"? Usually I do...

Ah, there was no back merge. Stable and default branch are still distinct.

Barought wrote that a few days ago:

Our plan is to continue to use 3.0_RC on the default branch and have completed tags only on the stable branch. So we don't intend to merge back.

So RC14 is newer than 3.0+1

pradeeprama
7th February 2019, 12:02
Barought wrote that a few days ago:

Our plan is to continue to use 3.0_RC on the default branch and have completed tags only on the stable branch. So we don't intend to merge back.

So RC14 is newer than 3.0+1

We are starting to use 3.0_Au (Gold) to signal that the default branch has moved to gold state. We are just trying to avoid merging back from stable to default as it is good practice to avoid merges. Hope this isn't too confusing!

benwaggoner
7th February 2019, 19:38
HVS is more sensitive to yellow-red chrominance (>>550nm) than violet-blue (<<500nm). Red has always been a mess in encoding, especially when encoding in yuv420 where chroma is subsampled and red pictures appear in their glorious blockiness.

There's still a lot of work to be done to optimize encoders to accommodate perception: we should improve quantization both in dark pictures and in red-dominant pictures, but this also depends on the viewing conditions (i.e. not so important on small mobile screens)

Not to mention higher order of complexity optimizations like saliency based optimizations or similar.

I have many times developed models that change params using zones-like params to improve the performance of a given encoder. This is one of those cases.
Yeah, very flat gradients in red and blue have always been hard (since luma is mainly green, green is pretty much non subsampled).

Using --crqpoffs -* would help red without spending more bits on blue. It makes a certain logical sense that, when decreasing chroma QP, it would make sense to reduce 2 red for every blue given the relatively higher sensitivity to red. However, red also accounts for a greater portion of luma than blue which likely would offset that some.
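The 2:1 red-to-blue idea above can be sketched as a small Python helper that builds the two x265 arguments. This is only an illustration of the ratio, not a tested recommendation: the function name `chroma_offset_args` is hypothetical, and the clamp to -12..12 assumes that is the documented range for `--cbqpoffs` and `--crqpoffs`.

```python
def chroma_offset_args(blue_steps):
    """Build x265 chroma QP offset arguments with a 2:1 red-to-blue ratio.

    blue_steps is how far to lower the Cb (blue) QP offset; the Cr (red)
    offset is lowered twice as far. Hypothetical helper for illustration.
    """
    def clamp(value):
        # Assumed valid range for both options: -12..12
        return max(-12, min(12, value))

    cb = clamp(-blue_steps)
    cr = clamp(-2 * blue_steps)
    return ["--cbqpoffs", str(cb), "--crqpoffs", str(cr)]


# One step for blue, two for red
print(" ".join(chroma_offset_args(1)))
```

Whether the ratio actually helps depends on the content, since red also contributes more to luma than blue does.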

Encoding in 10-bit instead of 8-bit should help as well. Sometimes these problems can be as much in the 10-bit Y'CbCr -> 8-bit RGB conversion on a device as in the encoding itself. Display controllers that just truncate to 8-bit instead of dithering can cause all kinds of issues even in SDR. Dithering should be done even when doing a 16-235 Y'CbCr to 0-255 RGB conversion.
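To make the truncation-versus-dithering point concrete, here is a minimal Python sketch on a single row of 10-bit samples. It uses simple 1-D error diffusion as the dither, which is an assumption for illustration only and not what any particular display controller or x265 does; both function names are hypothetical.

```python
def truncate_to_8bit(samples10):
    """Drop the two least significant bits of each 10-bit sample."""
    return [s >> 2 for s in samples10]


def dither_to_8bit(samples10):
    """Round each 10-bit sample to 8 bits, carrying the rounding error forward."""
    out = []
    err = 0
    for s in samples10:
        v = s + err                           # sample plus carried error
        q = min(255, max(0, (v + 2) >> 2))    # round to nearest 8-bit level
        err = v - (q << 2)                    # error left over, in 10-bit units
        out.append(q)
    return out


flat = [514] * 8                  # 10-bit level that sits at 128.5 in 8-bit terms
print(truncate_to_8bit(flat))     # every sample collapses to the same 8-bit level
print(dither_to_8bit(flat))       # samples alternate so the average level survives
```

With truncation the half-step is simply lost, which is how smooth gradients turn into bands; with the error carried forward the local average stays at the original level.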


To humorously paraphrase Gen. Robert H. Barrow, USMC

"Hobbyists talk about average bitrates and per-title metrics, but experts study color volume algorithms and rate control."

benwaggoner
7th February 2019, 19:42
should be optional and do no harm, so you can probably use it without running into issues.

All it does is remove duplicates of identical metadata; the majority of frames in a given title will be identical to the prior frame in display order. It'll still ensure that every IDR has its metadata.

This can save some bits (only material at very low bitrates), and potentially some work by the tone mapper. Since display<>decode order, you'll get cases where the tone mapper could find out in advance that several frames in a row have identical color characteristics.
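The behaviour described above (drop metadata that is identical to the previous frame in display order, but always keep it on IDR frames) can be sketched in Python as follows. The function name and the `(is_idr, metadata)` tuple layout are hypothetical, and this is not x265's actual code.

```python
def frames_needing_metadata(frames):
    """Return indices of frames that must carry their metadata.

    frames: list of (is_idr, metadata) tuples in display order.
    A frame keeps its metadata if it is an IDR or if the metadata
    differs from the previous frame's metadata.
    """
    keep = []
    previous = None
    for index, (is_idr, metadata) in enumerate(frames):
        if is_idr or metadata != previous:
            keep.append(index)
        previous = metadata
    return keep


sequence = [
    (True, "A"),   # IDR: always kept
    (False, "A"),  # same as previous: dropped
    (False, "B"),  # changed: kept
    (False, "B"),  # same as previous: dropped
    (True, "B"),   # IDR: kept even though unchanged
]
print(frames_needing_metadata(sequence))
```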

katzenjoghurt
7th February 2019, 21:43
HVS is more sensitive to yellow-red chrominance (>>550nm) than violet-blue (<<500nm). Red has always been a mess in encoding, especially when encoding in yuv420 where chroma is subsampled and red pictures appear in their glorious blockiness.

There's still a lot of work to be done to optimize encoders to accommodate perception: we should improve quantization both in dark pictures and in red-dominant pictures, but this also depends on the viewing conditions (i.e. not so important on small mobile screens)

Not to mention higher order of complexity optimizations like saliency based optimizations or similar.

I have many times developed models that change params using zones-like params to improve the performance of a given encoder. This is one of those cases.

Thanks for the confirmation, sonnati. :)
I'll continue sticking to zones then, or live with raising the overall bitrate beyond general need if the pain becomes too much.

benwaggoner
8th February 2019, 00:51
Thanks for the confirmation, sonnati. :)
I'll continue sticking to zones then, or live with raising the overall bitrate beyond general need if the pain becomes too much.
And when you find examples that need zones, grab that snippet and open an Issue with MCW. An encoder should be able to detect when this would be an issue and then compensate for it.

redbtn
12th February 2019, 20:50
I've made some tests for 1080p with --tune-psnr, --tune-ssim, and no-tune for different variants --ctu --max-tu-size --qg-size

Does this mean that for 1080p encoding default settings (--ctu 64 --max-tu-size 32 --qg-size 32) is the best option?

Results:

no-tune
===
--ctu 64 --max-tu-size 32 --qg-size 32 Global PSNR: 55.009, SSIM Mean Y: 0.9965665 (24.643 dB) higher value
--ctu 32 --max-tu-size 32 --qg-size 16 Global PSNR: 54.982, SSIM Mean Y: 0.9965609 (24.636 dB)
--ctu 32 --max-tu-size 32 --qg-size 8 Global PSNR: 54.866, SSIM Mean Y: 0.9964265 (24.469 dB)
--ctu 32 --max-tu-size 16 --qg-size 16 Global PSNR: 54.908, SSIM Mean Y: 0.9964738 (24.527 dB)
--ctu 32 --max-tu-size 16 --qg-size 8 Global PSNR: 54.791, SSIM Mean Y: 0.9963344 (24.359 dB)
===

tune-psnr
===
--ctu 64 --max-tu-size 32 --qg-size 32 Global PSNR: 54.936 higher value
--ctu 32 --max-tu-size 32 --qg-size 16 Global PSNR: 54.910
--ctu 32 --max-tu-size 32 --qg-size 8 Global PSNR: 54.806
--ctu 32 --max-tu-size 16 --qg-size 16 Global PSNR: 54.869
--ctu 32 --max-tu-size 16 --qg-size 8 Global PSNR: 54.765
===

tune-ssim
===
--ctu 64 --max-tu-size 32 --qg-size 32 SSIM Mean Y: 0.9964914 (24.549 dB) higher value
--ctu 32 --max-tu-size 32 --qg-size 16 SSIM Mean Y: 0.9964839 (24.539 dB)
--ctu 32 --max-tu-size 32 --qg-size 8 SSIM Mean Y: 0.9963653 (24.395 dB)
--ctu 32 --max-tu-size 16 --qg-size 16 SSIM Mean Y: 0.9964291 (24.472 dB)
--ctu 32 --max-tu-size 16 --qg-size 8 SSIM Mean Y: 0.9963079 (24.327 dB)
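The dB figures in parentheses appear to follow from the SSIM means as -10 * log10(1 - SSIM). A short Python check under that assumption (the function name `ssim_db` is hypothetical) reproduces the no-tune numbers and shows how small the gap between the top two settings is:

```python
import math


def ssim_db(ssim_mean):
    """Convert a mean SSIM value to dB as -10 * log10(1 - SSIM)."""
    return -10.0 * math.log10(1.0 - ssim_mean)


best = ssim_db(0.9965665)      # --ctu 64 --max-tu-size 32 --qg-size 32
second = ssim_db(0.9965609)    # --ctu 32 --max-tu-size 32 --qg-size 16
print(f"{best:.3f} dB vs {second:.3f} dB, difference {best - second:.3f} dB")
```

Differences of this size are far below what these metrics can say about visible quality, which is worth keeping in mind when ranking the settings.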

Selur
12th February 2019, 20:52
Does this mean that for 1080p encoding default settings (--ctu 64 --max-tu-size 32 --qg-size 32) is the best option?
to state the obvious: only if your quality perception aligns with PSNR and SSIM Mean,..

benwaggoner
12th February 2019, 21:05
to state the obvious: only if your quality perception aligns with PSNR and SSIM Mean,..
The modern codec design process biases heavily for fixed QP maximizing mean PSNR, so adaptive quant and psychovisual optimizations are generally only helpful when targeting subjective quality. Which is all viewers care about. I’ve seen lots of beautiful PSNR plots turn into lousy looking video.

redbtn
12th February 2019, 21:15
to state the obvious: only if your quality perception aligns with PSNR and SSIM Mean,..

The modern codec design process biases heavily for fixed QP maximizing mean PSNR, so adaptive quant and psychovisual optimizations are generally only helpful when targeting subjective quality. Which is all viewers care about. I’ve seen lots of beautiful PSNR plots turn into lousy looking video.

I do not have enough knowledge yet to make difficult conclusions. I'm just trying to find the optimal settings for 4K HDR -> 1080p HDR.
Which option can you advise?

And one more question. Can I somehow optimize the settings for quality improvement without significantly reducing speed? Or will these settings be enough? My target bitrate is 20-25 Mb/s.

--preset placebo --crf 12 --profile main10 --output-depth 10 --level-idc 51 --sar 1:1 --colorprim 9 --colormatrix 9 --transfer 16 --range limited --master-display "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)" --hdr --hdr-opt --chromaloc 2 --repeat-headers --hrd --max-cll "0,0" --min-luma 0 --max-luma 1023 --subme 7 --me star --merange 48 --limit-tu 4 --max-merge 4 --limit-modes --limit-refs 3 --rd 6 --rd-refine --no-tskip --rskip --no-cutree --no-sao --no-open-gop --no-b-pyramid --no-strong-intra-smoothing --vbv-bufsize 160000 --vbv-maxrate 160000 --cbqpoffs -2 --crqpoffs -2 --min-keyint 5 --keyint 240 --ipratio 1.30 --pbratio 1.20 --deblock -3:-3 --qcomp 0.65 --aq-mode 2 --aq-strength 0.8 --psy-rd 1.5 --psy-rdoq 5

excellentswordfight
13th February 2019, 16:08
I do not have enough knowledge yet to make difficult conclusions. I'm just trying to find the optimal settings for 4K HDR -> 1080p HDR.
Which option can you advise?

And one more question. Can I somehow optimize the settings for quality improvement without significantly reducing speed? Or will these settings be enough? My target bitrate is 20-25 Mb/s.

--preset placebo --crf 12 --profile main10 --output-depth 10 --level-idc 51 --sar 1:1 --colorprim 9 --colormatrix 9 --transfer 16 --range limited --master-display "G(13250,34500)B(7500,3000)R(34000,16000)WP(15635,16450)L(10000000,1)" --hdr --hdr-opt --chromaloc 2 --repeat-headers --hrd --max-cll "0,0" --min-luma 0 --max-luma 1023 --subme 7 --me star --merange 48 --limit-tu 4 --max-merge 4 --limit-modes --limit-refs 3 --rd 6 --rd-refine --no-tskip --rskip --no-cutree --no-sao --no-open-gop --no-b-pyramid --no-strong-intra-smoothing --vbv-bufsize 160000 --vbv-maxrate 160000 --cbqpoffs -2 --crqpoffs -2 --min-keyint 5 --keyint 240 --ipratio 1.30 --pbratio 1.20 --deblock -3:-3 --qcomp 0.65 --aq-mode 2 --aq-strength 0.8 --psy-rd 1.5 --psy-rdoq 5
Imo, the bitrate you are targeting is so high that most of your settings are way overkill. You will spend a lot of CPU cycles for virtually no quality gain; you will be well beyond the point of diminishing returns.

Have you actually tried those settings? It wouldn't surprise me if placebo at crf 12 rendered a file bigger than the original 4K one, and you will be lucky if it runs above 1 fps. I'm sure you have your reasons for this scenario, but if quality was that important to me, I would just go with the original.

redbtn
13th February 2019, 16:54
Imo, the bitrate you are targeting is so high that most of your settings are way overkill. You will spend a lot of CPU cycles for virtually no quality gain; you will be well beyond the point of diminishing returns.

Have you actually tried those settings? It wouldn't surprise me if placebo at crf 12 rendered a file bigger than the original 4K one, and you will be lucky if it runs above 1 fps. I'm sure you have your reasons for this scenario, but if quality was that important to me, I would just go with the original.
I tried these settings, only without (--rd 6 --rd-refine --opt-cu-delta-qp) and with (--rd 3 --no-rskip). For example, a source file with a bitrate of 57 Mb/s and a size of 45 GB gave an output bitrate of 26 Mb/s and a size of 21 GB at 1.4 fps encode speed. I think with the new settings it will be around 0.9-1 fps on my 6-core i5 8400.
I watch on a 4K UHD TV, but it is connected to the computer as a duplicate of a 1080p monitor, because Windows 10 cannot scale many applications properly at 4K (sadly). Therefore I decided there is no point in downscaling the image via MadVR in real time, although I have a GTX 1080 which does it without problems. It also saves about 50 percent of the HDD space.
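The figures quoted here can be cross-checked with a little arithmetic. This Python sketch assumes decimal gigabytes and megabits and a 24000/1001 fps source (the frame rate is an assumption, not stated in this post); the function names are hypothetical.

```python
def duration_seconds(size_gb, bitrate_mbps):
    """Running time implied by a file size (GB) and an average bitrate (Mb/s)."""
    return size_gb * 8e9 / (bitrate_mbps * 1e6)


def size_gb(duration_s, bitrate_mbps):
    """File size in GB for a given running time and average bitrate."""
    return duration_s * bitrate_mbps * 1e6 / 8e9


def encode_hours(duration_s, source_fps, encode_fps):
    """Wall-clock hours needed to encode every frame at encode_fps."""
    return duration_s * source_fps / encode_fps / 3600


SOURCE_FPS = 24000 / 1001                      # assumed film frame rate
runtime = duration_seconds(45, 57)             # 45 GB source at 57 Mb/s
print(f"runtime        : {runtime / 60:.0f} min")
print(f"output at 26   : {size_gb(runtime, 26):.1f} GB")
print(f"encode @1.4 fps: {encode_hours(runtime, SOURCE_FPS, 1.4):.1f} h")
print(f"encode @0.9 fps: {encode_hours(runtime, SOURCE_FPS, 0.9):.1f} h")
```

Under these assumptions the 26 Mb/s output comes to roughly the 21 GB reported, and the encode takes on the order of 30 hours at 1.4 fps or close to two days at 0.9 fps.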

Boulder
13th February 2019, 18:30
You can safely use --rskip, it will speed up things a lot and you won't notice any difference. I'd also use something like --subme 3.

redbtn
13th February 2019, 22:13
Boulder thank you!

This is probably a silly question, but here goes anyway: if I use --hdr-opt, do I need to feed the encoder with 10-bit data or is 16-bit data as good if the source is a standard UHD with HDR? I always process things in 16-bit domain and let the encoder dither down to 10 bits.

I would let x265.exe do the dithering, 'cause other dithering options like the Floyd Steinberg error diffusion may have a nicer look, but they could increase the bitrate required by x265. The built in dithering filter in x265 is supposed to dither everything down to the target bit depth without introducing banding. Blocks and macro blocks dithered by x265 are more likely to be recognised during motion compensation by x265 than the ones dithered using a third party dithering method, therefore compression should be better.

In a nutshell, let x265 do the dithering and always pipe to it the highest bit depth you have, unless you like a specific dithering method and you have enough bitrate.

I'm encoding 10-bit HDR, so if I want to do the same, i.e. pipe 16-bit to x265 with the --dither flag, do I need LWLibavSource format="YUV420P10" or format="YUV420P16"?
And should I remove this? clip = core.resize.Bicubic(clip=clip, format=vs.YUV420P10, range_s="limited")

Thank you!

My VS script:

import vapoursynth as vs
core = vs.core
clip = core.lsmas.LWLibavSource(source="video.mkv", format="YUV420P10", cache=1)
clip = core.resize.Point(clip, matrix_in_s="2020ncl",range_s="limited")
clip = core.std.AssumeFPS(clip, fpsnum=24000, fpsden=1001)
clip = core.std.SetFrameProp(clip=clip, prop="_ColorRange", intval=1)
clip = core.std.CropRel(clip=clip, left=0, right=0, top=276, bottom=276)
clip = core.fmtc.resample(clip=clip, kernel="spline64", w=1920, h=804, interlaced=False, interlacedd=False)
clip = core.resize.Bicubic(clip=clip, format=vs.YUV420P10, range_s="limited")
clip.set_output()
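The crop and resize numbers in the script can be checked with a few lines of Python. This assumes a 3840x2160 source (not stated explicitly in the post) and rounds the result to an even height; the function name `scaled_height` is hypothetical.

```python
def scaled_height(src_w, src_h, crop_top, crop_bottom, dst_w, mod=2):
    """Height that keeps the cropped aspect ratio at the target width."""
    cropped_h = src_h - crop_top - crop_bottom
    exact = cropped_h * dst_w / src_w
    # Round to the nearest multiple of mod (4:2:0 needs an even height)
    return int(round(exact / mod)) * mod


# 276 pixels cropped from top and bottom, then scaled to 1920 wide
print(scaled_height(3840, 2160, 276, 276, 1920))
```

Under that assumption the result matches the h=804 passed to the resampler, so the resize preserves the cropped aspect ratio.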

hevc_enocder
14th February 2019, 01:33
--crf 18 --profile main10 --level-idc 5.1 --output-depth 10 --rd 4 --ctu 32 --amp --aq-mode 2 --vbv-bufsize 160000 --vbv-maxrate 160000 --ipratio 1.3
--pbratio 1.2 --no-cutree --subme 7 --me star --merange 24 --max-merge 3 --bframes 12 --rc-lookahead 60 --lookahead-slices 4 --ref 6 --min-keyint 24 --keyint 240 --deblock -3:-3
--no-sao --no-strong-intra-smoothing --high-tier

Hi guys, I would like to know how to improve my settings. I am quite satisfied, but there is a problem with the red color, and I'd like to know if there is something to change for encoding normal new movies.

Someone told me that there is no reason to use subme 7. I am used to using subme 10 in x264, so I thought it could be useful.

Thank you for your advice and any explanation.

Selur
14th February 2019, 06:20
there is a problem with the red color
You probably did use the search in this thread and already tried using negative chroma offsets (--cbqpoffs and --crqpoffs), right?

RainyDog
14th February 2019, 09:09
--crf 18 --profile main10 --level-idc 5.1 --output-depth 10 --rd 4 --ctu 32 --amp --aq-mode 2 --vbv-bufsize 160000 --vbv-maxrate 160000 --ipratio 1.3
--pbratio 1.2 --no-cutree --subme 7 --me star --merange 24 --max-merge 3 --bframes 12 --rc-lookahead 60 --lookahead-slices 4 --ref 6 --min-keyint 24 --keyint 240 --deblock -3:-3
--no-sao --no-strong-intra-smoothing --high-tier

Hi guys, I would like to know how to improve my settings. I am quite satisfied, but there is a problem with the red color, and I'd like to know if there is something to change for encoding normal new movies.

Someone told me that there is no reason to use subme 7. I am used to using subme 10 in x264, so I thought it could be useful.

Thank you for your advice and any explanation.

In x264, RDO is tied to subme. So higher levels of subme also incorporate higher levels of RDO.

Subme and RD(O) level are separate settings in x265 and I personally don't think it's worth using anything higher than subme 5.

Whenever I've tested, it's never actually been worth it to use anything higher than subme 3 really which is the lowest subme to include chroma residual cost.

I'd definitely choose --rd 5 (or 6 since they're the same) plus --rd-refine over subme 7 every time. Though that will halve your encoding speed so I rarely ever bother with that either :p

excellentswordfight
14th February 2019, 10:35
I tried these settings, only without (--rd 6 --rd-refine --opt-cu-delta-qp) and with (--rd 3 --no-rskip). For example, a source file with a bitrate of 57 Mb/s and a size of 45 GB gave an output bitrate of 26 Mb/s and a size of 21 GB at 1.4 fps encode speed. I think with the new settings it will be around 0.9-1 fps on my 6-core i5 8400.
I watch on a 4K UHD TV, but it is connected to the computer as a duplicate of a 1080p monitor, because Windows 10 cannot scale many applications properly at 4K (sadly). Therefore I decided there is no point in downscaling the image via MadVR in real time, although I have a GTX 1080 which does it without problems. It also saves about 50 percent of the HDD space.
As I said, I'm sure that you have your reasons, but I don't find it that appealing to spend 48h per title to get a file that would probably look indistinguishable at half the bitrate.

I just did a quick test on Tears of Steel with those settings and compared it to an encode at native res with the slow preset (+ --deblock -1:-1 --no-sao --no-strong-intra-smoothing)... the 1080p one ended up at 40 Mbps (downscaled with spline), and the 2160p one at 25; the one at native res was sharper and was about 3x faster to encode...

Ofc, if you find the time/compression/quality to be a good tradeoff for you, go ahead with that. But since you even have a 4K TV I would just go with a 4K workflow, because you can get away with 4K re-encodes at that bitrate target. Not gonna tell you what to do, but I would at least play with different preset levels and crf values and look at the actual video to see if you can find a better "sweetspot". It sounds awful to me to spend 2 days on an encode just to get a bloated file that still has a loss of detail/sharpness (because of downscaling).

LigH
14th February 2019, 14:19
x265 3.0 Au+4-dcbec33bfb0f (https://www.mediafire.com/file/3js85elpouv0cxl/x265_3.0_Au%2B4-dcbec33bfb0f.7z) (MSYS2, MinGW32 + GCC 7.4.0 / MinGW64 + GCC 8.2.1)

Gold!

poller
18th February 2019, 18:14
is there a way to compile the x64 version without avx512?

it only increases the size of the library and i will not use avx512 anyway. for now i'm sticking to v2.7
thanks :)

MeteorRain
18th February 2019, 18:42
is there a way to compile the x64 version without avx512?

Not sure how the size of the library matters, but anyway.

Open \source\common\x86\asm-primitives.cpp, find X265_CPU_AVX512 and note all the avx512 asms.

Remove them from *.asm files, remove the X265_CPU_AVX512 code block, and then try to compile them.

benwaggoner
18th February 2019, 21:25
I just did a quick test on Tears of Steel with those settings and compared it to an encode at native res with the slow preset (+ --deblock -1:-1 --no-sao --no-strong-intra-smoothing)... the 1080p one ended up at 40 Mbps (downscaled with spline), and the 2160p one at 25; the one at native res was sharper and was about 3x faster to encode...

Ofc, if you find the time/compression/quality to be a good tradeoff for you, go ahead with that. But since you even have a 4K TV I would just go with a 4K workflow, because you can get away with 4K re-encodes at that bitrate target. Not gonna tell you what to do, but I would at least play with different preset levels and crf values and look at the actual video to see if you can find a better "sweetspot". It sounds awful to me to spend 2 days on an encode just to get a bloated file that still has a loss of detail/sharpness (because of downscaling).
Yeah, an important thing about 4K encoding is that the pixels are getting near the minimum visible size in typical viewing environments, especially with non-HDR content. So small artifacts that could be visible in 1080p content become MUCH less so at 4K. I've heard a lot of "but 4K will take 4x the bitrate of 1080p", but the frequency distribution gets softer and the latitude for non-perceptible distortion gets greater. Doubling bitrate from 1080p to 2160p is typically sufficient to get value out of the extra pixels without any regressions.

In fact a satisfactory H.264 8-bit 1080p bitrate will generally be a satisfactory HEVC 10-bit 2160p bitrate. The "4K will kill the internets!" panic was badly overblown.
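In bits-per-pixel terms, doubling the bitrate for four times the pixels means each pixel gets half the bits. A small Python illustration, where the 8 and 16 Mb/s bitrates and the 24 fps frame rate are made-up example values and `bits_per_pixel` is a hypothetical helper:

```python
def bits_per_pixel(bitrate_mbps, width, height, fps):
    """Average encoded bits spent per pixel per frame."""
    return bitrate_mbps * 1e6 / (width * height * fps)


hd = bits_per_pixel(8, 1920, 1080, 24)      # example 1080p bitrate
uhd = bits_per_pixel(16, 3840, 2160, 24)    # doubled bitrate at 2160p
print(f"1080p: {hd:.3f} bpp, 2160p: {uhd:.3f} bpp, ratio {uhd / hd:.2f}")
```

The argument above is that this halving is tolerable because the extra pixels carry softer detail and small distortions are harder to see at 4K.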

poller
18th February 2019, 22:31
Not sure how the size of the library matters, but anyway.

Open \source\common\x86\asm-primitives.cpp, find X265_CPU_AVX512 and note all the avx512 asms.

Remove them from *.asm files, remove the X265_CPU_AVX512 code block, and then try to compile them.

that's how i started, but i thought there might be a simpler way.

but it does work, size is only slightly bigger than v2.7 now.

thanks.

benwaggoner
18th February 2019, 23:00
but it does work, size is only slightly bigger than v2.7 now.
I am curious as to why that is worth this effort.

poller
19th February 2019, 18:32
I am curious as to why that is worth this effort.
well, it didn't take really long, maybe 30 minutes.
i need ffmpeg for recording from an emulator, and i don't want to see the ffmpeg.dll being 5 times bigger than the actual emu. :stupid:
so i'm trying to keep it small.


and now... more questions. :p

first, v3.0 is slower and produces bigger output than previous versions. i did read the changelog, but fail to see what could be the reason (yes, noob here). :confused:
/edit: ok, it seems to be aq-mode=2


second, x86 builds by LigH are about 10% (!) faster than mine. is there some magical compiler flag? i tried -O2 and -Ofast... no luck. -Ofast being the slowest of them actually.
x64 builds are at the same speed.

:thanks:

LigH
19th February 2019, 18:52
No magical flags. Just out of media-autobuild suite (more or less, negligible differences in the handling). Do we use the same compiler?

Boulder
19th February 2019, 18:56
v3.0 changed some presets, so maybe the change in speed compared to v2.7 comes from there. I don't know if the x265 docs are helpful in checking out what changed (I doubt it :p)

benwaggoner
19th February 2019, 20:35
v3.0 changed some presets, so maybe the change in speed compared to v2.7 comes from there. I don't know if the x265 docs are helpful in checking out what changed (I doubt it :p)
Specifically, slower got a lot slower, and veryslow is even more very slow. Slower was a nice sweet spot quality/perf step; I think it would have made more sense to add a superslow or something.

Try using slow and seeing if that helps.

poller
19th February 2019, 20:37
No magical flags. Just out of media-autobuild suite (more or less, negligible differences in the handling). Do we use the same compiler?
strange. i now tested with GCC 7.3 from MSYS2 (it seems this is the one you are using), the executable is a lot bigger and even slightly slower than my other builds (GCC 4.9.3).
i am using the default values of x265.exe for testing.
this is beyond me. :confused:


v3.0 changed some presets, so maybe the change in speed compared to v2.7 comes from there. I don't know if the x265 docs are helpful in checking out what changed (I doubt it :p)
as mentioned, if i set --aq-mode=1 speed of v3.0 is pretty much the same as before, also the size of output.
no idea about the quality, there might be some other changes as you say.

benwaggoner
19th February 2019, 21:45
strange. i now tested with GCC 7.3 from MSYS2 (it seems this is the one you are using), the executable is a lot bigger and even slightly slower than my other builds (GCC 4.9.3).
i am using the default values of x265.exe for testing.
this is beyond me. :confused:
By "default values" you would be implicitly using --preset medium. No idea if that is what you want to be using for your scenario.

as mentioned, if i set --aq-mode=1 speed of v3.0 is pretty much the same as before, also the size of output.
no idea about the quality, there might be some other changes as you say.
--aq-mode 2 is the default in 3.0, but was 1 in previous versions. I wouldn't think it would change performance THAT much.

poller
19th February 2019, 22:55
it seems i started some confusion here.

i tested another video, about the same results.

--aq-mode=1
own builds: 40.5 sec
LigH builds: 35.5 sec
so my builds are even more than 10% slower, no matter what preset i use. :( so annoying.
GCC 9.01, GCC 8.2... always the same bad speed.

--aq-mode=2
own builds: 47.5 sec
LigH builds: 40.5 sec
so yes, for MY short low res test videos mode 2 is slower.
that might be different for other input files.
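For reference, the relative differences in these timings work out as follows in Python (only the numbers from this post are used; `percent_slower` is a hypothetical helper):

```python
def percent_slower(slow_seconds, fast_seconds):
    """How much longer the slow run took, as a percentage of the fast run."""
    return (slow_seconds / fast_seconds - 1) * 100


# Own build vs LigH build
print(f"aq-mode 1: {percent_slower(40.5, 35.5):.1f}% slower")
print(f"aq-mode 2: {percent_slower(47.5, 40.5):.1f}% slower")

# aq-mode 2 vs aq-mode 1 on the same build
print(f"own build : {percent_slower(47.5, 40.5):.1f}% slower with aq-mode 2")
print(f"LigH build: {percent_slower(40.5, 35.5):.1f}% slower with aq-mode 2")
```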

MeteorRain
20th February 2019, 09:45
second, x86 builds by LigH are about 10% (!) faster than mine. is there some magical compiler flag? i tried -O2 and -Ofast... no luck. -Ofast being the slowest of them actually.
x64 builds are at the same speed.

:thanks:

Could be some profiling magic. I'm unsure if LigH uses any profiling when compiling the binary though.

I used it once, made it slower. And I haven't used it since, but things might have changed.

Selur
20th February 2019, 10:03
Nope, already answered that he simply uses media-autobuild suite (see:https://forum.doom9.org/showthread.php?p=1866145#post1866145), so no profiling.

Boulder
20th February 2019, 10:31
Some build log could be useful, maybe there is something missing. I'd expect that if assembler was not used, the difference would be much bigger though.

poller
20th February 2019, 21:24
i tried hard with GCC again.

seconds. lower is better. :)

137.0 no assembly
47.0 (default)
45.5 (PGO build) -mtune=ivybridge (default here is -O3 which makes 1st pass PGO .exe crash, thus no better speed i guess)
44.5 (PGO build) -mtune=ivybridge -O2
43.9 (PGO build) -mtune=ivybridge -funroll-loops -finline-functions -ftree-loop-vectorize -O2
39.5 LigH

so i get little improvement with all that fiddling, but still far away from LigH's GCC builds. :(


giving up here, i have no ideas left.
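Put relative to the default build, the list above looks like this. The Python snippet only restates the timings from the post; the shortened build labels are mine.

```python
timings = {
    "no assembly": 137.0,
    "default": 47.0,
    "PGO, -mtune=ivybridge (-O3)": 45.5,
    "PGO, -mtune=ivybridge -O2": 44.5,
    "PGO, -O2 plus loop/inline flags": 43.9,
    "LigH": 39.5,
}

baseline = timings["default"]
for name, seconds in timings.items():
    # Values above 1.00x are faster than the default build
    print(f"{name:<34} {seconds:6.1f} s  {baseline / seconds:5.2f}x")
```

So the best PGO build gains about 7% over the default, LigH's build about 19%, and disabling assembly costs nearly a factor of three.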

LigH
21st February 2019, 00:14
OK, I forgot little details I edited a long time ago, while testing some compiling issues with a faulty compiler version. A leftover string is:

export CXXFLAGS="-march=pentium4 -mtune=generic"

for the 32-bit compilation (which is still quite generic, just a sensible minimum). That might bring a little advantage. For the 64-bit compilation, the CXXFLAGS is empty.

Furthermore, for the 32-bit compilation, assembly is disabled for 10 and 12 bit precision cores, but enabled for the 8 bit core.

WhatZit
21st February 2019, 10:39
Supply --svt in the command line to use the SVT-HEVC encoder.

Never expected that one! :confused:

From http://x265.org/x265-svt-hevc-house/:

With changeset a41325fc854f, the x265 library can invoke the SVT-HEVC library for encoding through the --svt option. We have mapped presets and command-line options supported by the x265 application into the equivalent options of SVT-HEVC, and have added a few specific options that are available only when the SVT-HEVC library is invoked. This page in our documentation describes the steps to build, and invoke the SVT-HEVC library in more detail.

Our reason for this integration was to enable our users to evaluate additional relative trade-offs between performance and compression efficiency while working behind the familiar API of the x265 library. In the long term, we plan to leverage this integration to further improve x265’s ability to handle real-time and low turn-around scenarios in pure software; this is the space that SVT-HEVC was focused on. In parallel, we will continue to innovate on our flagship presets that are used in offline encoding where x265 dominates. You can expect to see these changes in the coming releases of x265, increasing the reach of open-source for video compression!

Am I being cynical to suggest that Multicoreware couldn't achieve such speed optimisations on their own, so they formed this "synergy"?

nevcairiel
21st February 2019, 10:43
Personally I think it's stupid to incorporate another encoder into the x265 "frontend". If one wanted to use different encoders, one would use say ffmpeg, or just use them directly. x265 should be x265, and nothing else. But oh well. Probably some business driving over common sense. :)

shinchiro
21st February 2019, 11:27
Lol, good to know I'm not the only one who thinks x265's decision to include another encoder inside itself is stupid. Well, if there's money involved here, I'm not even surprised

benwaggoner
21st February 2019, 21:32
Lol, good to know I'm not the only one who thinks x265's decision to include another encoder inside itself is stupid. Well, if there's money involved here, I'm not even surprised
From the link, it sounds like the big plan is to start incorporating use of certain SVT-HEVC features/tools within x265. Having a highly accelerated coarse motion search mode could help. Kind of like the OpenGL/CUDA experiments with x264 a while ago.

x265 has a TON of features where it can take input from a first pass and then refine it. Some of those don't require the stream be made with x265, and a few work with H.264 sources IIRC.

Forteen88
22nd February 2019, 09:57
I just wanted to say that I did a little x265 speed-test, one compile vs another,
x265-3.0_Au+7-cb3e172_vs2017-AVX2 (msystem) vs x265-v3.0_Au+7-cb3e172a5f51-SVT-win64 [ICC 1900][MSVC 1916 Multilib][SVT][64 bit].

I encoded a 44 second long cartoon animation, 00096.m2ts, with this setting:
x265.exe --crf 18 --preset veryslow --output-depth 10 --rdoq-level 0 --psy-rdoq 0 --aq-mode 1 --aq-strength 0.4 --qcomp 0.65 --bframes 16 --rc-lookahead 48 --ref 6 --min-keyint 24 --keyint 240 --frame-threads 1 --colormatrix bt709 --deblock -2:-2 --no-sao --psy-rd 0.4 --tskip --tskip-fast --tu-inter 4 --tu-intra 4 --frames 1066


x265-3.0_Au+7-cb3e172_vs2017-AVX2 (msystem) Duration: 00:53:41
x265-v3.0_Au+7-cb3e172a5f51-SVT-win64 [ICC 1900][MSVC 1916 Multilib][SVT][64 bit] Duration: 00:53:32

Not a big difference in speed, considering I have an Intel Core i5-5200U CPU (I thought that the ICC 1900-compile would be much faster).

EDIT: By "much faster", I meant much faster than this encode was, something like 10% faster than the non-ICC compile.
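Converting the two durations to frames per second shows how small the gap is. A short Python check using the 1066 frames from the command line above (`to_seconds` is a hypothetical helper):

```python
def to_seconds(hms):
    """Convert an HH:MM:SS string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


FRAMES = 1066
vs2017 = to_seconds("00:53:41")
icc = to_seconds("00:53:32")

print(f"vs2017 build: {FRAMES / vs2017:.4f} fps")
print(f"ICC build   : {FRAMES / icc:.4f} fps")
gain = (vs2017 / icc - 1) * 100
print(f"ICC build is {gain:.2f}% faster")
```

A difference this small is well within what run-to-run variation on a laptop CPU could produce.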

Selur
22nd February 2019, 12:13
I thought that the ICC 1900-compile would be much faster
To be frank, I would have been surprised if using a different compiler had much of an impact...

poller
22nd February 2019, 15:29
OK, I forgot little details I edited a long time ago, while testing some compiling issues with a faulty compiler version. A leftover string is:

export CXXFLAGS="-march=pentium4 -mtune=generic"

for the 32-bit compilation (which is still quite generic, just a sensible minimum). That might bring a little advantage. For the 64-bit compilation, the CXXFLAGS is empty.

Furthermore, for the 32-bit compilation, assembly is disabled for 10 and 12 bit precision cores, but enabled for the 8 bit core.
well, here not even -march=corei7 did help much.
assembly needs to be disabled for x86 high bit, it does not compile when enabled.


Not a big difference in speed, considering I have an Intel Core i5-5200U CPU (I thought that the ICC 1900-compile would be much faster).

the same here, actually, all x64 builds (from the net) i tested are pretty much on the same level, my own builds included and also the ICC compile.

but i see differences in the x86 builds. but honestly, not many people will use those anyway.

LigH
22nd February 2019, 17:51
One more build to compare, with two variants:

x265 3.0_Au+7-cb3e172a5f51 MABS (https://www.mediafire.com/file/qekmo78ifrr9e9k/x265_3.0_Au+7-cb3e172a5f51_MABS.7z/file) compiled with media-autobuild_suite only (EXE only, no DLL)

x265 3.0_Au+7-cb3e172a5f51 (https://www.mediafire.com/file/bnnonqapsk0gp7m/x265_3.0_Au+7-cb3e172a5f51.7z/file) compiled with custom build scripts to obtain libx265.dll too, running in interactive MinGW32 / MinGW64 shells

poller
22nd February 2019, 22:05
nice, some small test:

x265_3.0_RC+14-46b84ff665fd
20.5 seconds
cpuid=1049583 / frame-threads=3 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 / input-res=352x288 / interlace=0 / total-frames=2101 / level-idc=0 / high-tier=1 / uhd-bd=0 / ref=3 / no-allow-non-conformance / no-repeat-headers / annexb / no-aud / no-hrd / info / hash=0 / no-temporal-layers / open-gop / min-keyint=25 / keyint=250 / gop-lookahead=0 / bframes=4 / b-adapt=0 / b-pyramid / bframe-bias=0 / rc-lookahead=15 / lookahead-slices=0 / scenecut=40 / radl=0 / no-splice / no-intra-refresh / ctu=64 / min-cu-size=8 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 / dynamic-rd=0.00 / no-ssim-rd / signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=3 / no-limit-modes / me=1 / subme=2 / merange=57 / temporal-mvp / weightp / no-weightb / no-analyze-src-pics / deblock=0:0 / sao / no-sao-non-deblock / rd=2 / no-early-skip / rskip / fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / no-splitrd-skip / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=crf / crf=21.0 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / ipratio=1.40 / pbratio=1.30 / aq-mode=2 / aq-strength=1.00 / cutree / zone-count=0 / no-strict-cbr / qg-size=32 / no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=255 / sar-width / : / sar-height=128:117 / overscan=0 / videoformat=5 / range=0 / colorprim=2 / transfer=2 / colormatrix=2 / chromaloc=0 / display-window=0 / max-cll=0,0 / min-luma=0 / max-luma=255 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / no-opt-qp-pps / no-opt-ref-list-length-pps / no-multi-pass-opt-rps / scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / no-hdr / no-hdr-opt / no-dhdr10-opt / no-idr-recovery-sei / analysis-reuse-level=5 / scale-factor=0 / refine-intra=0 / refine-inter=0 / refine-mv=0 / refine-ctu-distortion=0 / no-limit-sao / ctu-info=0 / 
no-lowpass-dct / refine-analysis-type=0 / copy-pic=1 / max-ausize-factor=1.0 / no-dynamic-refine / no-single-sei / no-hevc-aq / qp-adaptation-range=1.00

x265_3.0_Au+7-cb3e172a5f51
20.5 seconds
cpuid=1049583 / frame-threads=3 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 / input-res=352x288 / interlace=0 / total-frames=2101 / level-idc=0 / high-tier=1 / uhd-bd=0 / ref=3 / no-allow-non-conformance / no-repeat-headers / annexb / no-aud / no-hrd / info / hash=0 / no-temporal-layers / open-gop / min-keyint=25 / keyint=250 / gop-lookahead=0 / bframes=4 / b-adapt=0 / b-pyramid / bframe-bias=0 / rc-lookahead=15 / lookahead-slices=0 / scenecut=40 / radl=0 / no-splice / no-intra-refresh / ctu=64 / min-cu-size=8 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 / dynamic-rd=0.00 / no-ssim-rd / signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=3 / no-limit-modes / me=1 / subme=2 / merange=57 / temporal-mvp / weightp / no-weightb / no-analyze-src-pics / deblock=0:0 / sao / no-sao-non-deblock / rd=2 / no-early-skip / rskip / fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / no-splitrd-skip / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=crf / crf=21.0 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / ipratio=1.40 / pbratio=1.30 / aq-mode=2 / aq-strength=1.00 / cutree / zone-count=0 / no-strict-cbr / qg-size=32 / no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=255 / sar-width / : / sar-height=128:117 / overscan=0 / videoformat=5 / range=0 / colorprim=2 / transfer=2 / colormatrix=2 / chromaloc=0 / display-window=0 / max-cll=0,0 / min-luma=0 / max-luma=255 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / no-opt-qp-pps / no-opt-ref-list-length-pps / no-multi-pass-opt-rps / scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / no-hdr / no-hdr-opt / no-dhdr10-opt / no-idr-recovery-sei / analysis-reuse-level=5 / scale-factor=0 / refine-intra=0 / refine-inter=0 / refine-mv=0 / refine-ctu-distortion=0 / no-limit-sao / ctu-info=0 / 
no-lowpass-dct / refine-analysis-type=0 / copy-pic=1 / max-ausize-factor=1.0 / no-dynamic-refine / no-single-sei / no-hevc-aq / no-svt / qp-adaptation-range=1.00

x265_3.0_Au+7-cb3e172a5f51_MABS
23.3 seconds
cpuid=1049583 / frame-threads=3 / numa-pools=8 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 / input-res=352x288 / interlace=0 / total-frames=2101 / level-idc=0 / high-tier=1 / uhd-bd=0 / ref=3 / no-allow-non-conformance / no-repeat-headers / annexb / no-aud / no-hrd / info / hash=0 / no-temporal-layers / open-gop / min-keyint=25 / keyint=250 / gop-lookahead=0 / bframes=4 / b-adapt=0 / b-pyramid / bframe-bias=0 / rc-lookahead=15 / lookahead-slices=0 / scenecut=40 / radl=0 / no-splice / no-intra-refresh / ctu=64 / min-cu-size=8 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 / dynamic-rd=0.00 / no-ssim-rd / signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=3 / no-limit-modes / me=1 / subme=2 / merange=57 / temporal-mvp / weightp / no-weightb / no-analyze-src-pics / deblock=0:0 / sao / no-sao-non-deblock / rd=2 / no-early-skip / rskip / fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / no-splitrd-skip / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=crf / crf=21.0 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / ipratio=1.40 / pbratio=1.30 / aq-mode=2 / aq-strength=1.00 / cutree / zone-count=0 / no-strict-cbr / qg-size=32 / no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=255 / sar-width / : / sar-height=128:117 / overscan=0 / videoformat=5 / range=0 / colorprim=2 / transfer=2 / colormatrix=2 / chromaloc=0 / display-window=0 / max-cll=0,0 / min-luma=0 / max-luma=255 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / no-opt-qp-pps / no-opt-ref-list-length-pps / no-multi-pass-opt-rps / scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / no-hdr / no-hdr-opt / no-dhdr10-opt / no-idr-recovery-sei / analysis-reuse-level=5 / scale-factor=0 / refine-intra=0 / refine-inter=0 / refine-mv=0 / refine-ctu-distortion=0 / no-limit-sao 
/ ctu-info=0 / no-lowpass-dct / refine-analysis-type=0 / copy-pic=1 / max-ausize-factor=1.0 / no-dynamic-refine / no-single-sei / no-hevc-aq / no-svt / qp-adaptation-range=1.00

my own build
22.6 seconds
cpuid=1049583 / frame-threads=3 / wpp / no-pmode / no-pme / no-psnr / no-ssim / log-level=2 / input-csp=1 / input-res=352x288 / interlace=0 / total-frames=2101 / level-idc=0 / high-tier=1 / uhd-bd=0 / ref=3 / no-allow-non-conformance / no-repeat-headers / annexb / no-aud / no-hrd / info / hash=0 / no-temporal-layers / open-gop / min-keyint=25 / keyint=250 / gop-lookahead=0 / bframes=4 / b-adapt=0 / b-pyramid / bframe-bias=0 / rc-lookahead=15 / lookahead-slices=0 / scenecut=40 / radl=0 / no-splice / no-intra-refresh / ctu=64 / min-cu-size=8 / no-rect / no-amp / max-tu-size=32 / tu-inter-depth=1 / tu-intra-depth=1 / limit-tu=0 / rdoq-level=0 / dynamic-rd=0.00 / no-ssim-rd / signhide / no-tskip / nr-intra=0 / nr-inter=0 / no-constrained-intra / strong-intra-smoothing / max-merge=2 / limit-refs=3 / no-limit-modes / me=1 / subme=2 / merange=57 / temporal-mvp / weightp / no-weightb / no-analyze-src-pics / deblock=0:0 / sao / no-sao-non-deblock / rd=2 / no-early-skip / rskip / fast-intra / no-tskip-fast / no-cu-lossless / no-b-intra / no-splitrd-skip / rdpenalty=0 / psy-rd=2.00 / psy-rdoq=0.00 / no-rd-refine / no-lossless / cbqpoffs=0 / crqpoffs=0 / rc=crf / crf=21.0 / qcomp=0.60 / qpstep=4 / stats-write=0 / stats-read=0 / ipratio=1.40 / pbratio=1.30 / aq-mode=2 / aq-strength=1.00 / cutree / zone-count=0 / no-strict-cbr / qg-size=32 / no-rc-grain / qpmax=69 / qpmin=0 / no-const-vbv / sar=255 / sar-width / : / sar-height=128:117 / overscan=0 / videoformat=5 / range=0 / colorprim=2 / transfer=2 / colormatrix=2 / chromaloc=0 / display-window=0 / max-cll=0,0 / min-luma=0 / max-luma=255 / log2-max-poc-lsb=8 / vui-timing-info / vui-hrd-info / slices=1 / no-opt-qp-pps / no-opt-ref-list-length-pps / no-multi-pass-opt-rps / scenecut-bias=0.05 / no-opt-cu-delta-qp / no-aq-motion / no-hdr / no-hdr-opt / no-dhdr10-opt / no-idr-recovery-sei / analysis-reuse-level=5 / scale-factor=0 / refine-intra=0 / refine-inter=0 / refine-mv=0 / refine-ctu-distortion=0 / no-limit-sao / ctu-info=0 / 
no-lowpass-dct / refine-analysis-type=0 / copy-pic=1 / max-ausize-factor=1.0 / no-dynamic-refine / no-single-sei / no-hevc-aq / qp-adaptation-range=1.00


The MABS build has an additional setting (numa-pools=8), but that did not affect the performance.

This was tested on an i7-3770K.
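For anyone who wants to compare such settings dumps without reading them by eye, here is a rough Python sketch. It assumes the " / "-separated format shown above; the function names parse_settings and diff_settings are invented for this example, and the two strings are shortened excerpts of the dumps above, not the full lists. The timings are the ones posted above, with 2101 frames taken from total-frames.

```python
import re

def parse_settings(info: str) -> dict:
    """Split a ' / '-separated settings dump into {name: value}.

    Tokens without '=' (e.g. 'wpp', 'no-svt') are stored as True.
    Line breaks inside the dump are treated like ordinary whitespace.
    """
    settings = {}
    for token in re.split(r"\s+/\s+", info.strip()):
        token = token.strip(" /")
        if not token:
            continue
        name, sep, value = token.partition("=")
        settings[name] = value if sep else True
    return settings

def diff_settings(a: dict, b: dict) -> dict:
    """Return {name: (value_in_a, value_in_b)} for every differing entry."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }

# Shortened excerpts of two of the dumps above (not the complete strings).
rc14 = "cpuid=1049583 / frame-threads=3 / wpp / no-pmode / no-hevc-aq / qp-adaptation-range=1.00"
mabs = ("cpuid=1049583 / frame-threads=3 / numa-pools=8 / wpp / no-pmode / "
        "no-hevc-aq / no-svt / qp-adaptation-range=1.00")

changed = diff_settings(parse_settings(rc14), parse_settings(mabs))
print(changed)

# Encode times from the test above, converted to frames per second.
TOTAL_FRAMES = 2101
seconds = {
    "x265_3.0_RC+14": 20.5,
    "x265_3.0_Au+7": 20.5,
    "x265_3.0_Au+7_MABS": 23.3,
    "my own build": 22.6,
}
fps = {name: TOTAL_FRAMES / s for name, s in seconds.items()}
for name, value in fps.items():
    print(f"{name}: {value:.1f} fps")
```

On the excerpts, the diff reports only numa-pools and no-svt; whether either of them explains the timing spread is not something this sketch can tell.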

LigH
22nd February 2019, 23:51
What you may not find here are default GNU C/C++ compiler options.

Please note that MABS scripts may set up some specific CFLAGS and CXXFLAGS (e.g. -O2 or -O3?). The interactive MinGW consoles should not ... so GCC / G++ defaults may apply, except for the 32-bit build, where I explicitly set CXXFLAGS to fairly generic options that are a sensible minimum for 32-bit code on any AMD64-capable CPU (see above).

I have no clue what I may do "right".

benwaggoner
23rd February 2019, 00:32
To be frank, I would have been surprised if using a different compiler had much of an impact...
It seems we've seen compilers make about a 10% difference from slowest to fastest. Which is kinda surprising to me given all the hand-tuned assembly that doesn't get compiled.

LigH
23rd February 2019, 00:38
With this amount, the only reason I could imagine is memory alignment...

FranceBB
23rd February 2019, 01:17
Since everyone was concerned with x64 platforms and nobody used x86, I tested it on a real x86 platform running Windows Server 2003 x86 with PAE and 16 GB of RAM.
The CPU is an old, dusty Intel Xeon (4 cores / 8 threads) running at 2.60 GHz with instruction sets up to SSE4.2:

4/N.A) - x265 3.0_Au+7 - MABS compiled by LigH with media-autobuild_suite only (EXE only, no DLL)

It didn't even start, due to missing kernel calls: GetNumaNodeProcessorMaskEx, InitializeConditionVariable, SetThreadGroupAffinity, SleepConditionVariableCS, WakeAllConditionVariable
No luck on Windows Server 2003, so it won't run on XP and its derivatives either.

3) - x265 3.0_Au+7 - compiled by LigH with custom build scripts to obtain libx265.dll too, running in interactive MinGW32 / MinGW64 shells

3.7fps/3.9fps

2) - x265 3.0_Au+7 - compiled with GCC9 (Preview) target SSE4.2

4.2fps/4.3fps

1) - x265 3.0_Au+7 - compiled with GCC8 target SSE4.2

4.7fps/4.8fps

Very basic, low-complexity command line:
x265.exe --y4m - --dither --preset medium --level 5.0 --tune fastdecode --no-high-tier --ref 2 --rc-lookahead 3 -b 2 --profile main10 --bitrate 25000 --deblock -4:-4 --min-luma 64 --max-luma 940 --chromaloc 2 --range limited --videoformat component --colorprim bt709 --transfer bt709 --colormatrix bt709 --overscan show --no-open-gop --min-keyint 1 --keyint 24 --repeat-headers --rd 3 --vbv-maxrate 25000 --vbv-bufsize 25000 --asm=sse4.2 --wpp -o "\\VBOXSVR\Share_Windows_Linux\raw_video.hevc"

Lossless 16bit SD (UHD SDR downscaled) footage.


Anyway, I don't think the comparison is fair, because LigH targeted pentium4, which means only instruction sets up to SSE2 are supported.
In other words, I'm comparing SSE4.2 vs SSE2, and it's pretty clear that SSE4.2 has an advantage over SSE2.
As for GCC9, it seems that they changed something in the way -mtune behaves, or maybe they changed something else; anyway, it produces an SSE4.2 build that is slower than the GCC8 SSE4.2 one.
It would be interesting to find out how ICC targeting SSE4.2 behaves on old Intel x86 systems (if Intel Parallel Studio can produce a Windows Server 2003 compatible binary).
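As a rough way to read the fps figures above, the following Python sketch averages the two values given for each build and expresses them relative to the slowest one. Treating each "x/y" pair as two measurements that can be averaged is an assumption made here, since the post does not say what the two numbers are; the dictionary labels are shortened from the list above.

```python
from statistics import mean

# fps pairs as posted above; what the two values in each pair
# stand for is not stated, so they are simply averaged (assumption).
results = {
    "LigH custom scripts (pentium4)": (3.7, 3.9),
    "GCC9 (Preview) target SSE4.2": (4.2, 4.3),
    "GCC8 target SSE4.2": (4.7, 4.8),
}

average = {name: mean(pair) for name, pair in results.items()}
baseline = min(average.values())  # slowest build is the reference

relative = {name: value / baseline for name, value in average.items()}
for name, ratio in relative.items():
    print(f"{name}: {average[name]:.2f} fps, {(ratio - 1) * 100:+.1f}% vs slowest")

# GCC8 compared directly with GCC9, both targeting SSE4.2.
gcc8_vs_gcc9 = average["GCC8 target SSE4.2"] / average["GCC9 (Preview) target SSE4.2"]
print(f"GCC8 vs GCC9: {(gcc8_vs_gcc9 - 1) * 100:+.1f}%")
```

Under that averaging assumption, the GCC8 build comes out about 25% ahead of the pentium4-targeted build and roughly 12% ahead of the GCC9 build.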