Why doesn't x265 use the GPU in encoding? [Archive]

Winston_Smith_101

25th October 2017, 08:47

I know it's not a new topic, but is it conceivable that x265 could take advantage of the immense computing power of modern graphics cards? I'm not a programmer and certainly not the first one to come up with this idea, but I'd like to know why this approach hasn't been tried.

Atak_Snajpera

25th October 2017, 11:35

Because algorithms used in x265 are not suitable for OpenCL.

easyfab

25th October 2017, 17:02

And IIRC data transfer is a bottleneck.

You can spend more time transferring data between CPU and GPU than doing the operation on the CPU directly.

mariush

25th October 2017, 19:40

I'm wondering if it would be too hard to reserve 2-3 GB of video card memory and fill it up with video frames and then do colorspace conversions, motion estimations and stuff like that on multiple frames simultaneously (ex instead of splitting one frame over hundreds of "mini-cores", preload the card with 100 frames and have a few cores working on each frame simultaneously... unless the random memory reads would kill the performance).

Unless I'm wrong, x265 does a lot of "early skips" and the defaults aren't as exhaustive or high quality as possible (in order to increase encoding speed) ... provided the user is willing to buffer loads of frames and has enough memory for that, wouldn't using video cards be beneficial?

benwaggoner

25th October 2017, 22:23

I'm wondering if it would be too hard to reserve 2-3 GB of video card memory and fill it up with video frames and then do colorspace conversions, motion estimations and stuff like that on multiple frames simultaneously (ex instead of splitting one frame over hundreds of "mini-cores", preload the card with 100 frames and have a few cores working on each frame simultaneously... unless the random memory reads would kill the performance).

Absolutely; preprocessing filters are largely linear stuff (output of one filter is input to the next), and are very well suited for GPU. Encoding, particularly in advanced codecs like HEVC, has a whole lot of tight feedback loops where having stuff in the L1 cache is great. Just the round-trip latency from main memory to GPU and back again slows things down way too much. And lots of choices impact other choices in other parts of the same frame. Parallelizing that is really hard. One very fast core with very low memory latency can do a lot more than a single GPU "core." And while GPUs are good at processing a whole lot of bytes at once, AVX2 and AVX-512 offer similar capabilities in CPUs.

The latest Intel CPUs are really incredible devices for encoding HEVC. If anything, codecs are becoming less well suited for GPU versus CPU as they become more complex. MPEG-2 on GPU was pretty trivial, H.264 could fall short in quality at lower bitrates, and HEVC on GPU simply hasn't ever demonstrated high quality with high compression efficiency.

I don't think GPUs are going to be viable in the foreseeable future. FPGA seems like the most viable alternative to CPU.

RanmaCanada

25th October 2017, 23:24

There is a fork of it that does exactly this https://bitbucket.org/vovagubin/x265-hevc-opencl-or-cuda-encoder but people here who have compiled it and tried it have said it is no faster than using a CPU. I personally have no clue how to do such "magic" so I can't comment on that. I'm a 100% hardware person, so asking me to compile something you may as well be telling me to do rocket surgery.

nevcairiel

25th October 2017, 23:31

x264 got some OpenCL features, but from what I remember they barely help speed at all, and a more complex codec like HEVC will likely benefit even less. So overall, what benwaggoner said - modern codecs are too complex for that. You can of course do any pre-filtering on the GPU, if required, but that's really outside of the actual encoding process anyway.

x265_Project

26th October 2017, 05:31

x264 got some OpenCL features, but from what I remember they barely help speed at all, and a more complex codec like HEVC will likely benefit even less. So overall, what benwaggoner said - modern codecs are too complex for that. You can of course do any pre-filtering on the GPU, if required, but that's really outside of the actual encoding process anyway.
Yes, and we wrote the GPU acceleration for x264. GPU computing (heterogeneous computing) was the core competency that MulticoreWare was founded on. We even make the framework that some semiconductor companies use as their OpenCL (or similar heterogeneous computing) developer's API.

Traditional "GPU Computing" involves offloading tasks to the GPU. The GPU needs to be able to complete these tasks faster than they would be completed in the CPU, so that other tasks that depend on the result of the first task don't have to wait for the first task to complete. GPU cores (shaders, stream processors, EUs, or whatever you want to call them) are slower than CPU cores, but typically there are many times more GPU cores available. So single-threaded work doesn't accelerate well on a GPU. You need highly parallelizable functions. You need tasks that are sufficiently large (not small units of work). And you need tasks that don't have serial dependencies.

Video encoding involves LOTS of serial dependencies. The block of pixels I'm trying to encode now has to reference neighboring blocks above and to the left, and we are also searching for references in previously encoded frames. All of those blocks and frames must be finished encoding before we can start encoding the block we are trying to encode now. The units of work involved in each task are relatively small. If I want to offload that work to a GPU that is sitting across a PCI bus, it will take 1 millisecond to send the work to the GPU, and 1 millisecond to get the result back (the latency of a PCI bus). 1 millisecond is an eternity... 2.5 million CPU clock cycles, typically.
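The latency arithmetic in the post above can be checked with a few lines of Python. This is only a rough sketch: the 2.5 GHz clock is an assumption chosen to match the "2.5 million cycles" figure, and `offload_pays_off` and the task timings are hypothetical, not measurements from x265.

```python
# Assumptions for illustration: a 2.5 GHz CPU clock and the 1 ms one-way
# transfer latency quoted in the post above.
CPU_CLOCK_HZ = 2.5e9
ONE_WAY_LATENCY_S = 1e-3


def cycles_lost(latency_s, clock_hz=CPU_CLOCK_HZ):
    """CPU clock cycles that elapse while waiting latency_s seconds."""
    return round(latency_s * clock_hz)


def offload_pays_off(cpu_time_s, gpu_time_s, one_way_latency_s=ONE_WAY_LATENCY_S):
    """True if sending the task out and back beats running it on the CPU."""
    return gpu_time_s + 2 * one_way_latency_s < cpu_time_s


if __name__ == "__main__":
    print(cycles_lost(ONE_WAY_LATENCY_S))      # one-way trip
    print(cycles_lost(2 * ONE_WAY_LATENCY_S))  # round trip
    # Hypothetical small task: 50 us on the CPU, 5 us on the GPU.
    print(offload_pays_off(50e-6, 5e-6))
    # Hypothetical large task: 20 ms on the CPU, 2 ms on the GPU.
    print(offload_pays_off(20e-3, 2e-3))
```

Under these assumed numbers, the small task loses even though the GPU computes it ten times faster, because the round trip dominates; only the large task comes out ahead.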

When people talk about a GPU encoder, they may really be talking about a fixed function (hardware) encoder that is part of a graphics chip. But that's not software running in the GPU - that's a hardware encoder.

There are tricks we can use to work around some of these issues, but there are also many ways we can continue to accelerate x265, and we are looking at all of them.

Winston_Smith_101

26th October 2017, 08:59

Thank you so much for your detailed information! Now I understand better why the GPU is not used.

hajj_3

26th October 2017, 23:15

When people talk about a GPU encoder, they may really be talking about a fixed function (hardware) encoder that is part of a graphics chip. But that's not software running in the GPU - that's a hardware encoder.

There are tricks we can use to work around some of these issues, but there are also many ways we can continue to accelerate x265, and we are looking at all of them.

Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?

Zebulon84

27th October 2017, 01:04

If I want to offload that work to a GPU that is sitting across a PCI bus, it will take 1 millisecond to send the work to the GPU, and 1 millisecond to get the result back (the latency of a PCI bus). 1 millisecond is an eternity... 2.5 million CPU clock cycles, typically.

Does the iGPU or APU have the same latency when calling GPU functions?
Do they use the PCI interface despite being on the same chip?
If so, is the newly announced Ryzen+Vega APU, which uses Infinity Fabric to link CPU and GPU, going to improve things? Can this bring improvements to video encoding, or are the tasks still too small to benefit from this lower latency?

littlepox

27th October 2017, 08:29

Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?

The only possible way I can think of is, when you have multiple files to encode, to do some with x265 and the rest using the hardware counterpart.

mariush

27th October 2017, 09:42

How about a scenario where, for example, the user wants to apply a filter like resize on some content...?

For example, the encoder can take in 4K footage and produce 1080p content. So the encoder starts using the CPU and is busy encoding at 5-10 fps, and while this happens it could go ahead a few hundred frames and "upload" 100 frames or so into the video card to have resizing done on the card instead of using the processor. So by going ahead, by the time the encoder has processed enough frames on the CPU to reach that offset, the video card should already be done with the job and the resized frames should already be back in regular computer memory... and repeat the process with another batch of frames, or just upload frame by frame as frames are resized and transferred back into regular memory in some "memory pool" / buffer.

... not sure if there's any time saved considering you have to "upload" a few GB into the card, wait for the card to process the frames, "download" the frames ...

Would there be a noticeable quality difference in the resize algorithms done on CPU and on graphics card?

Or maybe this should not happen in the encoder but rather in the frame server or render that passes the frames to the encoder...
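The read-ahead scheme described in this post is essentially a bounded producer/consumer pipeline. A minimal sketch follows, with the caveat that everything in it is hypothetical: `resize_stage` stands in for the GPU, `halve` and `toy_encode` are toy placeholders rather than real resize or encode calls, and `BATCH_AHEAD` is an arbitrary buffer depth.

```python
import queue
import threading

BATCH_AHEAD = 4  # hypothetical read-ahead depth (the post suggests ~100 frames)


def resize_stage(frames, out_q, resize):
    """Producer: resize frames ahead of the encoder; stands in for the GPU."""
    for frame in frames:
        out_q.put(resize(frame))  # blocks while the buffer is full
    out_q.put(None)               # sentinel: no more frames


def encode_all(frames, resize, encode):
    """Consumer: encode resized frames in order as they become available."""
    buf = queue.Queue(maxsize=BATCH_AHEAD)
    worker = threading.Thread(target=resize_stage, args=(frames, buf, resize))
    worker.start()
    encoded = []
    while (frame := buf.get()) is not None:
        encoded.append(encode(frame))
    worker.join()
    return encoded


def halve(frame):
    """Toy "resize": keep every second sample of a 1-D frame."""
    return frame[::2]


def toy_encode(frame):
    """Toy "encode": reduce the frame to a single number."""
    return sum(frame)


if __name__ == "__main__":
    source = [[i] * 8 for i in range(10)]
    print(encode_all(source, halve, toy_encode))
```

The bounded queue is what keeps the resizer only a fixed number of frames ahead of the encoder, so memory use stays capped however long the source is.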

nevcairiel

27th October 2017, 09:47

We've already talked about pre-processing/pre-filtering frames a few posts above, which would include resizing. Those are things GPUs are good at. But it's not really the encoder's job; you can do that before the frame gets to the actual encoder just fine.

Atak_Snajpera

27th October 2017, 13:02

How about a scenario where, for example, the user wants to apply a filter like resize on some content...?

For example, the encoder can take in 4K footage and produce 1080p content. So the encoder starts using the CPU and is busy encoding at 5-10 fps, and while this happens it could go ahead a few hundred frames and "upload" 100 frames or so into the video card to have resizing done on the card instead of using the processor. So by going ahead, by the time the encoder has processed enough frames on the CPU to reach that offset, the video card should already be done with the job and the resized frames should already be back in regular computer memory... and repeat the process with another batch of frames, or just upload frame by frame as frames are resized and transferred back into regular memory in some "memory pool" / buffer.

... not sure if there's any time saved considering you have to "upload" a few GB into the card, wait for the card to process the frames, "download" the frames ...

Would there be a noticeable quality difference in the resize algorithms done on CPU and on graphics card?

Or maybe this should not happen in the encoder but rather in the frame server or render that passes the frames to the encoder...

Resizing is already very fast on CPU. I doubt that you will notice any difference in encoding speed. Other tasks, such as denoising (KNLMeansCL), are very slow even on high-end graphics cards like the GTX 1080 (~10 fps with a 1920x1080 frame).

benwaggoner

27th October 2017, 19:11

Resizing is already very fast on CPU. I doubt that you will notice any difference in encoding speed. Other tasks, such as denoising (KNLMeansCL), are very slow even on high-end graphics cards like the GTX 1080 (~10 fps with a 1920x1080 frame).
Well, doing a linear light floating point scale from 4K down to 400x224 using an area scaler is still pretty slow. And that's helpful when doing big compression ratios with HDR content where you don't want any aliasing and want to preserve specular highlights as best as possible.

Good old integer 8-bit bicubic scaling is pretty trivial now, certainly.

x265_Project

27th October 2017, 21:48

Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?

Yes, and we're working to develop such a hybrid, but right now it falls outside what we would push into x265 (as it involves a whole bunch of extra code to run the hardware encoder, get the analysis from the hardware encoder, format the analysis, such that x265 can use it as a starting point for a software encode). The trouble is that the quality of analysis coming from hardware encoders is not great. That will improve over time.

iwod

30th October 2017, 08:18

People will need to learn about Amdahl's law:

https://en.wikipedia.org/wiki/Amdahl%27s_law

Simply put, speedup is limited by the total time needed for the sequential (serial) part of the program. For 10 hours of computing, if we can parallelize 9 hours of computing and 1 hour cannot be parallelized, then our maximum speedup is limited to 10x.
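The 10-hour example can be worked through directly from the formula on the linked page, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of workers. A small sketch (the worker counts are arbitrary illustrations):

```python
def amdahl_speedup(parallel_fraction, workers):
    """Overall speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def max_speedup(parallel_fraction):
    """Upper bound on speedup as the number of workers grows without limit."""
    return 1.0 / (1.0 - parallel_fraction)


if __name__ == "__main__":
    p = 9 / 10  # 9 of the 10 hours can be parallelized
    for n in (2, 10, 28, 4096):
        print(n, round(amdahl_speedup(p, n), 2))
    print("limit", round(max_speedup(p), 2))
```

Even with thousands of workers, the result only approaches the 10x ceiling set by the one serial hour.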

And as explained, there isn't that much you can parallelize in a complex codec like HEVC.

A GPU isn't great at complex sequential processing, and possibly never will be. But it will continue to do more simple processing in parallel, which is why GPU performance scales linearly with more transistors (until it eventually hits a bottleneck elsewhere, like memory bandwidth).

If the encoder is to be parallelized further, I'd assume having more CPU cores will be better than doing it on the GPU. And you will have AMD to thank for that.

soresu

4th November 2017, 03:23

Still confuses me how SIMD on a CPU (AVX) is fine but GPU SIMD is bad, aren't they both intrinsically parallel in nature?

Would a general video encoding accelerator ever be possible, or would processes such as RDO and trellis and so forth be too specific to each codec to generalise for future proofing?

Somewhat like hardware accelerated ray tracing this seems to be an area with little commercial development going forward right now in the age of AI/inference accelerators.

Atak_Snajpera

4th November 2017, 13:29

Still confuses me how SIMD on a CPU (AVX) is fine but GPU SIMD is bad, aren't they both intrinsically parallel in nature?
When you use CPU instructions you do not have to do extra copying of data from one memory to another (RAM <-> VRAM). This kills performance according to x265_Project.

BTW, why doesn't x265 use OpenCL for the lookahead part like x264 does?

x265_Project

5th November 2017, 18:29

Still confuses me how SIMD on a CPU (AVX) is fine but GPU SIMD is bad, aren't they both intrinsically parallel in nature?
GPU processors have a very simple design, with reduced instruction sets. They run at 2 to 3x slower clock speeds than CPU processor cores. However, there can be many more of them (thousands) in a single chip. For example... https://instinct.radeon.com/en/product/mi/radeon-instinct-mi25/ has 4096 stream processors, and https://www.nvidia.com/en-us/design-visualization/products/titan-xp/ has 3840 CUDA cores. Compare this with https://ark.intel.com/products/120496/Intel-Xeon-Platinum-8180-Processor-38_5M-Cache-2_50-GHz, a $10,000 server processor with 28 cores. Each CPU core is much more sophisticated, occupying a far larger area of the processor die. CPU cores have deep instruction pipelines, with advanced instruction queues and branch prediction. CPU cores have multiple integer and floating point registers (where the work gets done), supporting SIMD instruction sets that can perform operations on as much as 512 bits of data in a single instruction (64 bytes, or 32 sixteen-bit samples when you're encoding 10 or 12 bit video).
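The "64 bytes, or 32 sixteen-bit samples" figure is just register width divided by sample width. A small sketch of that arithmetic, assuming (as the post implies) that 10- and 12-bit video is held in 16-bit storage; the helper names are made up for illustration:

```python
def storage_bits(bit_depth):
    """Bits used to hold one sample: a byte for 8-bit video, else 16 bits."""
    return 8 if bit_depth <= 8 else 16


def lanes(register_bits, bit_depth):
    """Number of samples one SIMD register of register_bits can hold."""
    return register_bits // storage_bits(bit_depth)


if __name__ == "__main__":
    for name, width in (("AVX2", 256), ("AVX-512", 512)):
        for depth in (8, 10, 12):
            print(name, f"{depth}-bit video:", lanes(width, depth), "samples")
```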

Would a general video encoding accelerator ever be possible, or would processes such as RDO and trellis and so forth be too specific to each codec to generalise for future proofing?
Yes, and no. If you look at the fundamental operations in a video encoder, there are many operations that take many clock cycles. For example, we frequently need to calculate the distortion between a block of source pixels and the candidate encoded block. Sometimes we calculate the Mean Squared Error (MSE), and sometimes we calculate the Sum of Absolute Differences (SAD), or the Sum of Absolute Transformed Differences (SATD). There are many variations of this operation for different block sizes (8x8, 16x16, 8x16, 16x8, etc.)... but we can often combine the 8x8 calculation multiple times to get the result for larger block sizes and shapes. These calculations take many instructions to complete (you can count them in our SIMD optimized kernels in the x265 source code). So, for example, if a processor had a single instruction that could return the SATD of two 8x8 blocks of pixels in one or two clock cycles, this would accelerate x265. Processor manufacturers understand these things, but they usually provide instructions that operate at a slightly lower level of granularity, so that a wider range of software algorithms can benefit. If you really want to understand this, check out an online course on SIMD optimization, and then take a look at our SIMD optimized kernels in x265 (x265\source\common\x86).
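To make the distortion metrics above concrete, here is a plain-Python sketch of SAD and SATD, plus the "combine 8x8 results" trick for a 16x16 SAD. It is only an illustration and not x265's code: the function names are invented, the Hadamard transform is left unnormalized (production kernels may scale the result), and real encoders do this in hand-written SIMD rather than Python loops.

```python
def sad(a, b):
    """Sum of Absolute Differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def hadamard_1d(v):
    """Unnormalized fast Walsh-Hadamard transform; len(v) must be a power of 2."""
    v = list(v)
    h = 1
    while h < len(v):
        for i in range(0, len(v), 2 * h):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return v


def satd(a, b):
    """Sum of absolute values of the 2-D Hadamard transform of (a - b)."""
    diff = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    rows = [hadamard_1d(r) for r in diff]        # transform each row
    cols = [hadamard_1d(c) for c in zip(*rows)]  # then each column
    return sum(abs(x) for c in cols for x in c)


def sub_block(block, top, left, size=8):
    """Extract a size x size sub-block starting at (top, left)."""
    return [row[left:left + size] for row in block[top:top + size]]


def sad_16x16_from_8x8(a, b):
    """16x16 SAD assembled from the four 8x8 SADs it contains."""
    return sum(sad(sub_block(a, t, l), sub_block(b, t, l))
               for t in (0, 8) for l in (0, 8))


if __name__ == "__main__":
    src = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
    rec = [[(r * 7 + c * 3) % 256 for c in range(16)] for r in range(16)]
    print(sad(src, rec) == sad_16x16_from_8x8(src, rec))
    print(satd(sub_block(src, 0, 0), sub_block(rec, 0, 0)))
```

For SAD the sum of the four 8x8 results is exactly the 16x16 result, which is why one fast 8x8 primitive can serve many block sizes.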

Blue_MiSfit

11th November 2017, 05:04

Thanks for all the awesome, super detailed feedback, x265_Project! Always great to see you guys active on here.

D3X

13th February 2022, 23:17

There is a fork of it that does exactly this https://bitbucket.org/vovagubin/x265-hevc-opencl-or-cuda-encoder but people here who have compiled it and tried it have said it is no faster than using a CPU. I personally have no clue how to do such "magic" so I can't comment on that. I'm a 100% hardware person, so asking me to compile something you may as well be telling me to do rocket surgery.

Where on earth can I find a download for this?
It's like it vanished from the face and fate of the world ^^
Thanks