Old 25th October 2017, 08:47   #1  |  Link
Winston_Smith_101
Registered User
 
Join Date: Sep 2016
Posts: 16
Why doesn't x265 use the GPU in encoding?

I know it's not a new topic, but would it be conceivable for x265 to be supported by the immense computing power of modern graphics cards? I'm not a programmer and certainly not the first to come up with this idea, but I'm interested to know why this approach hasn't been tried.
Winston_Smith_101 is offline   Reply With Quote
Old 25th October 2017, 11:35   #2  |  Link
Atak_Snajpera
RipBot264 author
 
Atak_Snajpera's Avatar
 
Join Date: May 2006
Location: Poland
Posts: 7,806
Because the algorithms used in x265 are not suitable for OpenCL.
Atak_Snajpera is offline   Reply With Quote
Old 25th October 2017, 17:02   #3  |  Link
easyfab
Registered User
 
Join Date: Jan 2002
Posts: 332
And IIRC data transfer is a bottleneck.

You can spend more time transferring data between the CPU and GPU than doing the operation directly on the CPU.
easyfab is offline   Reply With Quote
Old 25th October 2017, 19:40   #4  |  Link
mariush
Registered User
 
Join Date: Dec 2008
Posts: 589
I'm wondering if it would be too hard to reserve 2-3 GB of video card memory, fill it up with video frames, and then do colorspace conversions, motion estimation and the like on multiple frames simultaneously (e.g. instead of splitting one frame over hundreds of "mini-cores", preload the card with 100 frames and have a few cores working on each frame simultaneously... unless the random memory reads would kill the performance).

Unless I'm wrong, x265 does a lot of "early skips" and the defaults aren't as exhaustive or high quality as possible (in order to increase encoding speed) ... provided the user is willing to buffer loads of frames and has enough memory for that, wouldn't using video cards be beneficial?
mariush is offline   Reply With Quote
Old 25th October 2017, 22:23   #5  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,750
Quote:
Originally Posted by mariush View Post
I'm wondering if it would be too hard to reserve 2-3 GB of video card memory, fill it up with video frames, and then do colorspace conversions, motion estimation and the like on multiple frames simultaneously (e.g. instead of splitting one frame over hundreds of "mini-cores", preload the card with 100 frames and have a few cores working on each frame simultaneously... unless the random memory reads would kill the performance).
Absolutely; preprocessing filters are largely linear stuff (the output of one filter is the input to the next) and are very well suited for the GPU. Encoding, particularly in advanced codecs like HEVC, has a whole lot of tight feedback loops where having stuff in the L1 cache is great. Just the round-trip latency from main memory to the GPU and back again slows things down way too much. And lots of choices impact other choices in other parts of the same frame; parallelizing that is really hard. One very fast core with very low memory latency can do a lot more than a single GPU "core." And while GPUs are good at processing a whole lot of bytes at once, AVX2 and AVX-512 offer similar capabilities in CPUs.
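
For a concrete sense of that CPU-side SIMD, here's a minimal sketch (not x265's actual code; the function name and the 32x32 block size are just for illustration) of a sum of absolute differences done with AVX2 intrinsics, the kind of per-block work that stays resident in a core's cache:

Code:
// Illustrative 32x32 SAD with AVX2 intrinsics; not taken from x265.
#include <immintrin.h>
#include <cstdint>

std::uint64_t sad_32x32_avx2(const std::uint8_t* cur, const std::uint8_t* ref, int stride)
{
    __m256i acc = _mm256_setzero_si256();
    for (int y = 0; y < 32; ++y)
    {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(cur + y * stride));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ref + y * stride));
        // _mm256_sad_epu8 produces four 64-bit partial sums of absolute differences
        acc = _mm256_add_epi64(acc, _mm256_sad_epu8(a, b));
    }
    // reduce the four 64-bit partial sums to one total
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    s = _mm_add_epi64(s, _mm_srli_si128(s, 8));
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(s));
}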

The latest Intel CPUs are really incredible devices for encoding HEVC. If anything, codecs are becoming less well suited for GPU versus CPU as they become more complex. MPEG-2 on GPU was pretty trivial, H.264 could fall short in quality at lower bitrates, and HEVC on GPU simply hasn't ever demonstrated high quality with high compression efficiency.

I don't think GPUs are going to be viable in the foreseeable future. FPGA seems like the most viable alternative to CPU.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 25th October 2017, 23:24   #6  |  Link
RanmaCanada
Registered User
 
Join Date: May 2009
Posts: 328
There is a fork that does exactly this (https://bitbucket.org/vovagubin/x265...r-cuda-encoder), but people here who have compiled and tried it have said it is no faster than using a CPU. I personally have no clue how to do such "magic", so I can't comment on that. I'm a 100% hardware person, so if you ask me to compile something you may as well be telling me to do rocket surgery.
RanmaCanada is offline   Reply With Quote
Old 25th October 2017, 23:31   #7  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
x264 got some OpenCL features, but from what I remember they barely help speed at all, and a more complex codec like HEVC will likely benefit even less. So overall, what benwaggoner said: modern codecs are too complex for that. You can of course do any pre-filtering on the GPU, if required, but that's really outside of the actual encoding process anyway.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
nevcairiel is offline   Reply With Quote
Old 26th October 2017, 05:31   #8  |  Link
x265_Project
Guest
 
Posts: n/a
Quote:
Originally Posted by nevcairiel View Post
x264 got some OpenCL features, but from what I remember they barely help speed at all, and a more complex codec like HEVC will likely benefit even less. So overall, what benwaggoner said: modern codecs are too complex for that. You can of course do any pre-filtering on the GPU, if required, but that's really outside of the actual encoding process anyway.
Yes, and we wrote the GPU acceleration for x264. GPU computing (heterogeneous computing) was the core competency that MulticoreWare was founded on. We even make the framework that some semiconductor companies use as their OpenCL (or similar heterogeneous computing) developer's API.

Traditional "GPU Computing" involves offloading tasks to the GPU. The GPU needs to be able to complete these tasks faster than they would be completed in the CPU, so that other tasks that depend on the result of the first task don't have to wait for the first task to complete. GPU cores (shaders, stream processors, EUs, or whatever you want to call them) are slower than CPU cores, but typically there are many times more GPU cores available. So single-threaded work doesn't accelerate well on a GPU. You need highly parallelizeable functions. You need tasks that are sufficiently large (not small units of work). And you need tasks that don't have serial dependencies.

Video encoding involves LOTS of serial dependencies. The block of pixels I'm trying to encode now has to reference neighboring blocks above and to the left, and we are also searching for references in previously encoded frames. All of those blocks and frames must be finished encoding before we can start encoding the block we are trying to encode now. The units of work involved in each task are relatively small. If I want to offload that work to a GPU sitting across a PCI bus, it will take 1 millisecond to send the work to the GPU and 1 millisecond to get the result back (the latency of a PCI bus). 1 millisecond is an eternity... 2.5 million CPU clock cycles, typically.
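
To put rough numbers on that round trip (the 1 ms latency and ~2.5 GHz clock below are the illustrative figures above, not measurements of any particular system):

Code:
// Back-of-the-envelope offload break-even; the numbers are illustrative assumptions.
#include <cstdio>

int main()
{
    const double cpu_hz      = 2.5e9;  // assumed ~2.5 GHz CPU clock
    const double one_way_sec = 1e-3;   // ~1 ms each way across the PCI bus
    const double round_trip  = 2.0 * one_way_sec;

    // CPU cycles that elapse while a dispatched task is in flight
    const double cycles_in_flight = cpu_hz * round_trip;

    std::printf("One-way latency: %.1f million CPU cycles\n", cpu_hz * one_way_sec / 1e6);
    std::printf("Round trip:      %.1f million CPU cycles\n", cycles_in_flight / 1e6);
    std::printf("An offloaded task has to save more than that to be worth dispatching.\n");
    return 0;
}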

When people talk about a GPU encoder, they may really be talking about a fixed function (hardware) encoder that is part of a graphics chip. But that's not software running in the GPU - that's a hardware encoder.

There are tricks we can use to work around some of these issues, but there are also many ways we can continue to accelerate x265, and we are looking at all of them.
  Reply With Quote
Old 26th October 2017, 08:59   #9  |  Link
Winston_Smith_101
Registered User
 
Join Date: Sep 2016
Posts: 16
Thank you so much for your detailed information! Now I understand better why the GPU is not used.
Winston_Smith_101 is offline   Reply With Quote
Old 26th October 2017, 23:15   #10  |  Link
hajj_3
Registered User
 
Join Date: Mar 2004
Posts: 1,120
Quote:
Originally Posted by x265_Project View Post
When people talk about a GPU encoder, they may really be talking about a fixed function (hardware) encoder that is part of a graphics chip. But that's not software running in the GPU - that's a hardware encoder.

There are tricks we can use to work around some of these issues, but there are also many ways we can continue to accelerate x265, and we are looking at all of them.
Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?
hajj_3 is offline   Reply With Quote
Old 27th October 2017, 01:04   #11  |  Link
Zebulon84
Registered User
 
Join Date: Apr 2015
Posts: 21
Quote:
Originally Posted by x265_Project View Post
If I want to offload that work to a GPU sitting across a PCI bus, it will take 1 millisecond to send the work to the GPU and 1 millisecond to get the result back (the latency of a PCI bus). 1 millisecond is an eternity... 2.5 million CPU clock cycles, typically.
Does an iGPU or APU have the same latency when calling GPU functions?
Do they use the PCI interface despite being on the same chip?
If so, is the newly announced Ryzen+Vega APU that uses Infinity Fabric to link the CPU and GPU going to improve things? Can this bring improvements to video encoding, or are the tasks still too small to benefit from the lower latency?
Zebulon84 is offline   Reply With Quote
Old 27th October 2017, 08:29   #12  |  Link
littlepox
Registered User
 
Join Date: Nov 2012
Posts: 218
Quote:
Originally Posted by hajj_3 View Post
Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?
The only possible way I can think of is, when you have multiple files to encode, to do some with x265 and the rest with its hardware counterpart.
littlepox is offline   Reply With Quote
Old 27th October 2017, 09:42   #13  |  Link
mariush
Registered User
 
Join Date: Dec 2008
Posts: 589
How about a scenario where, for example, the user wants to apply a filter like a resize to some content?

For example, the encoder takes in 4K footage and produces 1080p content. The encoder starts encoding on the CPU at 5-10 fps, and while this happens it could look ahead a few hundred frames and "upload" 100 frames or so to the video card to have the resizing done on the card instead of the processor. By going ahead, by the time the CPU encode reaches that offset the video card should already be done with the job and the resized frames should already be back in regular system memory. Then repeat the process with another batch of frames, or just upload frame by frame as frames are resized and transferred back into regular memory in some "memory pool" / buffer.
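
Roughly this kind of overlap (gpu_resize_batch and encode_batch below are made-up stubs standing in for the real GPU resize and encoder calls, not actual APIs):

Code:
// Sketch of overlapping GPU resizing of the next batch with CPU encoding of the
// current one. The stubs below do no real work; they only show the shape of it.
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

struct Frame { std::vector<std::uint8_t> pixels; };
using Batch = std::vector<Frame>;

Batch gpu_resize_batch(Batch in) { /* would upload, resize on the card, download */ return in; }
void  encode_batch(const Batch&) { /* would feed the resized frames to the encoder */ }

int main()
{
    std::vector<Batch> source(10, Batch(100));  // pretend: 10 batches of 100 frames each

    // resize batch 0 first, then overlap: resize batch i+1 while the CPU encodes batch i
    std::future<Batch> next = std::async(std::launch::async, gpu_resize_batch, source[0]);
    for (std::size_t i = 0; i < source.size(); ++i)
    {
        Batch ready = next.get();
        if (i + 1 < source.size())
            next = std::async(std::launch::async, gpu_resize_batch, source[i + 1]);
        encode_batch(ready);  // while this runs, the next batch is being "resized"
    }
    return 0;
}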

... not sure if there's any time saved considering you have to "upload" a few GB to the card, wait for the card to process the frames, then "download" the frames ...

Would there be a noticeable quality difference between resize algorithms done on the CPU and on the graphics card?

Or maybe this should not happen in the encoder but rather in the frame server or renderer that passes the frames to the encoder...
mariush is offline   Reply With Quote
Old 27th October 2017, 09:47   #14  |  Link
nevcairiel
Registered Developer
 
Join Date: Mar 2010
Location: Hamburg/Germany
Posts: 10,344
We've already talked about pre-processing/pre-filtering frames a few posts above, which would include resizing. Those are things GPUs are good at. But it's not really the encoder's job; you can do that before the frame gets to the actual encoder just fine.
__________________
LAV Filters - open source ffmpeg based media splitter and decoders
nevcairiel is offline   Reply With Quote
Old 27th October 2017, 13:02   #15  |  Link
Atak_Snajpera
RipBot264 author
 
Atak_Snajpera's Avatar
 
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
Originally Posted by mariush View Post
How about a scenario where, for example, the user wants to apply a filter like a resize to some content?

For example, the encoder takes in 4K footage and produces 1080p content. The encoder starts encoding on the CPU at 5-10 fps, and while this happens it could look ahead a few hundred frames and "upload" 100 frames or so to the video card to have the resizing done on the card instead of the processor. By going ahead, by the time the CPU encode reaches that offset the video card should already be done with the job and the resized frames should already be back in regular system memory. Then repeat the process with another batch of frames, or just upload frame by frame as frames are resized and transferred back into regular memory in some "memory pool" / buffer.

... not sure if there's any time saved considering you have to "upload" a few GB to the card, wait for the card to process the frames, then "download" the frames ...

Would there be a noticeable quality difference between resize algorithms done on the CPU and on the graphics card?

Or maybe this should not happen in the encoder but rather in the frame server or renderer that passes the frames to the encoder...
Resizing is already very fast on the CPU. I doubt that you will notice any difference in encoding speed. Other tasks, for example denoising (KNLMeansCL), are very slow even on high-end graphics cards like the GTX 1080 (~10 fps with a 1920x1080 frame).
Atak_Snajpera is offline   Reply With Quote
Old 27th October 2017, 19:11   #16  |  Link
benwaggoner
Moderator
 
Join Date: Jan 2006
Location: Portland, OR
Posts: 4,750
Quote:
Originally Posted by Atak_Snajpera View Post
Resizing is already very fast on the CPU. I doubt that you will notice any difference in encoding speed. Other tasks, for example denoising (KNLMeansCL), are very slow even on high-end graphics cards like the GTX 1080 (~10 fps with a 1920x1080 frame).
Well, doing a linear-light floating-point scale from 4K down to 400x224 using an area scaler is still pretty slow. And that's helpful when doing big compression ratios with HDR content, where you don't want any aliasing and want to preserve specular highlights as best as possible.
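
To show where the cost comes from, here's a minimal grayscale sketch of an area downscale done in linear light. The plain 2.4 power and the integer scale factor are simplifications, not the exact transfer function or geometry of any real pipeline; the point is simply that every pixel pays for a pow() on the way into and out of linear space:

Code:
// Toy area downscale in linear light on one 8-bit grayscale plane.
// Simplified: pure 2.4 gamma, integer scale factor only.
#include <cmath>
#include <cstdint>
#include <vector>

std::vector<std::uint8_t> downscale_area_linear(const std::vector<std::uint8_t>& src,
                                                int w, int h, int factor)
{
    const int ow = w / factor, oh = h / factor;
    std::vector<std::uint8_t> dst(static_cast<std::size_t>(ow) * oh);
    for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox)
        {
            double sum = 0.0;
            for (int dy = 0; dy < factor; ++dy)
                for (int dx = 0; dx < factor; ++dx)
                {
                    double v = src[(oy * factor + dy) * w + (ox * factor + dx)] / 255.0;
                    sum += std::pow(v, 2.4);          // gamma-encoded -> linear light
                }
            double lin = sum / (factor * factor);     // area average in linear light
            dst[oy * ow + ox] =
                static_cast<std::uint8_t>(std::lround(std::pow(lin, 1.0 / 2.4) * 255.0));
        }
    return dst;
}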

Good old integer 8-bit bicubic scaling is pretty trivial now, certainly.
__________________
Ben Waggoner
Principal Video Specialist, Amazon Prime Video

My Compression Book
benwaggoner is offline   Reply With Quote
Old 27th October 2017, 21:48   #17  |  Link
x265_Project
Guest
 
Posts: n/a
Quote:
Originally Posted by hajj_3 View Post
Is it not possible to let some tasks use the hardware encoder from intel/amd/nvidia and do the rest of the tasks on the cpu?
Yes, and we're working to develop such a hybrid, but right now it falls outside what we would push into x265 (as it involves a whole bunch of extra code to run the hardware encoder, get the analysis from the hardware encoder, and format the analysis so that x265 can use it as a starting point for a software encode). The trouble is that the quality of the analysis coming from hardware encoders is not great. That will improve over time.
  Reply With Quote
Old 30th October 2017, 08:18   #18  |  Link
iwod
Registered User
 
Join Date: Apr 2002
Posts: 756
People need to learn about Amdahl's law:

https://en.wikipedia.org/wiki/Amdahl%27s_law

To put it simply, speedup is limited by the total time needed for the sequential (serial) part of the program. For 10 hours of computing, if we can parallelize 9 hours and 1 hour cannot be parallelized, then our maximum speedup is limited to 10x.
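
A quick sketch of those numbers, assuming 90% of the work is parallelizable:

Code:
// Amdahl's law with a 90% parallel fraction: speedup is capped at
// 1 / (1 - 0.9) = 10x no matter how many cores are thrown at it.
#include <cstdio>

double amdahl_speedup(double parallel_fraction, double cores)
{
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores);
}

int main()
{
    const double cores[] = {2, 8, 64, 4096};
    for (double n : cores)
        std::printf("%6.0f cores -> %.2fx speedup\n", n, amdahl_speedup(0.9, n));
    return 0;
}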

And as explained, there isn't that much you can parallelize in a complex codec like HEVC.

The GPU isn't great at complex serial processing, and it possibly never will be. But it will continue to do more simple processing at the same time, which is why GPU performance scales roughly linearly with more transistors (until it eventually hits a bottleneck elsewhere, like memory bandwidth).

If we're doing more parallelization in the encoder, I would assume having more CPU cores will be better than doing it on the GPU. And you will need to thank AMD for that.
iwod is offline   Reply With Quote
Old 4th November 2017, 03:23   #19  |  Link
soresu
Registered User
 
Join Date: May 2005
Location: Swansea, Wales, UK
Posts: 196
It still confuses me how SIMD on a CPU (AVX) is fine but GPU SIMD is bad. Aren't they both intrinsically parallel in nature?

Would a general video encoding accelerator ever be possible, or would processes such as RDO and trellis and so forth be too specific to each codec to generalise for future proofing?

Somewhat like hardware-accelerated ray tracing, this seems to be an area with little commercial development going forward right now in the age of AI/inference accelerators.
soresu is offline   Reply With Quote
Old 4th November 2017, 13:29   #20  |  Link
Atak_Snajpera
RipBot264 author
 
Atak_Snajpera's Avatar
 
Join Date: May 2006
Location: Poland
Posts: 7,806
Quote:
It still confuses me how SIMD on a CPU (AVX) is fine but GPU SIMD is bad. Aren't they both intrinsically parallel in nature?
When you use CPU instructions you do not have to do extra copying of data from one memory to another (RAM <-> VRAM). That kills performance, according to x265_Project.

BTW, why does x265 not use OpenCL for the lookahead part, like x264 does?
Atak_Snajpera is offline   Reply With Quote