AviSynthShader + SuperRes - Page 2

Ghostlamer · 8th October 2015, 12:27

Is there any way to make it work with rfactor larger than 2?
My script:

Code:

XviD4PSPPluginsPath = "C:\Program Files (x86)\XviD4PSP 5\dlls\AviSynth\plugins\"

LoadPlugin(XviD4PSPPluginsPath+"nnedi3.dll")
LoadPlugin(XviD4PSPPluginsPath+"Shader.dll")

Import(XviD4PSPPluginsPath+"SuperRes.avsi")

SetMTMode(3,3)


AviSource("F:\2K\00.avi", audio=false, pixel_type="YV12")



SetMTMode(2)

SuperRes(2, 1, 0, false, """nnedi3_rpow2(rfactor=4, cshift="Spline16Resize", Threads=1)""")

Spline64Resize(2560,1380)

Without MT mode and with rfactor=4 it runs stably, but at only 0.80 fps with 6% CPU usage and 0% GPU. If I use rfactor=4 and SetMTMode with more than 2 threads, it crashes; with 2 threads I get 1.40-1.60 fps and very low CPU and GPU usage.
Resolution of the original video: 712x384.

MysteryX · 8th October 2015, 17:14

It crashes because you're going past the 2GB memory limit. The code will need to be optimized.

Performance can be improved by rewriting the functions that convert frames to/from float. The float/half-float conversion can be done with a buffer instead of one value at a time, which would probably increase performance. Having an HLSL Bicubic resize function would also help.

As far as memory usage, I'm not sure what can be done. Each DirectX 9 device is creating its own threads and managing its own memory. A DX9 device is created each time a Shader is called. If there are 8 shader calls within SuperRes, and 4 threads, then that's 32 DX9 devices.
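
As a rough illustration of the arithmetic above: the device count follows directly from the post, while the per-surface size is only an assumption (one RGBA surface at 16 bits per channel, at the upscaled size of the 712x384 clip from the first post with rfactor=4). The helper names are hypothetical, and the real plugin may hold more or fewer buffers per device.

```python
# Back-of-the-envelope estimate only. The "one surface per device" model is
# an assumption for illustration, not a measurement of the plugin.

def device_count(shader_calls: int, threads: int) -> int:
    """One DX9 device per shader call per thread, as described in the post."""
    return shader_calls * threads


def surface_bytes(width: int, height: int,
                  channels: int = 4, bytes_per_channel: int = 2) -> int:
    """Size of one surface with 16 bits per channel (e.g. half-float RGBA)."""
    return width * height * channels * bytes_per_channel


devices = device_count(shader_calls=8, threads=4)
one_surface = surface_bytes(712 * 4, 384 * 4)  # 2848x1536 after rfactor=4
total_mib = devices * one_surface / 2**20

print(f"{devices} devices, {one_surface / 2**20:.1f} MiB per surface, "
      f"{total_mib:.0f} MiB if each device held a single surface")
```

Even under this simplified model the total lands in the same order of magnitude as a 32-bit process limit, which is why reducing the number of devices or threads is worth trying first.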

You can analyze your script with AVSMeter. Using 1 pass instead of 2 also will increase performance. According to my tests, NNEDI3 works best with 2 threads.

Try this
SuperRes(1, .85, 0, false, """nnedi3_rpow2(rfactor=4, cshift="Spline16Resize", Threads=2)""")

Khanattila · 8th October 2015, 17:42

Quote:

Originally Posted by MysteryX

Fixed float-byte rounding to be more accurate by adding .5f before rounding. Slight performance improvement.

This results in the colors being slightly brighter, and the SuperRes Diff map to be more accurate which slightly improve its effectiveness.

As far as writing a native AviSynth version, I don't know if that would work. Originally, Shiandow was using the Lab colorspace which definitely requires half-float processing. He finally dropped it to use RGB Linear (not Gamma) colorspace. I doubt the YUV-RGB conversion could be avoided, and from my tests processing it with non-float data, the quality is considerably lower. This algorithm is very sensitive to details and must be processed with half-float precision. In that sense, perhaps native approaches wouldn't even be better than this. The GPU is much better at processing float data than the CPU.

Have you tried normalizing the data to uint32_t, without using floating-point numbers?
If you are working on the CPU, it is much faster.
unorm32 = (UINT32_MAX * (value - VALUE_MIN)) / (VALUE_MAX - VALUE_MIN)

If for example VALUE_MIN is 0 and VALUE_MAX is 255.
unorm32 = (4294967295 * value) / 255 = 16843009 * value
0 --> 0
1 --> 16843009
2 --> 33686018
...
255 --> UINT32_MAX
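
The mapping above can be sketched as follows. The function names are hypothetical, and the inverse with round-to-nearest is an addition not taken from the post. Note that in C or C++ the product `UINT32_MAX * value` does not fit in 32 bits, so a 64-bit intermediate would be needed there; Python integers do not overflow.

```python
UINT32_MAX = 0xFFFFFFFF  # 4294967295


def to_unorm32(value: int, value_min: int = 0, value_max: int = 255) -> int:
    """Map [value_min, value_max] linearly onto [0, UINT32_MAX]."""
    return (UINT32_MAX * (value - value_min)) // (value_max - value_min)


def from_unorm32(unorm: int, value_min: int = 0, value_max: int = 255) -> int:
    """Inverse mapping, rounded to the nearest integer (an added assumption)."""
    span = value_max - value_min
    return value_min + (unorm * span + UINT32_MAX // 2) // UINT32_MAX


for v in (0, 1, 2, 255):
    print(v, "-->", to_unorm32(v))
```

For the 0..255 case the scale factor is exactly 16843009 (0x01010101), so the conversion reduces to a single integer multiply.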

Groucho2004 · 8th October 2015, 18:16

Quote:

Originally Posted by Khanattila

Have you tried normalizing the data to uint32_t, without using floating-point numbers?
If you are working on the CPU, it is much faster.

Indeed. What's up with the obsession of some folks using floats lately? Even 64 bit (u)int is faster than 32 bit float.

MysteryX · 8th October 2015, 19:04

Quote:

Originally Posted by Khanattila

Have you tried normalizing the data to uint32_t, without using floating-point numbers?

I could experiment more with that; the shader processing can be done with uint data. However, the way color conversion is currently done creates overflow, so that won't work just yet; the color conversion would have to be fixed first. The HLSL color conversion code should avoid overflow, but I was seeing weird behavior when I tried using it, so that's not working yet.

If we can get color conversion to avoid overflow and work properly, then we could try processing with uint data and see what performance difference it makes.

MysteryX · 8th October 2015, 19:35

I'm leaving for China for 2 weeks and won't be playing with this. If someone wants to look into the code, you could look into:
1. Getting the HLSL color conversion to work; or having CPU conversion that avoids overflows
2. Getting the shader to run with uint data (changing buffer format); which requires not having overflows
3. Optimizing the ConvertToFloat and ConvertFromFloat functions. Precision could have 3 values for ConvertToFloat, ConvertFromFloat and Shader functions: 1 (8-bit per channel), 2 (16-bit uint per channel) or 3 (16-bit float per channel)

luigizaninoni · 8th October 2015, 19:37

I must be missing something obvious, but do you have any idea why AvsPMod is giving this error:

LoadPlugin: unable to load "c:\users\admin\desktop\shaders\shader.dll",Module not found. Install missing library ?

shader.dll is definitely in that directory

My script:
LoadPlugin("C:\Users\admin\Desktop\Video\Staxrip\Applications\DGMPGDec\DGDecode.dll")
LoadPlugin("c:\users\admin\desktop\shaders\shader.dll")
MPEG2Source("C:\Users\admin\Desktop\10-07-05-40-01-mozzibrb-TELECOLOR temp files\10-07-05-40-01-mozzibrb-TELECOLOR.d2v",cpu=6,ipp=true,moderate_h=40,moderate_v=60,idct=5)
Crop(2, 2, -2, -2)
QTGMC(Preset="Slow")
SelectEven()
SuperRes(2, 0.85, 0, true, """nnedi3_rpow2(rfactor=2, cshift="Spline16Resize", Threads=2)""", "C:\users\admin\desktop\shaders\")

Ghostlamer · 8th October 2015, 20:14

Quote:

It crashes because you're going past the 2GB memory limit. The code will need to be optimized.

I'm using VirtualDub, and when the crash occurs the vdub process is using only 800-1100 MB.
The script works with SetMTMode 3 or 5 instead of 2 (with more than 2 threads), but with mode 3 the output video is buggy (4-5 fps) and with mode 5 performance is low (0.5-0.8 fps).

Quote:

Try this
SuperRes(1, .85, 0, false, """nnedi3_rpow2(rfactor=4, cshift="Spline16Resize", Threads=2)""")

Thanks for the advice.

MysteryX · 8th October 2015, 20:41

btw, if anyone wants to play with the code, it's pretty simple, but you need
- DirectX SDK
- Visual Studio (I'm making very little use of C++, it could be adapted to standard C with little changes)

GitHub allows you to download the source code, make your own changes and upload your contributions to the code. It takes some time to learn how to use, but it is then very useful for collaborative projects. TortoiseGit makes it much easier to use.

Groucho2004 · 8th October 2015, 20:51

Quote:

Originally Posted by luigizaninoni

I must be missing something obvious, but do you have any idea why AvsPMod is giving this error:

LoadPlugin: unable to load "c:\users\admin\desktop\shaders\shader.dll",Module not found. Install missing library ?

shader.dll is definitely in that directory

Which OS are you using? If you're using Vista or above, use Dependency Walker to find out what's missing, possibly some of the DX stuff.

Ghostlamer · 8th October 2015, 20:59

MysteryX, can the triple quotes be avoided? I ask because there is a script, mt_pipeline ( http://forum.doom9.org/showthread.php?t=163281 ), that allows bypassing the 2 GB limit, but it also uses triple quotation marks and conflicts with SuperRes (it works very well with many other filters). I'm no AviSynth guru, so I don't know much.

MysteryX · 8th October 2015, 21:34

I tried applying the 4GB patch to AVSMeter.exe, and it didn't work. The flag D3DXCONSTTABLE_LARGEADDRESSAWARE must also be added within the DX9 source code to make that work. Not sure why it hasn't worked yet.

Groucho2004 · 8th October 2015, 21:47

Quote:

Originally Posted by MysteryX

I tried applying the 4GB patch to AVSMeter.exe, and it didn't work.

LARGEADDRESSAWARE is one of the linker options I use for the 32-bit binary. No need to patch.

luigizaninoni · 8th October 2015, 22:14

Quote:

Originally Posted by Groucho2004

Which OS are you using? If you're using Vista or above, use Dependency Walker to find out what's missing, possibly some of the DX stuff.

Problem solved. d3dx9_43.dll was actually missing. Thank you very much for your kind advice
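
For anyone hitting the same "Module not found" error, a quick check along these lines may help narrow it down before reaching for Dependency Walker. This is only a sketch: `missing_libraries` is a hypothetical helper, and unlike Dependency Walker it does not walk a plugin's import table; it just asks the OS loader whether each named library can be loaded.

```python
import ctypes


def missing_libraries(names):
    """Return the library names that the OS loader fails to load."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            # Raised when the file, or one of its own dependencies, is not found.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # d3dx9_43.dll is the file that turned out to be missing in this thread.
    print(missing_libraries(["d3dx9_43.dll"]))
```

The Python interpreter must have the same bitness as the plugin being diagnosed, otherwise a DLL of the wrong architecture also shows up as missing.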

MysteryX · 8th October 2015, 22:39

Quote:

Originally Posted by luigizaninoni

Problem solved. d3dx9_43.dll was actually missing. Thank you very much for your kind advice

How come that file was missing? Isn't it a system file that "should" already be there?

I just did an experiment with processing frames with int data instead of float. When initializing the device, I replaced the format D3DFMT_A16B16G16R16F by D3DFMT_A16B16G16R16. The performance is slightly faster but not that much. We might save some more in the data conversion. Obviously, with this test, the image was corrupt because it was processing half-float data as if it was uint, but it looked better than I would have expected.
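
To see why half-float data read as uint gives a corrupt image, here is a small sketch using Python's `struct` module (format `e` is IEEE 754 half precision). The helper names are hypothetical; this only reinterprets bit patterns on the CPU and does not involve Direct3D.

```python
import struct


def half_bits(value: float) -> int:
    """Raw 16-bit pattern of an IEEE 754 half-precision float."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def as_unorm16(bits: int) -> float:
    """Interpret a 16-bit pattern as a normalized unsigned integer in [0, 1]."""
    return bits / 65535.0


for v in (0.0, 0.25, 0.5, 1.0):
    bits = half_bits(v)
    print(f"{v:4.2f} -> 0x{bits:04X} -> read as unorm16: {as_unorm16(bits):.3f}")
```

For non-negative values the bit pattern grows monotonically with the float value, so brightness ordering is preserved even though the levels are wrong. That is one plausible reason the result looked better than expected.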

There is definitely a bottleneck somewhere and it isn't the half-float shader processing.

In terms of numbers, I'm using this script

Quote:

SetMTMode(3,4)
AviSource("Preview.avi", audio=false, pixel_type="YV12")
SetMTMode(2)
SuperRes(2, .42, 0, true, """nnedi3_rpow2(2, cshift="Spline16Resize", Threads=2)""")
Distributor()

With D3DFMT_A16B16G16R16 or D3DFMT_A16B16G16R16F, I get almost exactly the same numbers: 12fps @ 53% CPU. Memory usage also is the same.

One thing I found out is that creating a .def file (such as AVSMeter.def) with "STACKSIZE 512KB" in it slightly increases performance.

I also just did another quick test: removing the half-float conversions. It still calculates in float but then converts into short. Performance went way up from 12fps to 15-16fps.

Shiandow · 8th October 2015, 23:32

Quote:

Originally Posted by Khanattila

Have you tried normalizing the data to uint32_t, without using floating-point numbers?
If you are working on the CPU, it is much faster.
unorm32 = (UINT32_MAX * (value - VALUE_MIN)) / (VALUE_MAX - VALUE_MIN)

If for example VALUE_MIN is 0 and VALUE_MAX is 255.
unorm32 = (4294967295 * value) / 255 = 16843009 * value
0 --> 0
1 --> 16843009
2 --> 33686018
...
255 --> UINT32_MAX

Quote:

Originally Posted by Groucho2004

Indeed. What's up with the obsession of some folks using floats lately? Even 64 bit (u)int is faster than 32 bit float.

Well, the original SuperRes code (designed for MPDN) used 16-bit uint for most of the processing. It does store intermediate results in float, but that conversion is handled by the GPU itself. I'm not even sure that part is necessary; signed ints would probably work just as well.

However the shaders will still use floats (single precision) internally. And as far as I know GPUs aren't that good at integer (or fixed point) arithmetic, but maybe that's changed.

Khanattila · 8th October 2015, 23:43

Quote:

Originally Posted by Shiandow

Well, the original SuperRes code (designed for MPDN) used 16-bit uint for most of the processing. It does store intermediate results in float, but that conversion is handled by the GPU itself. I'm not even sure that part is necessary; signed ints would probably work just as well.

However the shaders will still use floats (single precision) internally. And as far as I know GPUs aren't that good at integer (or fixed point) arithmetic, but maybe that's changed.

GPUs are TERRIBLE with integers, but they have a fast internal conversion from integer to float.

Like KNLMeansCL, this is the way forward:
Read Integer Buffer --> GPU internal Conversion to normalized float --> Processing float --> GPU internal Conversion to integer --> Write to Integer Buffer.

Anyway, in this case it is better not to use float at all than to convert on the CPU.
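
The integer-in, float-inside, integer-out pipeline can be mimicked on the CPU to show what the conversions do. This is a sketch with hypothetical names: on a GPU the two conversions are done by the hardware when reading and writing normalized integer formats, and the `gain` multiply is just a placeholder for real shader math. The write-back also uses the "+0.5 before truncating" rounding mentioned earlier in the thread.

```python
def unorm8_to_float(v: int) -> float:
    """Integer buffer value 0..255 -> normalized float 0.0..1.0."""
    return v / 255.0


def float_to_unorm8(x: float) -> int:
    """Normalized float -> integer 0..255, clamped and rounded to nearest."""
    x = min(max(x, 0.0), 1.0)
    return int(x * 255.0 + 0.5)  # +0.5 turns truncation into rounding


def process(pixels, gain):
    """Read integers, process as float, write integers back."""
    out = []
    for v in pixels:
        x = unorm8_to_float(v)
        x = x * gain  # placeholder for the actual processing
        out.append(float_to_unorm8(x))
    return out


print(process([0, 64, 128, 255], gain=1.0))
print(int(0.999 * 255.0), float_to_unorm8(0.999))  # truncation vs rounding
```

Without the +0.5, a value such as 0.999 truncates to 254 instead of 255, which is the kind of small darkening the rounding fix addresses.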

MysteryX · 9th October 2015, 01:06

I wouldn't be surprised if the DX9 function to convert half-float data is delegated to the GPU then, and that's what the buffer-processing function is for. If that's the case, then right now I'm sending commands to the GPU one by one. If I batch them into a buffer to be processed all at once, then performance would probably be MUCH better. Worth a try!

MysteryX · 9th October 2015, 02:51

I edited ConvertToFloat to use a buffer for half-float conversions. It still calculates as float (which could be optimized by calculating in int instead), stores all the data in a large float buffer, converts the whole frame at once, then copies it back into the frame. ConvertFromFloat doesn't have those changes yet.

That change brought the performance up from 12fps to 14.5fps.
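
The difference between converting one value at a time and converting a whole buffer can be sketched like this. It is a CPU-only analogy in Python with hypothetical function names, not the plugin's code; the conversion call the plugin actually uses is not named in the thread.

```python
import struct


def pack_half_one_by_one(values):
    """Convert each float to half precision with a separate call."""
    out = bytearray()
    for v in values:
        out += struct.pack("<e", v)
    return bytes(out)


def pack_half_buffered(values):
    """Convert the whole buffer to half precision in a single call."""
    return struct.pack(f"<{len(values)}e", *values)


frame = [0.0, 0.25, 0.5, 1.0] * 4  # stand-in for one frame of float data
assert pack_half_one_by_one(frame) == pack_half_buffered(frame)
print(len(pack_half_buffered(frame)), "bytes for", len(frame), "values")
```

Both paths produce identical bytes; the buffered version simply pays the per-call overhead once per frame instead of once per value.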

MysteryX · 9th October 2015, 03:36

ConvertToFloat and ConvertFromFloat are now both using a buffer for half-float conversion. Performance is now 18.5fps instead of 12fps. It could be further improved by calculating int data instead of float.

With this optimization, the CPU usage is also now higher, even with only 4 threads, so the whole script is running considerably faster.

Edit: I adapted ConvertToFloat to calculate color conversion with INT instead of FLOAT, performance further increased.
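
As an illustration of doing the color conversion with integers, here is a fixed-point YUV to RGB sketch. The coefficients are the widely used 8-bit-shift approximation of limited-range BT.601 and are an assumption; the thread does not show which matrix or scaling the plugin actually uses. A float version is included only for comparison.

```python
import math


def clamp8(x: int) -> int:
    """Clamp to the 0..255 range of an 8-bit channel."""
    return 0 if x < 0 else 255 if x > 255 else x


def yuv_to_rgb_int(y: int, u: int, v: int):
    """Fixed-point conversion: coefficients scaled by 256, +128 for rounding."""
    c, d, e = y - 16, u - 128, v - 128
    r = (298 * c + 409 * e + 128) >> 8
    g = (298 * c - 100 * d - 208 * e + 128) >> 8
    b = (298 * c + 516 * d + 128) >> 8
    return clamp8(r), clamp8(g), clamp8(b)


def yuv_to_rgb_float(y: int, u: int, v: int):
    """Floating-point reference using the unscaled coefficients."""
    c, d, e = y - 16, u - 128, v - 128
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return tuple(clamp8(math.floor(x + 0.5)) for x in (r, g, b))


print(yuv_to_rgb_int(16, 128, 128), yuv_to_rgb_int(235, 128, 128))
print(yuv_to_rgb_int(81, 90, 240), yuv_to_rgb_float(81, 90, 240))
```

The two versions agree to within one code value per channel, while the integer path needs only multiplies, adds and a shift.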
