Vapoursynth tonemap [Archive]

jpsdr

8th August 2018, 14:00

I'm implementing the "standard" tonemap functions, using vapoursynth-tonemap as the reference for the formulas.

Reinhard has an exposure parameter, but I don't see it in the formula. Is that normal? :confused:

ChaosKing

8th August 2018, 14:51

https://github.com/ifb/vapoursynth-tonemap#reinhard

jpsdr

8th August 2018, 15:05

Yes, but in the code (file tonemap.c), the exposure parameter is not used in the formula.

ChaosKing

8th August 2018, 15:10

Yeah, look here :D
https://github.com/ifb/vapoursynth-tonemap/issues/4

ifb

8th August 2018, 16:13

Patches welcome. I don't actually use Reinhard so motivation is low. I should have some time to look at it this week, though.

jpsdr

8th August 2018, 18:20

In fact, what I wanted to know is the correct formula (if there is such a thing) using exposure.

ifb

8th August 2018, 20:26

I don't know that there is a correct formula. Exposure isn't present in vf_tonemap.c, but Reinhard's paper does talk about applying an initial gain to the image.

I kinda feel like it's a hack that shouldn't be needed. It adds yet another setting to tweak, but I guess it's helpful?
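For illustration, one common reading of "exposure" is a plain gain applied to the linear value before the basic Reinhard curve x / (1 + x). The sketch below assumes exactly that; the function name and the default value are made up for the example, and it is not necessarily the formula that vapoursynth-tonemap uses.

```python
def reinhard(x: float, exposure: float = 1.0) -> float:
    """Basic Reinhard curve with an exposure pre-gain (illustrative only)."""
    scaled = exposure * x             # assumed: exposure is a gain applied first
    return scaled / (1.0 + scaled)    # maps [0, inf) into [0, 1)


# Compare the curve with and without the extra gain.
for value in (0.0, 1.0, 4.0):
    print(value, reinhard(value), reinhard(value, exposure=2.0))
```

With this reading, raising the exposure brightens the mid-tones while the output still stays below 1.0.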

I don't use Reinhard or any of the others much except for testing. I've ended up just using a 3D LUT.

jpsdr

9th August 2018, 15:05

A 3D LUT is still sane on 8 bits, but with 10 or 12 bits...

ifb

10th August 2018, 06:21

The new release (https://github.com/ifb/vapoursynth-tonemap/releases/tag/R2) on GitHub has working Reinhard exposure now.

A 33-point LUT is faster for me (Skylake i5-6200U). Source is 1080p XAVC Intra 100.

source          210.25  YUV422P10
source->RGBS    130.05
3D LUT           63.44  RGBS
3D LUT           61.24  RGB48
3D LUT           61.44  RGB30
3D LUT           63.26  RGB24
Hable            57.04
Mobius           58.11
Reinhard         59.21

jpsdr

10th August 2018, 08:44

What do you mean by a 33-point LUT? If the input is 10 bits, a full 3D LUT needs at least 1024x1024x1024 entries x 3 channels x 2 bytes => 6 GB of memory for 16-bit output data (twice that for RGBS output data)!
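The arithmetic behind that estimate can be checked with a few lines. This is only a sketch that assumes one table entry per possible input triple and three output channels; the helper name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def full_lut_bytes(bits_per_channel: int, bytes_per_sample: int, channels: int = 3) -> int:
    """Size of a 3D LUT that stores one entry for every possible RGB input."""
    entries = (2 ** bits_per_channel) ** 3
    return entries * channels * bytes_per_sample


print(full_lut_bytes(8, 1) / MIB, "MiB for 8-bit input, 8-bit output")
print(full_lut_bytes(10, 2) / GIB, "GiB for 10-bit input, 16-bit output")
print(full_lut_bytes(10, 4) / GIB, "GiB for 10-bit input, 32-bit float output")
```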

videoh

10th August 2018, 13:26

3D LUTs are subsampled and output values are interpolated from the available input points. See here:

https://www.lightillusion.com/luts.html
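As a rough sketch of the idea, a 33-point LUT stores the transform only on a 33x33x33 grid and blends the 8 surrounding grid points for every other input (trilinear interpolation). The code below is plain Python with made-up helper names and a made-up example transform; real LUT filters may use other interpolation schemes (tetrahedral, for example) and a different memory layout.

```python
from itertools import product


def make_lut(size, transform):
    """Sample `transform` (an RGB -> RGB function) on a size^3 grid."""
    step = 1.0 / (size - 1)
    return [[[transform(r * step, g * step, b * step)
              for b in range(size)]
             for g in range(size)]
            for r in range(size)]


def lookup(lut, r, g, b):
    """Trilinearly interpolate the LUT at a normalized RGB point in [0, 1]."""
    size = len(lut)
    coords = []
    for v in (r, g, b):
        pos = min(max(v, 0.0), 1.0) * (size - 1)
        lo = min(int(pos), size - 2)    # lower grid index, keeps lo + 1 in range
        coords.append((lo, pos - lo))   # (grid index, fractional offset)
    out = [0.0, 0.0, 0.0]
    # Blend the 8 surrounding grid points, weighted by distance on each axis.
    for corner in product((0, 1), repeat=3):
        weight = 1.0
        idx = []
        for (lo, frac), d in zip(coords, corner):
            weight *= frac if d else 1.0 - frac
            idx.append(lo + d)
        sample = lut[idx[0]][idx[1]][idx[2]]
        for c in range(3):
            out[c] += weight * sample[c]
    return tuple(out)


# Hypothetical transform: a simple per-channel gain, only for demonstration.
lut = make_lut(33, lambda r, g, b: (0.5 * r, 0.5 * g, 0.5 * b))
print(lookup(lut, 0.2, 0.4, 0.9))   # close to (0.1, 0.2, 0.45)
print(33 ** 3 * 3 * 4, "bytes for 33^3 RGB entries stored as 32-bit floats")
```

The table stays well under 1 MB regardless of the input bit depth, at the cost of 8 reads and a weighted sum per pixel.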

jpsdr

10th August 2018, 14:47

Ah... Ok.

videoh

10th August 2018, 14:59

The theoretical performance benefit of using a 3D LUT versus implementing the underlying math is reduced by the need for accurate interpolation. In DGHDRtoSDR, performance is achieved with CUDA parallelism, so I don't need to bother with 3D LUTs. madVR supports both modes (3D LUTs and "pixel shader math").

jpsdr

10th August 2018, 17:24

Out of curiosity (as I have absolutely no idea), what is the average level of parallelism with CUDA? Are we talking about something like 16/32, or more like 64/128/256? I think it's probably defined by the GPU model and can vary from one GPU to another, but what about a typical current card?

ifb

10th August 2018, 18:13

For my use case, it's not even about performance. If the "pure" math approach could achieve the same results as the LUT, I would use it if only to avoid the .cube file dependency. As it is, I didn't create the LUT and I don't need/want to tweak a multi-step filter chain to try and match it right now.

Cary Knoop

10th August 2018, 18:38

For technical conversions of luminance curves and gamut, it does not make any sense to use a LUT; it is much better to use a formula.
And those conversions are not very performance heavy.

videoh

10th August 2018, 19:56

"Out of curiosity (as I have absolutely no idea), what is the average level of parallelism with CUDA? Are we talking about something like 16/32, or more like 64/128/256? I think it's probably defined by the GPU model and can vary from one GPU to another, but what about a typical current card?"

It's more like multiple thousands. My 1080 Ti has 3584 CUDA cores. With well-designed code, occupancy can reach 100% but is typically closer to 75%.

videoh

10th August 2018, 19:59

"And those conversions are not very performance heavy."

Nonsense.

Cary Knoop

10th August 2018, 20:11

Nonsense.
It's a simple mathematical formula that won't significantly tax any decent GPU.

Out of curiosity, what kind of operation would you not call performance heavy?

I would call operations performance heavy if they need neighboring pixels or compare values from previous or next frames, not single pixel/channel operations.

videoh

10th August 2018, 20:45

3D LUTs are used for all kinds of transforms, e.g., full HDR to SDR, which is the context of this thread. If you think the following process is not intensive, then I'll leave you to your own world:

Convert YCbCr to normalized float RGB.
Convert to linear float RGB (undo 2084).
Convert to a perceptually uniform color space.
Tonemap.
Perform gamut mapping using one of many different designs.
Convert back to linear float RGB.
Apply SDR gamma curve.
Convert back to YCbCr for output.

Tonemapping is one small piece of the puzzle.

The point of a LUT is that you can perform this process for each color value and store the results for lookup.
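To give a feel for the per-pixel math involved, here is a sketch of just one step from the list above, "undo 2084", using the PQ constants from SMPTE ST 2084. The function names are hypothetical, and the other steps (matrix conversions, tonemapping, gamut mapping, SDR gamma) are not shown.

```python
# PQ constants as defined in SMPTE ST 2084.
M1 = 2610 / 16384        # 0.1593017578125
M2 = 2523 / 4096 * 128   # 78.84375
C1 = 3424 / 4096         # 0.8359375
C2 = 2413 / 4096 * 32    # 18.8515625
C3 = 2392 / 4096 * 32    # 18.6875


def pq_to_nits(signal: float) -> float:
    """ST 2084 EOTF: non-linear PQ signal in [0, 1] -> luminance in cd/m^2."""
    p = min(max(signal, 0.0), 1.0) ** (1.0 / M2)
    return 10000.0 * (max(p - C1, 0.0) / (C2 - C3 * p)) ** (1.0 / M1)


def nits_to_pq(nits: float) -> float:
    """Inverse EOTF: luminance in cd/m^2 -> PQ signal in [0, 1]."""
    y = (min(max(nits, 0.0), 10000.0) / 10000.0) ** M1
    return ((C1 + C2 * y) / (1.0 + C3 * y)) ** M2


# Print the luminance for a few normalized code values.
for code in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(code, round(pq_to_nits(code), 2))
```

Each channel of each pixel needs several power functions for this step alone, which is the kind of cost a precomputed table avoids.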

Cary Knoop

10th August 2018, 20:50

I do those things in Resolve all the time and they hardly tax the GPU.

Not sure what you mean by converting to a "perceptually uniform color space". Converting to XYZ as an intermediate and back, together with gamut mapping of out-of-gamut values, should do the job just fine.

Cary Knoop

10th August 2018, 20:58

"As I said, I'll leave you to your own little world. 3D LUTs exist for a reason. It's not just that no-one is as clever as you."

That little world includes all professional color management tools, which use mathematical transforms for gamma and gamut conversions, not LUTs.

videoh

10th August 2018, 21:00

Enjoy your world. ;)

jpsdr's stuff is not GPU based so will you still advise him that LUTs are useless?

videoh

10th August 2018, 21:28

What is the achieved frame rate for 3840x2160?

TheFluff

10th August 2018, 23:29

import vapoursynth as vs
core = vs.core  # originally vs.get_core(), which is deprecated in later VapourSynth releases

# tested vs.RGBS, vs.RGBH, vs.RGB24 and vs.RGB48
clip = core.std.BlankClip(width=3840, height=2160, format=vs.RGBS, length=1000)
# apply the 3D LUT stored in the .cube file
clip = core.timecube.Cube(clip, cube=r"D:\Encode\vscube\test.cube")

clip.set_output()

RGB24 (8-bit int): 211.35 fps
RGB48 (16-bit int): 127.69 fps
RGBH (16-bit/half precision float): 129.5 fps
RGBS (32-bit/single precision float): 56.65 fps

Intel i7-8700K @ 4.9 GHz, 16GB DDR4-3200. Using 12 threads (or, more precisely, 12 simultaneous requests in vspipe). That many threads is definitely suboptimal for RGBS though; I tried it with 2 threads and it went up to 65-ish FPS. Judging by a quick and dirty test, for RGBH it doesn't scale beyond 6 threads, while RGB24 can benefit from 8-10 threads but not from all 12. It's most likely memory/cache bound.

videoh

10th August 2018, 23:42

Thank you, TheFluff.

I will get the 3D LUT filter and run some experiments.

BTW, my DGHDRtoSDR() with DGSource() delivering YUV420P16 coincidentally also runs at 56 fps under Vapoursynth (i7 7700K + 1080 Ti). :)

videoh

11th August 2018, 00:18

Any idea why the vpy script would fail with:

vapoursynth.Error: LoadLibraryEx failed with code 87: update windows and try again

when loading vscube.dll?

I did retarget the Windows SDK to the one I had on my machine (10.0.16299.0) but it built OK. Would that cause it?

TheFluff

11th August 2018, 00:28

Maybe? This is Win7, no? I think the usual way of fixing that particular LoadLibraryEx misbehavior is installing either KB2533623 (https://www.microsoft.com/en-us/download/details.aspx?id=26767) or IE11 or both, I dunno.

videoh

11th August 2018, 00:45

OK, thanks, changed to full path for vscube.dll and it went away. I'm on latest Win10. Weird.

TheFluff

11th August 2018, 22:28

It was pointed out to me that I forgot to pass "keep=True" to BlankClip. Since the script is memory bound, passing that parameter pretty much doubles the performance, so the fps numbers quoted above are about half of what the filter itself is actually capable of.