
 Doom9's Forum HBD/OPTIMIZATION Request :)

 26th January 2023, 03:17 #2  |  Link kedautinh12 Registered User   Join Date: Jan 2018 Posts: 1,891 Ideas from Asd-g https://github.com/Asd-g/AviSynthPlu...ent-1404461580
 26th January 2023, 17:53 #4  |  Link Ceppo Registered User   Join Date: Feb 2016 Location: Nonsense land Posts: 331 Thanks DTL! I will apply your tips this evening and post the update tomorrow. __________________ CQTGMC/CTools I come from nonsense land. I usually post under the influence of alcohol and I don't think before writing, so don't take it personally; I didn't mean it.
 26th January 2023, 18:05 #5  |  Link DTL Registered User   Join Date: Jul 2018 Posts: 979
Also, for integer samples
Code:
i = (int)(i / 16.0f + 0.5f);
may be replaced with an integer shift, avoiding the conversion to float and back (and the slow float division, which I hope any good compiler replaces with a multiplication by a float constant):
Code:
i = i >> 4;
(add 8 before the shift if you want to keep the rounding of the float version). The same applies to the SIMD functions in the future: use a logical right shift by 4 bits for unsigned ints, or an arithmetic (sign-extending) right shift if the value can be negative. If you are doing some sort of Gaussian blur, the sum is always positive, so you can use unsigned integers (and gain 1 extra bit of headroom against overflow in dense SIMD calculations).

For 8-bit unsigned samples up to 255
Code:
i = srcp[j] + srcp[x] * 2 + srcp[k];
i += srcc[j] * 2 + srcc[x] * 4 + srcc[k] * 2;
i += srcn[j] + srcn[x] * 2 + srcn[k];
the sum has a possible maximum of (1+2+1+2+4+2+1+2+1)*255 = 4080, so you can use 16-bit unsigned integers as the accumulator for the whole block convolution in SIMD processing. You need float calculation only in the float32 samples function. Also, the +0.5f for better rounding is not needed for floats.

For the float32 samples version it is much better to use something like
Code:
const float my1div16 = 1.0f / 16.0f;
i *= my1div16;
or maybe better
Code:
i *= 0.0625f; // 1.0f / 16.0f
Or pray to the gods of the C compiler to do the same thing for you in the optimized release build if you write a direct division by a float constant like i /= 16.0f.
Last edited by DTL; 26th January 2023 at 18:38.
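A minimal sketch of the integer path described in this post, assuming non-negative 8-bit samples. The helper names (`kernel_sum`, `div16_round`, `div16_trunc`, `div16_float`, `div16_f32`) are made up for illustration and are not taken from the filter's source. Note that a plain `>> 4` truncates, so the sketch adds 8 before shifting to reproduce the rounding of `(int)(i / 16.0f + 0.5f)`.

```cpp
#include <cstdint>

// Weighted 3x3 sum with the 1-2-1 / 2-4-2 / 1-2-1 weights from the post.
// Worst case for 8-bit input is 16 * 255 = 4080, which fits in uint16_t.
inline std::uint16_t kernel_sum(const std::uint8_t* prev,
                                const std::uint8_t* cur,
                                const std::uint8_t* next,
                                int x)
{
    const int j = x - 1; // left neighbour
    const int k = x + 1; // right neighbour
    const int sum = prev[j] + prev[x] * 2 + prev[k]
                  + cur[j] * 2 + cur[x] * 4 + cur[k] * 2
                  + next[j] + next[x] * 2 + next[k];
    return static_cast<std::uint16_t>(sum);
}

// Integer-only division by 16 with round-to-nearest (bias of 8, then shift).
inline std::uint8_t div16_round(std::uint16_t sum)
{
    return static_cast<std::uint8_t>((sum + 8u) >> 4);
}

// Plain shift: cheaper by one add, but truncates toward zero.
inline std::uint8_t div16_trunc(std::uint16_t sum)
{
    return static_cast<std::uint8_t>(sum >> 4);
}

// Reference: the float round trip that the shift is meant to replace.
inline std::uint8_t div16_float(std::uint16_t sum)
{
    return static_cast<std::uint8_t>(static_cast<int>(sum / 16.0f + 0.5f));
}

// float32 samples: multiply by the reciprocal, no rounding bias needed.
inline float div16_f32(float v)
{
    return v * 0.0625f; // 1.0f / 16.0f
}
```

For every possible sum from 0 to 4080, `div16_round` gives the same result as `div16_float`, while `div16_trunc` differs whenever the low four bits are 8 or more.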
 27th January 2023, 06:05 #9  |  Link kedautinh12 Registered User   Join Date: Jan 2018 Posts: 1,891 You can try builds with other compilers for speed (clang, gcc, ...). And after CSharpen gets HBD and optimization, can you do the same for CTools? Last edited by kedautinh12; 27th January 2023 at 06:08.
 27th January 2023, 07:16 #10  |  Link Ceppo Registered User   Join Date: Feb 2016 Location: Nonsense land Posts: 331 Thanks, that is the whole point of this filter.
 27th January 2023, 10:18 #12  |  Link StainlessS HeartlessS Usurer     Join Date: Dec 2009 Location: Over the rainbow Posts: 10,806 From the little I remember about C++ (about 4 weeks of study back in 1996), when you fully define a member function inside a class declaration, it is a hint to the compiler that you want it inlined {the compiler is not compelled to inline it, and may only do so if it is a reasonably small function}. However, C++ has changed a bit since the 90s. Thanks for the thread guys, it is quite interesting, and maybe a potential sticky contender for optimising code for HBD and SIMD. EDIT: "sticky contender": ideally, it would have been a simpler filter like a simple Average() [or similar], to better concentrate on the optimisation. EDIT: Or an even simpler Invert() style filter. __________________ I sometimes post sober. StainlessS@MediaFire ::: AND/OR ::: StainlessS@SendSpace "Some infinities are bigger than other infinities", but how many of them are infinitely bigger ??? Last edited by StainlessS; 27th January 2023 at 10:32.
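A short sketch of the in-class definition rule mentioned in this post. The class `Scaler` is hypothetical and exists only for illustration. A member function defined inside the class body is implicitly `inline`; one defined outside needs the keyword spelled out if it lives in a header. Either way the compiler remains free to decide whether to actually inline the call.

```cpp
// Hypothetical example class, not part of any filter discussed here.
class Scaler {
public:
    explicit Scaler(int factor) : factor_(factor) {}

    // Defined inside the class body: implicitly inline.
    int apply(int v) const { return v * factor_; }

    // Only declared here; the definition follows below.
    int apply_twice(int v) const;

private:
    int factor_;
};

// Defined outside the class body: 'inline' has to be written explicitly
// if this definition is placed in a header included from several files.
inline int Scaler::apply_twice(int v) const
{
    return apply(apply(v));
}
```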
 27th January 2023, 10:45 #13  |  Link Reel.Deel Registered User   Join Date: Mar 2012 Location: Texas Posts: 1,646 I'm not a programmer but I thought I'd share this. While working on the avs+ docs, I had to scroll back through the commit history. From there you can see how the code changed for all of the filters. For example, here is the first change pinterf made to the blur/sharpen filters to support 16-bit: https://github.com/AviSynth/AviSynth...8f4d95b931c7f8. Here's the "luma" mode of Histogram when HBD was added: https://github.com/AviSynth/AviSynth...4eaaab12918e81. All of the internal filters have changes like these to look at, starting from when they were 8-bit only.
 27th January 2023, 21:08 #14  |  Link Ceppo Registered User   Join Date: Feb 2016 Location: Nonsense land Posts: 331
This SEEMS to work, but of course it's probably not how you are supposed to do it. BTW, thanks for all the info; once I have figured out this HBD stuff, I will treasure it.
Code:
#include <windows.h>
#include <avisynth.h>

class InvertNeg : public GenericVideoFilter {
public:
    InvertNeg(PClip _child, IScriptEnvironment* env) : GenericVideoFilter(_child) { }

    PVideoFrame __stdcall GetFrame(int n, IScriptEnvironment* env) {
        PVideoFrame dst = env->NewVideoFrame(vi);
        PVideoFrame src = child->GetFrame(n, env);
        auto c = (1 << vi.BitsPerComponent()) - 1;
        int planes[] = { PLANAR_Y, PLANAR_V, PLANAR_U };
        for (int p = 0; p < 3; p++) {
            auto srcp = src->GetReadPtr(planes[p]);
            auto dstp = dst->GetWritePtr(planes[p]);
            auto height = src->GetHeight(planes[p]) * vi.ComponentSize();
            auto row_size = src->GetRowSize(planes[p]) / vi.ComponentSize();
            auto src_pitch = src->GetPitch(planes[p]) / vi.ComponentSize();
            auto dst_pitch = dst->GetPitch(planes[p]) / vi.ComponentSize();
            for (int y = 0; y < height; y++) {
                for (int x = 0; x < row_size; x++) {
                    dstp[x] = srcp[x] ^ c;
                }
                srcp += src_pitch;
                dstp += dst_pitch;
            }
        }
        return dst;
    }
};

AVSValue __cdecl Create_InvertNeg(AVSValue args, void* user_data, IScriptEnvironment* env) {
    return new InvertNeg(args[0].AsClip(), env);
}

const AVS_Linkage* AVS_linkage = 0;

extern "C" __declspec(dllexport) const char* __stdcall AvisynthPluginInit3(IScriptEnvironment* env, const AVS_Linkage* const vectors) {
    AVS_linkage = vectors;
    env->AddFunction("InvertNeg", "c", Create_InvertNeg, 0);
    return "InvertNeg sample plugin";
}
 27th January 2023, 22:32 #15  |  Link DTL Registered User   Join Date: Jul 2018 Posts: 979
Code:
height = src->GetHeight(planes[p]) * vi.ComponentSize();
This looks like an error. The number of lines in a frame (rows in the storage buffer) does not depend on bit depth; only the row length in memory and the pitch, both measured in 8-bit bytes, do. With this line the loop will run far past the end of the buffer (and will typically hit a hardware memory protection error when it reaches the next 4 kB memory page). It may appear to work with very small frame sizes, but it corrupts memory beyond the actual buffer length.

As for the 'auto' pointer types: I am not sure the compiler really knows how many types you need to support, and it may not compile 3 'real' versions of the function. It may just take the unsigned char type from the AVS headers. If you want to use 'templating' you need to declare a template and 3 real functions for the types unsigned char, unsigned short and float somewhere.

As I remember, first we declare a 'template function' like https://github.com/DTL2020/mvtools/b...olation.h#L137 in the header, with the pixel_t parameter as our bit depth. Next comes the function implementation, https://github.com/DTL2020/mvtools/b...tion.cpp#L2791, using pixel_t as the data type parameter. It is an example of a 'universal HBD' C function for all data types. See how src and dst are accessed via the pixel_t type:
Code:
pixel_t* pctDst = reinterpret_cast<pixel_t*>(pDst);
const pixel_t* pSrc;
pSrc = reinterpret_cast<const pixel_t*>(_pSrc);
...
pctDst[j * nDstPitch + i] = (pixel_t)fOut;
with the _pSrc and pDst types declared as unsigned char only by default. In your 'one for all' HBD C function you can use the pixel_t type for conditional assignment of variable types, like https://github.com/DTL2020/mvtools/b...grainN.cpp#L98

Next, declare 3 real functions of the 3 types to compile: https://github.com/DTL2020/mvtools/b...tion.cpp#L4766 . The compiler will then make 3 real copies of the function to use.

Next, in the class constructor you select the required version of the function depending on 'pixelsize': https://github.com/DTL2020/mvtools/b...VPlane.cpp#L86 . When AVS constructs the filter graph it calls the class constructors and provides the bit depth to use, so at this point you can select the required version of the function. Then call the function through its pointer at processing time: https://github.com/DTL2020/mvtools/b...Plane.cpp#L693
Last edited by DTL; 27th January 2023 at 23:17.
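The steps in this post (template function, explicit instantiations, selection in the constructor, call through a pointer) can be sketched without any AviSynth headers. Everything here is an assumption made for illustration: `invert_plane`, `InvertFn` and `InvertFilter` are hypothetical names, pitches are passed in bytes, and the buffers are assumed to be suitably aligned for the pixel type. It is not the mvtools code linked above.

```cpp
#include <cstdint>

// One signature for all bit depths: byte pointers, pitches in BYTES,
// width in pixels, height in lines (height never depends on bit depth).
using InvertFn = void (*)(const unsigned char* src, unsigned char* dst,
                          int src_pitch, int dst_pitch,
                          int width, int height, int bits);

// Generic version for integer pixel types.
template <typename pixel_t>
void invert_plane(const unsigned char* src, unsigned char* dst,
                  int src_pitch, int dst_pitch,
                  int width, int height, int bits)
{
    const int max_value = (1 << bits) - 1;
    for (int y = 0; y < height; ++y) {
        const pixel_t* s = reinterpret_cast<const pixel_t*>(src);
        pixel_t* d = reinterpret_cast<pixel_t*>(dst);
        for (int x = 0; x < width; ++x)
            d[x] = static_cast<pixel_t>(max_value - s[x]);
        src += src_pitch; // advance one row, in bytes
        dst += dst_pitch;
    }
}

// Specialization for float samples, assumed here to be in the 0..1 range.
template <>
void invert_plane<float>(const unsigned char* src, unsigned char* dst,
                         int src_pitch, int dst_pitch,
                         int width, int height, int /*bits*/)
{
    for (int y = 0; y < height; ++y) {
        const float* s = reinterpret_cast<const float*>(src);
        float* d = reinterpret_cast<float*>(dst);
        for (int x = 0; x < width; ++x)
            d[x] = 1.0f - s[x];
        src += src_pitch;
        dst += dst_pitch;
    }
}

// Explicit instantiations: force the compiler to emit the integer versions.
template void invert_plane<std::uint8_t>(const unsigned char*, unsigned char*,
                                         int, int, int, int, int);
template void invert_plane<std::uint16_t>(const unsigned char*, unsigned char*,
                                          int, int, int, int, int);

// Hypothetical filter class: the function is chosen once, at construction.
class InvertFilter {
public:
    InvertFilter(int component_size, int bits)
        : bits_(bits), fn_(select(component_size)) {}

    void process(const unsigned char* src, unsigned char* dst,
                 int src_pitch, int dst_pitch, int width, int height) const
    {
        fn_(src, dst, src_pitch, dst_pitch, width, height, bits_);
    }

private:
    static InvertFn select(int component_size)
    {
        switch (component_size) {
        case 1:  return &invert_plane<std::uint8_t>;
        case 2:  return &invert_plane<std::uint16_t>;
        default: return &invert_plane<float>;
        }
    }

    int bits_;
    InvertFn fn_;
};
```

In a real plugin the two constructor arguments would presumably come from `vi.ComponentSize()` and `vi.BitsPerComponent()`, and `process` would be called once per plane from `GetFrame`.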
 28th January 2023, 01:01 #16  |  Link Ceppo Registered User   Join Date: Feb 2016 Location: Nonsense land Posts: 331
This works with all bit depths
Code:
#include <windows.h>
#include <avisynth.h>

// One function body for all sample types; pitches and row_size are in pixels.
template<typename pixel_t>
void Invert(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits)
{
    pixel_t* dstp = reinterpret_cast<pixel_t*>(_dstp);
    const pixel_t* srcp = reinterpret_cast<const pixel_t*>(_srcp);
    if (bits == 32) {
        // float samples
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < row_size; x++) {
                dstp[x] = 1.0f - srcp[x];
            }
            dstp += dst_pitch;
            srcp += src_pitch;
        }
    }
    else {
        // integer samples, 8 to 16 bits
        int MAX = (1 << bits) - 1;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < row_size; x++) {
                dstp[x] = MAX - srcp[x];
            }
            dstp += dst_pitch;
            srcp += src_pitch;
        }
    }
}

// Explicit instantiations for the three sample types.
template void Invert<unsigned char>(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits);
template void Invert<unsigned short>(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits);
template void Invert<float>(const unsigned char* _srcp, unsigned char* _dstp, int src_pitch, int dst_pitch, int height, int row_size, int bits);

class InvertNeg : public GenericVideoFilter {
public:
    InvertNeg(PClip _child, IScriptEnvironment* env) : GenericVideoFilter(_child) { }

    PVideoFrame __stdcall GetFrame(int n, IScriptEnvironment* env) {
        PVideoFrame dst = env->NewVideoFrame(vi);
        PVideoFrame src = child->GetFrame(n, env);
        auto srcp = src->GetReadPtr(PLANAR_Y);
        auto dstp = dst->GetWritePtr(PLANAR_Y);
        auto height = src->GetHeight(PLANAR_Y);
        auto row_size = src->GetRowSize(PLANAR_Y) / vi.ComponentSize();
        auto src_pitch = src->GetPitch(PLANAR_Y) / vi.ComponentSize();
        auto dst_pitch = dst->GetPitch(PLANAR_Y) / vi.ComponentSize();
        if (vi.ComponentSize() == 1) {
            Invert<unsigned char>(srcp, dstp, src_pitch, dst_pitch, height, row_size, vi.BitsPerComponent());
        }
        if (vi.ComponentSize() == 2) {
            Invert<unsigned short>(srcp, dstp, src_pitch, dst_pitch, height, row_size, vi.BitsPerComponent());
        }
        if (vi.ComponentSize() == 4) {
            Invert<float>(srcp, dstp, src_pitch, dst_pitch, height, row_size, vi.BitsPerComponent());
        }
        return dst;
    }
};

AVSValue __cdecl Create_InvertNeg(AVSValue args, void* user_data, IScriptEnvironment* env) {
    return new InvertNeg(args[0].AsClip(), env);
}

const AVS_Linkage* AVS_linkage = 0;

extern "C" __declspec(dllexport) const char* __stdcall AvisynthPluginInit3(IScriptEnvironment* env, const AVS_Linkage* const vectors) {
    AVS_linkage = vectors;
    env->AddFunction("InvertNeg", "c", Create_InvertNeg, 0);
    return "InvertNeg sample plugin";
}
 28th January 2023, 21:37 #20  |  Link Ceppo Registered User   Join Date: Feb 2016 Location: Nonsense land Posts: 331 Thanks for the tips. I noticed here https://www.laruence.com/sse/# that for AVX and SSE some functions do not have a __m256i or __m128i version; how do you handle the 16-bit and 8-bit cases?
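One common way to deal with operations that have no 8-bit integer form (SSE2 has no 8-bit shift, for example) is to widen the bytes to 16-bit lanes, do the arithmetic there, and pack back with saturation. The sketch below is only an illustration of that pattern and not an answer taken from this thread: the function names are hypothetical, it assumes SSE2 is available, and it falls back to scalar code elsewhere. It computes a vertical 1-2-1 blur, `(a + 2*b + c + 2) >> 2`, for 16 pixels at once.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Scalar reference for one pixel.
inline std::uint8_t blur121_scalar(std::uint8_t a, std::uint8_t b, std::uint8_t c)
{
    return static_cast<std::uint8_t>((a + 2 * b + c + 2) >> 2);
}

// Processes exactly 16 8-bit pixels from three rows; unaligned access is fine.
inline void blur121_16px(const std::uint8_t* a, const std::uint8_t* b,
                         const std::uint8_t* c, std::uint8_t* out)
{
#if SKETCH_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    const __m128i two  = _mm_set1_epi16(2); // rounding bias

    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(c));

    // Low 8 bytes widened to 16-bit lanes by interleaving with zero.
    __m128i lo = _mm_add_epi16(_mm_unpacklo_epi8(va, zero),
                               _mm_unpacklo_epi8(vc, zero));
    lo = _mm_add_epi16(lo, _mm_slli_epi16(_mm_unpacklo_epi8(vb, zero), 1));
    lo = _mm_srli_epi16(_mm_add_epi16(lo, two), 2);

    // High 8 bytes, same arithmetic.
    __m128i hi = _mm_add_epi16(_mm_unpackhi_epi8(va, zero),
                               _mm_unpackhi_epi8(vc, zero));
    hi = _mm_add_epi16(hi, _mm_slli_epi16(_mm_unpackhi_epi8(vb, zero), 1));
    hi = _mm_srli_epi16(_mm_add_epi16(hi, two), 2);

    // Pack the two 16-bit halves back to 16 bytes with unsigned saturation.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_packus_epi16(lo, hi));
#else
    for (int i = 0; i < 16; ++i)
        out[i] = blur121_scalar(a[i], b[i], c[i]);
#endif
}
```

The largest intermediate value is 4 * 255 + 2 = 1022, so the 16-bit lanes cannot overflow. For 16-bit samples the same idea applies one level up: widen to 32-bit lanes when the intermediate sum can exceed 65535.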