Doom9's Forum - View Single Post

DTL · 12th February 2023, 13:06

Moved from the 'getting latest' thread: about getting better performance from MPEG encoders (including all x26x projects) on 'big' and 'large' architectures like AVX2 and AVX512 by redesigning the program from single-block processing to multi-block processing.
An I-frames-only example:
I got the C sources from https://github.com/ShiftMediaProject...master/encoder and they build with Visual Studio 2017. Other versions (including jpsdr's) look like they are not compatible with MSVS.
After some profiling I see that a significant amount of time is spent in the intra_satd_x9_4x4() function, and I tracked its call stack up to the walk over all macroblocks:
It is the loop in the encoder.c file: https://github.com/ShiftMediaProject...ncoder.c#L2812 . It walks through all macroblocks of the frame one by one, by rows and columns (advancing the MB number as mb_xy = i_mb_x + i_mb_y * h->mb.i_mb_width; where i_mb_x is the current x-position in the MB array and i_mb_y is the current y-position).
So the practical 'workunit' size for each pass of the 'macro loop' is only 1 macroblock. If the macroblock is 16x16, the CPU core executing this thread has a workunit of only 16x16 samples (assume 10-bit processing with 16-bit values per sample): 16x16x(2 bytes per sample) = 512 bytes. That is too little for a CPU capable of processing workunits of up to kilobytes.

So to make this part of the encoder faster we need to redesign this 'macro loop' and all downstream functions to process several macroblocks in a single pass. But it is not a very easy task, and if not all macroblocks in a 'group pass' are processed the same way, it needs some more branching (like a fallback to single-macroblock processing if one macroblock's processing differs from the others).

The final 'macroloop' advance at https://github.com/ShiftMediaProject...ncoder.c#L3068 will then not be the simple +1 advance
i_mb_x++; (for progressive encode mode)
but
i_mb_x+=num_macroblocks_per_pass;

But redesigning the program for this simple 'internal parallelism to use SIMD' may take lots of time.
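
A rough sketch of what such a group advance with a single-macroblock fallback could look like. Everything here is hypothetical and not x264 code: walk_mb_row(), group_is_uniform(), the mb_kind[] tags (standing in for 'this macroblock is processed the same way as its neighbours') and the mb_done[] counters are only for illustration, and the real loop carries much more state.

```c
#include <assert.h>

/* Hypothetical check: do the `count` macroblocks starting at `first`
   all carry the same processing tag? */
static int group_is_uniform(const int *mb_kind, int first, int count)
{
    for (int i = 1; i < count; i++)
        if (mb_kind[first + i] != mb_kind[first])
            return 0;
    return 1;
}

/* Walk one macroblock row. Takes a group of num_macroblocks_per_pass
   macroblocks when enough are left and they are uniform, otherwise falls
   back to a single macroblock. Returns the number of loop passes and
   counts in mb_done[] how often each macroblock was processed. */
static int walk_mb_row(const int *mb_kind, int i_mb_width,
                       int num_macroblocks_per_pass, int *mb_done)
{
    int passes = 0;
    assert(i_mb_width >= 0 && num_macroblocks_per_pass >= 1);

    for (int i_mb_x = 0; i_mb_x < i_mb_width; passes++) {
        int left = i_mb_width - i_mb_x;
        if (left >= num_macroblocks_per_pass &&
            group_is_uniform(mb_kind, i_mb_x, num_macroblocks_per_pass)) {
            /* a multi-block SIMD kernel would be called here */
            for (int i = 0; i < num_macroblocks_per_pass; i++)
                mb_done[i_mb_x + i]++;
            i_mb_x += num_macroblocks_per_pass;
        } else {
            /* fallback: the existing single-macroblock path */
            mb_done[i_mb_x]++;
            i_mb_x++;
        }
    }
    return passes;
}
```

With a row of 10 macroblocks, groups of 4 and one 'different' macroblock at position 5, this takes 4 passes instead of 10, and every macroblock is still processed exactly once.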

A fixing example closer to reality: when processing a 16x16 macroblock with partitions down to 4x4, it splits the macroblock into a 4x4 grid of 4x4 blocks and checks several predictors for each 4x4 block. So this is the much smaller loop at https://github.com/ShiftMediaProject...analyse.c#L924

Code:

                    for( ; *predict_mode >= 0; predict_mode++ )
                    {
                        int i_satd;
                        int i_mode = *predict_mode;

                        if( h->mb.b_lossless )
                            x264_predict_lossless_4x4( h, p_dst_by, 0, idx, i_mode );
                        else
                            h->predict_4x4[i_mode]( p_dst_by );

                        i_satd = h->pixf.mbcmp[PIXEL_4x4]( p_src_by, FENC_STRIDE, p_dst_by, FDEC_STRIDE );
                        if( i_pred_mode == x264_mb_pred_mode4x4_fix(i_mode) )
                        {
                            i_satd -= lambda * 3;
                            if( i_satd <= 0 )
                            {
                                i_best = i_satd;
                                a->i_predict4x4[idx] = i_mode;
                                break;
                            }
                        }

                        COPY2_IF_LT( i_best, i_satd, a->i_predict4x4[idx], i_mode );
                    }

where h->pixf.mbcmp[PIXEL_4x4]( p_src_by, FENC_STRIDE, p_dst_by, FDEC_STRIDE ); is a call to the single-block SATD (or SAD, depending on options?) of two 4x4 blocks (typically an assembly function for each SIMD family). The number of loop spins is typically the number of non-negative members in the vector pointed to by predict_mode (about 3 or 4).

When running on very old architectures like SSE(2), the two 4x4 16-bit blocks for one SATD calculation take 64 bytes to load, and 32-bit x86 SSE2 has only 8 SIMD registers of 128 bits (16 bytes) each, so a register file of 128 bytes in total. One pair of blocks takes about half of the register file, and there is close to no space left for intermediate values if you try to load 2 pairs of blocks. So this implementation is 'internally limited' to the SSE2 32-bit build target architecture. It is optimal for speed on that architecture because at each iteration it can break on the condition i_satd <= 0, skip some predictors and save some time.
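
The register budget can be checked with a few lines of arithmetic. This is only a sketch: the helper names are made up, the register counts are the architectural ones (8 XMM in 32-bit mode, 16 YMM in x64 AVX2, 32 ZMM in AVX-512), and it ignores the registers needed for intermediate values, so the real usable number is lower.

```c
#include <assert.h>

enum { BLOCK_4X4_SAMPLES = 16, BYTES_PER_SAMPLE = 2 };

/* One SATD needs a source block and a predicted block: 2 * 16 * 2 = 64 bytes. */
static int block_pair_bytes(void)
{
    return 2 * BLOCK_4X4_SAMPLES * BYTES_PER_SAMPLE;
}

/* Total size of the SIMD register file in bytes. */
static int register_file_bytes(int num_regs, int reg_bits)
{
    assert(num_regs > 0 && reg_bits > 0 && reg_bits % 8 == 0);
    return num_regs * (reg_bits / 8);
}

/* Upper bound on the number of 4x4 16-bit block pairs held in registers at once. */
static int pairs_that_fit(int num_regs, int reg_bits)
{
    return register_file_bytes(num_regs, reg_bits) / block_pair_bytes();
}
```

pairs_that_fit(8, 128) gives 2 for 32-bit SSE2 (the whole register file, nothing left for temporaries), pairs_that_fit(16, 256) gives 8 for x64 AVX2 and pairs_that_fit(32, 512) gives 32 for AVX-512.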

On the larger architectures it is possible to compute more SATDs of 4x4 16-bit block pairs in a single SIMD pass. So this program block may be rearranged to do more SATD computing per pass using a new multi-block SATD SIMD function, and the loop may be changed to process groups of predictors (typically a single pass when using up to 4 predictors). After the single SIMD function call, analyse the vector of SATD values and select the minimal i_satd (this can also be attempted with the SIMD horizontal-minimum intrinsic _mm_minpos_epu16() from the SSE4.1 set if the SATD is not greater than a 16-bit unsigned value; unfortunately there is no 32-bit version of this nice instruction even in the AVX512 set). But this new program block needs to be guarded by an 'architecture' if() block, like only for AVX2 and x64 or larger, and it makes the total program text bigger and harder to understand (and debug and support and so on).
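
A portable C sketch of that rearranged control flow, under stated assumptions. satd_4x4_batch() is a plain-C stand-in for the proposed multi-block SIMD function, min_index() stands in for the horizontal minimum (lowest index wins on ties, as with _mm_minpos_epu16()), and pick_best_mode() with its parameters is hypothetical, not an x264 function. The halving of the Hadamard sum is just the normalisation chosen for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_MODES 4  /* assumed group size: up to 4 predictors per pass */

/* Scalar 4x4 SATD: 2D Hadamard transform of the difference, sum of abs, halved. */
static int satd_4x4(const uint16_t *src, int src_stride,
                    const uint16_t *pred, int pred_stride)
{
    int t[4][4], sum = 0;
    for (int y = 0; y < 4; y++) {          /* horizontal butterflies */
        int d0 = src[y * src_stride + 0] - pred[y * pred_stride + 0];
        int d1 = src[y * src_stride + 1] - pred[y * pred_stride + 1];
        int d2 = src[y * src_stride + 2] - pred[y * pred_stride + 2];
        int d3 = src[y * src_stride + 3] - pred[y * pred_stride + 3];
        int a0 = d0 + d1, a1 = d0 - d1, a2 = d2 + d3, a3 = d2 - d3;
        t[y][0] = a0 + a2; t[y][1] = a1 + a3;
        t[y][2] = a0 - a2; t[y][3] = a1 - a3;
    }
    for (int x = 0; x < 4; x++) {          /* vertical butterflies */
        int a0 = t[0][x] + t[1][x], a1 = t[0][x] - t[1][x];
        int a2 = t[2][x] + t[3][x], a3 = t[2][x] - t[3][x];
        sum += abs(a0 + a2) + abs(a1 + a3) + abs(a0 - a2) + abs(a1 - a3);
    }
    return sum / 2;
}

/* Stand-in for the multi-block kernel: one call, `count` SATD results. */
static void satd_4x4_batch(const uint16_t *src, int src_stride,
                           const uint16_t *const *preds, int pred_stride,
                           int count, int *out)
{
    for (int i = 0; i < count; i++)
        out[i] = satd_4x4(src, src_stride, preds[i], pred_stride);
}

/* Index of the minimum; the lowest index wins on ties. */
static int min_index(const int *v, int count)
{
    int best = 0;
    for (int i = 1; i < count; i++)
        if (v[i] < v[best])
            best = i;
    return best;
}

/* All SATDs first, then the lambda bias for the favoured mode, then one min search. */
static int pick_best_mode(const uint16_t *src, int src_stride,
                          const uint16_t *const *preds, int pred_stride,
                          const int *modes, int count,
                          int favoured_mode, int lambda, int *best_cost)
{
    int satd[MAX_MODES];
    assert(count >= 1 && count <= MAX_MODES);
    satd_4x4_batch(src, src_stride, preds, pred_stride, count, satd);
    for (int i = 0; i < count; i++)
        if (modes[i] == favoured_mode)
            satd[i] -= lambda * 3;
    int best = min_index(satd, count);
    *best_cost = satd[best];
    return modes[best];
}
```

Note the trade-off against the original loop: there is no early break on i_satd <= 0 any more, so all predictors in the group are always computed, and ties may resolve to a different mode than in the single-block loop.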
