Macro expansion bug? NVCC dying with internal error

spencer · April 11, 2007, 7:45pm

I am trying to port the following code constructs to run on the GPU, and nvcc is dying with internal errors.

Here is the relevant code fragment from my hacked-up version of the x264 codec's pixel.c:

#define PIXEL_SAD_C( name, lx, ly ) \
DEVICE int name( uint8_t *pix1, int i_stride_pix1,  \
                 uint8_t *pix2, int i_stride_pix2 ) \
{                                                   \
    int i_sum = 0;                                  \
    int x, y;                                       \
    for( y = 0; y < ly; y++ )                       \
    {                                               \
        for( x = 0; x < lx; x++ )                   \
        {                                           \
            i_sum += abs( pix1[x] - pix2[x] );      \
        }                                           \
        pix1 += i_stride_pix1;                      \
        pix2 += i_stride_pix2;                      \
    }                                               \
    return i_sum;                                   \
}

PIXEL_SAD_C( pixel_sad_16x16, 16, 16 )
PIXEL_SAD_C( pixel_sad_16x8,  16,  8 )
PIXEL_SAD_C( pixel_sad_8x16,   8, 16 )
PIXEL_SAD_C( pixel_sad_8x8,    8,  8 )
PIXEL_SAD_C( pixel_sad_8x4,    8,  4 )
PIXEL_SAD_C( pixel_sad_4x8,    4,  8 )
PIXEL_SAD_C( pixel_sad_4x4,    4,  4 )
PIXEL_SAD_C( pixel_sad_4x2,    4,  2 )
PIXEL_SAD_C( pixel_sad_2x4,    2,  4 )
PIXEL_SAD_C( pixel_sad_2x2,    2,  2 )

typedef int  (*x264_pixel_cmp_t) ( uint8_t *, int, uint8_t *, int );

DEVICE x264_pixel_cmp_t pixel_sad[10] = { pixel_sad_16x16,
       pixel_sad_16x8,
       pixel_sad_8x16,
       pixel_sad_8x8,
       pixel_sad_8x4,
       pixel_sad_4x8,
       pixel_sad_4x4,
       pixel_sad_4x2,
       pixel_sad_2x4,
       pixel_sad_2x2 };

And later on I call it like this

       results[ tid ] = pixel_sad[i_pixel]( x_pixels,
                                            FENC_STRIDE,
                                            y_pixels + __umul24( mb_y, i_stride ) + mb_x,
                                            i_stride )
                        + p_cost_mvx[mb_x<<2] + p_cost_mvy[mb_y<<2];

So is this an illegal code construct? I tried to declare the functions expanded from the macro as __device__ (#define DEVICE __device__), as they are only called on the GPU in this context. None of the device function restrictions obviously applies to this, except maybe the restriction that __device__ functions cannot have their addresses taken.

I know I can expand the macro by hand and declare the functions individually, but this way is more maintainable, and expanding the macro won't get around issues with pointers to device functions (if that is the problem).

Suggestions?

Spencer

JaredHoberock · April 11, 2007, 8:03pm

Device function pointers are illegal. See section 4.2.1.4 of the CUDA Programming Guide.

pyrtsa · April 11, 2007, 8:23pm

Hi,

I bet the problem is here, in trying to use a pointer to a device function. Device functions are, to my knowledge, inlined by default, so there is no way to split your problem like that. To me it looks like your code could be rewritten without preprocessor defines, so that the constants lx and ly become actual parameters of a single function declared as

int pixel_sad(uint8_t * pix1, int i_stride_pix1, uint8_t * pix2, int i_stride_pix2, int lx, int ly);

with no performance penalty compared to your code.

Calling that function in sequence with different values of lx and ly, however, probably has to be written out by hand, because the compiler v. 0.8 does not unroll for loops.

/Pyry
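As a rough sketch of the suggestion above (not code from the original post), the function-pointer table could be replaced by one generic routine plus a switch that passes lx and ly as literal constants. The names pixel_sad_generic, pixel_sad_dispatch and the SAD_* indices are hypothetical, and the index order is only assumed to match whatever i_pixel encodes in the caller. The DEVICE guard lets the same file build as plain C on the host; it has not been tested against nvcc 0.8.

```c
#include <stdint.h>
#include <stdlib.h>

/* Plain C on the host; under nvcc, DEVICE becomes __device__. */
#ifdef __CUDACC__
#define DEVICE __device__
#else
#define DEVICE
#endif

/* One generic SAD routine: block width lx and height ly are ordinary
   parameters instead of macro arguments. */
DEVICE int pixel_sad_generic( const uint8_t *pix1, int i_stride_pix1,
                              const uint8_t *pix2, int i_stride_pix2,
                              int lx, int ly )
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < ly; y++ )
    {
        for( x = 0; x < lx; x++ )
        {
            i_sum += abs( pix1[x] - pix2[x] );
        }
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}

/* Hypothetical block-size indices standing in for the table slots. */
enum
{
    SAD_16x16, SAD_16x8, SAD_8x16, SAD_8x8, SAD_8x4,
    SAD_4x8,   SAD_4x4,  SAD_4x2,  SAD_2x4, SAD_2x2
};

/* Switch-based dispatch: no function address is ever taken, and every
   call passes the block size as literal constants. */
DEVICE int pixel_sad_dispatch( int i_pixel,
                               const uint8_t *pix1, int i_stride_pix1,
                               const uint8_t *pix2, int i_stride_pix2 )
{
    switch( i_pixel )
    {
        case SAD_16x16: return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2, 16, 16 );
        case SAD_16x8:  return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2, 16,  8 );
        case SAD_8x16:  return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  8, 16 );
        case SAD_8x8:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  8 );
        case SAD_8x4:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  4 );
        case SAD_4x8:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  8 );
        case SAD_4x4:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  4 );
        case SAD_4x2:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  2 );
        case SAD_2x4:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  4 );
        case SAD_2x2:   return pixel_sad_generic( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  2 );
        default:        return -1; /* unknown block size */
    }
}
```

With this in place, the call from the first post would read pixel_sad_dispatch( i_pixel, x_pixels, FENC_STRIDE, ... ) instead of pixel_sad[i_pixel]( x_pixels, FENC_STRIDE, ... ). Whether the compiler folds the literal block sizes into the inlined loops depends on the toolchain, so treat any performance expectation as something to measure.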

spencer · April 11, 2007, 8:59pm

OK but that is a very poor way for the compiler to tell me that. :)

spencer · April 11, 2007, 9:04pm

If there is no support for loop unrolling, you are quite correct. A more generalized implementation like the one you suggested would work just as well, though there might be a slight performance penalty because two more parameters have to be pushed onto the stack, and maybe more registers are used, compared with a table lookup.

Regards,

Spencer

pyrtsa · April 12, 2007, 7:25am

Again, because device functions are implicitly inlined, there should be no extra stack operations for literal constants passed as function parameters, if the compiler does things right. And even if there were, global memory accesses would most probably still be the bottleneck.

/Pyry
