Macro expansion bug? NVCC dying with internal error

I am trying to port the following code constructs to run on the GPU, and nvcc is dying with internal errors.

Here is the relevant code fragment from my hacked-up version of the x264 codec’s pixel.c:

#define PIXEL_SAD_C( name, lx, ly ) \
DEVICE int name( uint8_t *pix1, int i_stride_pix1,  \
                 uint8_t *pix2, int i_stride_pix2 ) \
{                                                   \
    int i_sum = 0;                                  \
    int x, y;                                       \
    for( y = 0; y < ly; y++ )                       \
    {                                               \
        for( x = 0; x < lx; x++ )                   \
        {                                           \
            i_sum += abs( pix1[x] - pix2[x] );      \
        }                                           \
        pix1 += i_stride_pix1;                      \
        pix2 += i_stride_pix2;                      \
    }                                               \
    return i_sum;                                   \
}

PIXEL_SAD_C( pixel_sad_16x16, 16, 16 )
PIXEL_SAD_C( pixel_sad_16x8,  16,  8 )
PIXEL_SAD_C( pixel_sad_8x16,   8, 16 )
PIXEL_SAD_C( pixel_sad_8x8,    8,  8 )
PIXEL_SAD_C( pixel_sad_8x4,    8,  4 )
PIXEL_SAD_C( pixel_sad_4x8,    4,  8 )
PIXEL_SAD_C( pixel_sad_4x4,    4,  4 )
PIXEL_SAD_C( pixel_sad_4x2,    4,  2 )
PIXEL_SAD_C( pixel_sad_2x4,    2,  4 )
PIXEL_SAD_C( pixel_sad_2x2,    2,  2 )

typedef int  (*x264_pixel_cmp_t) ( uint8_t *, int, uint8_t *, int );

DEVICE x264_pixel_cmp_t pixel_sad[10] = { pixel_sad_16x16,









       pixel_sad_2x2 };

And later on I call it like this:

       results[ tid ]= pixel_sad[i_pixel]( x_pixels, 


         y_pixels +  __umul24( mb_y, i_stride) + mb_x,


                 + p_cost_mvx[mb_x<<2] + p_cost_mvy[mb_y<<2];

So is this an illegal code construct? I tried to declare the macro-expanded functions as __device__ (#define DEVICE __device__), since they are only called on the GPU in this context. None of the __device__ function restrictions obviously apply here, except maybe the restriction that __device__ functions cannot have their address taken.

I know I can expand the macro by hand and declare the functions individually, but this way is more maintainable, and expanding the macro won’t get around issues with pointers to device functions (if that is the problem).



Device function pointers are illegal; see the CUDA Programming Guide.


I bet the problem is here, in trying to use a pointer to a device function. Device functions are, to my knowledge, inlined by default, so they have no address to take and you cannot dispatch through a table of them. To me it looks like your code could be rewritten without preprocessor defines, so that the constants lx and ly become actual parameters to a single function named

int pixel_sad(uint8_t * pix1, int i_stride_pix1, uint8_t * pix2, int i_stride_pix2, int lx, int ly);

with no performance penalty compared to your code.
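
A minimal sketch of that single-function version, based on the macro body from the question. It assumes DEVICE is left empty so the code builds as plain C on the host; under nvcc it would be defined as __device__. Nothing here has been checked against the 0.8 toolchain.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: empty for a host build, __device__ when compiled with nvcc. */
#ifndef DEVICE
#define DEVICE
#endif

/* Sum of absolute differences over an lx-by-ly block.
   Same body as the PIXEL_SAD_C macro, with lx and ly as run-time parameters. */
DEVICE int pixel_sad( uint8_t *pix1, int i_stride_pix1,
                      uint8_t *pix2, int i_stride_pix2,
                      int lx, int ly )
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < ly; y++ )
    {
        for( x = 0; x < lx; x++ )
        {
            i_sum += abs( pix1[x] - pix2[x] );
        }
        pix1 += i_stride_pix1;   /* advance both blocks to the next row */
        pix2 += i_stride_pix2;
    }
    return i_sum;
}
```

A call such as pixel_sad( p1, stride1, p2, stride2, 8, 4 ) would then stand in for pixel_sad_8x4( p1, stride1, p2, stride2 ).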

Calling that function sequentially with different values of lx and ly, however, probably has to be done manually, because there is no for loop unrolling in the compiler v. 0.8.


OK but that is a very poor way for the compiler to tell me that. :)

If there is no support for loop unrolling, you are quite correct. A more generalized implementation like the one you suggested would work just as well, though there might be a slight performance penalty because two more parameters have to be pushed onto the stack, and maybe more registers are used, compared with a table lookup.



Again, because of implicit device function inlining, there should be no extra stack operations for literal constants passed as function parameters, if the compiler does things right. And even if there were, the global memory accesses will most probably remain the bottleneck.
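
One way the table lookup could be replaced is a switch that calls the generic function with literal constants, so that every call is a direct call the compiler is free to inline. This is only a sketch under assumptions: the names sad_block and pixel_sad_dispatch are hypothetical, the index order (0 = 16x16 through 9 = 2x2) is assumed to follow the order of the PIXEL_SAD_C instantiations in the question, and DEVICE is left empty so it builds as plain C.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumption: empty for a host build, __device__ when compiled with nvcc. */
#ifndef DEVICE
#define DEVICE
#endif

/* Generic SAD (hypothetical name). When a call is inlined with literal
   lx/ly, the compiler can treat the loop bounds as constants. */
DEVICE static inline int sad_block( uint8_t *pix1, int i_stride_pix1,
                                    uint8_t *pix2, int i_stride_pix2,
                                    int lx, int ly )
{
    int i_sum = 0;
    int x, y;
    for( y = 0; y < ly; y++ )
    {
        for( x = 0; x < lx; x++ )
            i_sum += abs( pix1[x] - pix2[x] );
        pix1 += i_stride_pix1;
        pix2 += i_stride_pix2;
    }
    return i_sum;
}

/* Hypothetical stand-in for pixel_sad[i_pixel]( ... ): no function pointers,
   only direct calls. The index order is an assumption (see the text above). */
DEVICE int pixel_sad_dispatch( int i_pixel,
                               uint8_t *pix1, int i_stride_pix1,
                               uint8_t *pix2, int i_stride_pix2 )
{
    switch( i_pixel )
    {
        case 0: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2, 16, 16 );
        case 1: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2, 16,  8 );
        case 2: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  8, 16 );
        case 3: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  8 );
        case 4: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  8,  4 );
        case 5: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  8 );
        case 6: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  4 );
        case 7: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  4,  2 );
        case 8: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  4 );
        case 9: return sad_block( pix1, i_stride_pix1, pix2, i_stride_pix2,  2,  2 );
        default: return -1;   /* unknown block size index */
    }
}
```

The switch costs a branch on i_pixel, but if all threads in a block use the same i_pixel that branch does not diverge.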