How can I optimize this?

Hello,

The function below is a device function. It does a lot of calculations.

__device__
void fooX( unsigned char rgb[8][8], signed short int a0, signed short int a1, signed short int a2,
           signed short int a3, signed short int a4, signed short int a5, signed short int a6,
           signed short int a7, signed short int* result_x )
{
    // deltaPixel and shiftPixel are not declared in this snippet; they are defined elsewhere.
    result_x[0] = (a0 * rgb[0][0] + a1 * rgb[0][1] + a2 * rgb[0][2] + a3 * rgb[0][3] +
                   a4 * rgb[0][4] + a5 * rgb[0][5] + a6 * rgb[0][6] + a7 * rgb[0][7] + deltaPixel) >> shiftPixel;
    result_x[1] = (a0 * rgb[1][0] + a1 * rgb[1][1] + a2 * rgb[1][2] + a3 * rgb[1][3] +
                   a4 * rgb[1][4] + a5 * rgb[1][5] + a6 * rgb[1][6] + a7 * rgb[1][7] + deltaPixel) >> shiftPixel;
    result_x[2] = (a0 * rgb[2][0] + a1 * rgb[2][1] + a2 * rgb[2][2] + a3 * rgb[2][3] +
                   a4 * rgb[2][4] + a5 * rgb[2][5] + a6 * rgb[2][6] + a7 * rgb[2][7] + deltaPixel) >> shiftPixel;
    result_x[3] = (a0 * rgb[3][0] + a1 * rgb[3][1] + a2 * rgb[3][2] + a3 * rgb[3][3] +
                   a4 * rgb[3][4] + a5 * rgb[3][5] + a6 * rgb[3][6] + a7 * rgb[3][7] + deltaPixel) >> shiftPixel;
    result_x[4] = (a0 * rgb[4][0] + a1 * rgb[4][1] + a2 * rgb[4][2] + a3 * rgb[4][3] +
                   a4 * rgb[4][4] + a5 * rgb[4][5] + a6 * rgb[4][6] + a7 * rgb[4][7] + deltaPixel) >> shiftPixel;
    result_x[5] = (a0 * rgb[5][0] + a1 * rgb[5][1] + a2 * rgb[5][2] + a3 * rgb[5][3] +
                   a4 * rgb[5][4] + a5 * rgb[5][5] + a6 * rgb[5][6] + a7 * rgb[5][7] + deltaPixel) >> shiftPixel;
    result_x[6] = (a0 * rgb[6][0] + a1 * rgb[6][1] + a2 * rgb[6][2] + a3 * rgb[6][3] +
                   a4 * rgb[6][4] + a5 * rgb[6][5] + a6 * rgb[6][6] + a7 * rgb[6][7] + deltaPixel) >> shiftPixel;
    result_x[7] = (a0 * rgb[7][0] + a1 * rgb[7][1] + a2 * rgb[7][2] + a3 * rgb[7][3] +
                   a4 * rgb[7][4] + a5 * rgb[7][5] + a6 * rgb[7][6] + a7 * rgb[7][7] + deltaPixel) >> shiftPixel;
}

I’m calling the above function from a __global__ kernel like this:

__global__ void fooGlob( … )
{
    for( int i = 0; i < 500; ++i )
    {
        // call device function…
        fooX( … );
    }
}

How can I optimize the above device function?
Thanks for any help!

There are a couple of things you can try:

  • use the fast-math compiler switch (-use_fast_math for nvcc)

  • if all of a0…a7 are < 16, it is possible to do two multiplies in parallel. The fast integer multiply (__mul24) works with 24 bits, meaning you could do 2 muls with one: use 2 × (8+4) bits, where 8 bits are for the rgb value and 4 bits are for a0…a7

  • is the rgb[…] array constant over the calculation? Then you can put it in shared memory

  • if you are computing some kind of image filter where the rgb values are a square around the current position, it is better to compute that in a pixel shader. It is usually faster, as the texture cache will speed up accesses and you get some filtering “for free”.
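
To make the packed-multiply suggestion above more concrete, here is a minimal sketch, not a drop-in replacement for fooX. The helper name mulPair is hypothetical, and it assumes the coefficient is non-negative and below 16 (the original a0…a7 are signed, so this only applies if they never go negative). Two 8-bit pixels are placed in 12-bit fields of one operand, multiplied once, and the two products are then split apart again. The macro guard only lets the same file compile as plain C++ for checking on the host.

```cpp
#ifndef __CUDACC__
// Allow the sketch to compile as ordinary C++ when nvcc is not used.
#define __host__
#define __device__
#endif

// Hypothetical helper: multiplies two 8-bit pixels by the same coefficient
// with a single multiply. Assumes 0 <= a < 16, so each product is at most
// 255 * 15 = 3825 and fits in a 12-bit field (max 4095).
__host__ __device__ inline void mulPair(unsigned char p0, unsigned char p1,
                                        unsigned int a,
                                        unsigned int* out0, unsigned int* out1)
{
    // Low 12 bits hold p0, the next 12 bits hold p1 (24 bits in total).
    unsigned int packed = (unsigned int)p0 | ((unsigned int)p1 << 12);

#ifdef __CUDA_ARCH__
    // 24-bit multiply intrinsic on the device.
    unsigned int prod = __umul24(packed, a);
#else
    // Plain multiply on the host; the result is identical for these ranges.
    unsigned int prod = packed * a;
#endif

    *out0 = prod & 0xFFFu;  // a * p0
    *out1 = prod >> 12;     // a * p1
}
```

For example, mulPair(rgb[0][0], rgb[1][0], a0, &t0, &t1) would give the a0 terms of result_x[0] and result_x[1] with one multiply. The products have to be unpacked before summing, because a sum of eight 12-bit products no longer fits in a 12-bit field. Whether this is actually faster than eight plain multiplies depends on the hardware, so it is worth measuring.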

What are the fast-math functions?

Are they __fmul_rz(), __fdividef() and so on?

Or could you please list the fast-math functions here?

Sorry, I’m a newbie to CUDA.