kernel indexing question

Hello everybody, I’m currently working on a program that does color conversions from RGB space to CIE Lab space. The kernels performing the conversion are:

[codebox]__global__ void RGBtoXYZ_GPU(float* d_r, float* d_g, float* d_b)
{
    // One thread per pixel: linear index into the planar channel arrays
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float red   = d_r[tid];
    float green = d_g[tid];
    float blue  = d_b[tid];

    // sRGB gamma expansion to linear RGB
    red   = (red   > 0.04045f) ? powf((red   + 0.055f) / 1.055f, 2.4f) : red   / 12.92f;
    green = (green > 0.04045f) ? powf((green + 0.055f) / 1.055f, 2.4f) : green / 12.92f;
    blue  = (blue  > 0.04045f) ? powf((blue  + 0.055f) / 1.055f, 2.4f) : blue  / 12.92f;

    // Linear RGB to XYZ; the results overwrite the input channels in place
    d_r[tid] = red * 0.436052025f + green * 0.385081593f + blue * 0.143087414f; // X
    d_g[tid] = red * 0.222491598f + green * 0.71688606f  + blue * 0.060621486f; // Y
    d_b[tid] = red * 0.013929122f + green * 0.097097002f + blue * 0.71418547f;  // Z
}

__global__ void XYZtoLab_GPU(float* d_X, float* d_Y, float* d_Z)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float fx, fy, fz;

    // Reference white (D50)
    float Xr = 0.964221f;
    float Yr = 1.0f;
    float Zr = 0.825211f;

    // XYZ normalized by the reference white
    float xr = d_X[tid] / Xr;
    float yr = d_Y[tid] / Yr;
    float zr = d_Z[tid] / Zr;

    fx = (xr > 0.008856451f) ? powf(xr, 1.f / 3.f) : (7.787f * xr) + 16.f / 116.f;
    fy = (yr > 0.008856451f) ? powf(yr, 1.f / 3.f) : (7.787f * yr) + 16.f / 116.f;
    fz = (zr > 0.008856451f) ? powf(zr, 1.f / 3.f) : (7.787f * zr) + 16.f / 116.f;

    // Lab values overwrite the XYZ channels in place
    d_X[tid] = (116.0f * fy) - 16.0f; // L*
    d_Y[tid] = 500.0f * (fx - fy);    // a
    d_Z[tid] = 200.0f * (fy - fz);    // b
}[/codebox]

Some normalization is then done on the new color values, but that’s another story. The first version of the functions running these kernels was a test and worked on a 512x512 RGB image: both kernels were launched with a “linear” grid of 512 blocks, each block being 512x1x1 threads (<<< 512, 512 >>>). I’m now trying to extend the approach to bigger images, and to keep mapping one pixel to one thread I need a different grid and block setup. The question is: how should I rearrange the block and grid dimensions, and how does the thread index computation in the kernel change? Let’s say I’m going to work on images much bigger than 512x512 in PPM format, and I’m stuck using device emulation at the moment.

Please be patient since this is my first time programming in CUDA. Thank you very much!


You can run up to 65535 blocks with just a 1D grid, so you should be able to handle images up to about 5700x5700 (sqrt(65535*512) ≈ 5792) without any changes to your current code. To go even larger, just use a 2D grid, which will only slightly increase the complexity of your indexing. You’ll probably also want to add an “if (tid >= num_pixels) return;” so that you don’t have threads reading or writing past the end of your allocated image.
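To make that suggestion concrete, here is a minimal sketch of the indexing, assuming 512 threads per block and a 65535-blocks-per-dimension grid limit as in this thread. All names here (blocksNeeded, chooseGrid, scalePixels, scaleImage, num_pixels as a kernel argument) are hypothetical, and the kernel is a stand-in that only scales values; the same index computation and guard could be dropped into RGBtoXYZ_GPU and XYZtoLab_GPU.

```cuda
#include <cuda_runtime.h>

// Assumed limits, taken from the numbers quoted in this thread.
const unsigned int THREADS_PER_BLOCK = 512;    // block size used by the original launch
const unsigned int MAX_GRID_DIM      = 65535;  // blocks allowed per grid dimension

// Number of blocks needed to cover num_pixels, rounded up.
unsigned int blocksNeeded(unsigned int num_pixels)
{
    return (num_pixels + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
}

// Use a 1D grid while it fits; otherwise spread the blocks over a second
// grid dimension. The 2D grid may contain more blocks than strictly needed.
dim3 chooseGrid(unsigned int num_pixels)
{
    unsigned int blocks = blocksNeeded(num_pixels);
    if (blocks <= MAX_GRID_DIM)
        return dim3(blocks, 1, 1);
    unsigned int rows = (blocks + MAX_GRID_DIM - 1) / MAX_GRID_DIM;
    return dim3(MAX_GRID_DIM, rows, 1);
}

// Stand-in kernel (hypothetical): one thread per pixel, linear pixel index.
__global__ void scalePixels(float* d_data, unsigned int num_pixels, float factor)
{
    // Flatten the (possibly 2D) grid into a linear block index first.
    const unsigned int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    const unsigned int tid     = blockId * blockDim.x + threadIdx.x;

    // Surplus threads in the trailing blocks must not touch memory.
    if (tid >= num_pixels) return;

    d_data[tid] *= factor;
}

// Host-side launch for a width x height image already copied to the device.
void scaleImage(float* d_data, unsigned int width, unsigned int height, float factor)
{
    const unsigned int num_pixels = width * height;
    scalePixels<<<chooseGrid(num_pixels), THREADS_PER_BLOCK>>>(d_data, num_pixels, factor);
    cudaDeviceSynchronize();
}
```

With these assumptions, a 2000x2000 image gets a 7813x1 grid, and an 8000x8000 image gets a 65535x2 grid. When blockIdx.y is always 0 the index reduces to the original blockIdx.x * blockDim.x + threadIdx.x, so the same kernel serves both cases.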

Thank you very much, I followed your suggestion and everything worked just fine. The images I usually process are about 2000x2000 on average, so the linear grid solution suits my needs.

And sorry for the double post; I just didn’t know which was the right section to post my question in. I won’t make this mistake again.

Thanks again.