Hello everybody, I’m currently working on a program that converts colors from RGB space to CIE Lab space; the kernels performing the conversion are:

[codebox]__global__ void RGBtoXYZ_GPU(float* d_r, float* d_g, float* d_b)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float red = d_r[tid];
    float green = d_g[tid];
    float blue = d_b[tid];

    red = (red > 0.04045f) ? pow((red + 0.055f)/1.055f, 2.4f) : red / 12.92f;
    green = (green > 0.04045f) ? pow((green + 0.055f)/1.055f, 2.4f) : green / 12.92f;
    blue = (blue > 0.04045f) ? pow((blue + 0.055f)/1.055f, 2.4f) : blue / 12.92f;

    d_r[tid] = red * 0.436052025f + green * 0.385081593f + blue * 0.143087414f; //X
    d_g[tid] = red * 0.222491598f + green * 0.71688606f + blue * 0.060621486f; //Y
    d_b[tid] = red * 0.013929122f + green * 0.097097002f + blue * 0.71418547f; //Z
}

__global__ void XYZtoLab_GPU(float* d_X, float* d_Y, float* d_Z)
{
    const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;

    float fx, fy, fz;

    //Reference white (D50)
    float Xr = 0.964221f;
    float Yr = 1.0f;
    float Zr = 0.825211f;

    float xr = d_X[tid]/Xr;
    float yr = d_Y[tid]/Yr;
    float zr = d_Z[tid]/Zr;

    fx = (xr > 0.008856451f) ? pow(xr, 1.f/3.f) : (7.787f * xr) + 16.f / 116.f;
    fy = (yr > 0.008856451f) ? pow(yr, 1.f/3.f) : (7.787f * yr) + 16.f / 116.f;
    fz = (zr > 0.008856451f) ? pow(zr, 1.f/3.f) : (7.787f * zr) + 16.f / 116.f;

    d_X[tid] = (116.0f * fy) - 16; //L*
    d_Y[tid] = 500 * (fx - fy); //a
    d_Z[tid] = 200 * (fy - fz); //b
}
[/codebox]

Some normalization is then applied to the new color values, but that’s another story. The first version of the host functions that launch these kernels was a test and worked on a 512x512 RGB image: both kernels were launched with a “linear” grid of 512 blocks, each block being 512x1x1 threads (<<< 512, 512 >>>), so there was exactly one thread per pixel. I’m now trying to extend the approach to bigger images, and to keep mapping one pixel to one thread I need a different grid and block setup. The question is: how should I rearrange the block and grid dimensions, and how does the thread indexing in the kernel change? Assume the images are in PPM format and much bigger than 512x512. I’m also stuck using device emulation at the moment.
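To make the question concrete, one arrangement that might work is a 2D grid of 16x16 blocks, with the grid size rounded up and a bounds check in the kernel for images whose sides are not multiples of 16. This is only a sketch under assumptions: the three channels are stored as separate row-major float arrays with no row padding, and the names `RGBtoXYZ_2D`, `srgbToLinear`, `divUp`, `launchRGBtoXYZ`, `width` and `height` are hypothetical and not part of my current code.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: integer division rounded up, used to size the grid.
__host__ __device__ inline unsigned int divUp(unsigned int a, unsigned int b)
{
    return (a + b - 1) / b;
}

// Hypothetical helper: the same sRGB gamma expansion as in RGBtoXYZ_GPU.
__device__ inline float srgbToLinear(float c)
{
    return (c > 0.04045f) ? powf((c + 0.055f) / 1.055f, 2.4f) : c / 12.92f;
}

// Sketch of RGBtoXYZ_GPU indexed over a 2D grid.
// Assumes planar, row-major channels of width * height floats each.
__global__ void RGBtoXYZ_2D(float* d_r, float* d_g, float* d_b,
                            unsigned int width, unsigned int height)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x; // column
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y; // row

    // Threads in the padded right/bottom edge blocks have no pixel to process.
    if (x >= width || y >= height) return;

    const unsigned int tid = y * width + x; // row-major pixel index

    const float red   = srgbToLinear(d_r[tid]);
    const float green = srgbToLinear(d_g[tid]);
    const float blue  = srgbToLinear(d_b[tid]);

    d_r[tid] = red * 0.436052025f + green * 0.385081593f + blue * 0.143087414f; // X
    d_g[tid] = red * 0.222491598f + green * 0.71688606f  + blue * 0.060621486f; // Y
    d_b[tid] = red * 0.013929122f + green * 0.097097002f + blue * 0.71418547f;  // Z
}

// Hypothetical launcher: one thread per pixel, 256 threads per block.
void launchRGBtoXYZ(float* d_r, float* d_g, float* d_b,
                    unsigned int width, unsigned int height)
{
    assert(width > 0 && height > 0);
    const dim3 block(16, 16);
    const dim3 grid(divUp(width, block.x), divUp(height, block.y));
    RGBtoXYZ_2D<<<grid, block>>>(d_r, d_g, d_b, width, height);
}
```

With this layout the kernel no longer depends on the image being exactly 512x512, and each grid dimension only grows with one side of the image rather than with the total pixel count. XYZtoLab_GPU could presumably be changed the same way.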

Please be patient since this is my first time programming in CUDA. Thank you very much!

A.