Hello all, I’m new to this forum, this is my first post :rolleyes:
I’m new to CUDA programming too, and I’ve gotten very interested in CUDA for image processing. My first attempt is a simple threshold filter for a 24-bit bitmap image.
My first .cu file is:
#ifndef _THRESHOLD_KERNEL_H_
#define _THRESHOLD_KERNEL_H_

// 24-bit RGB pixel (BMP byte order: B, G, R)
typedef struct
{
    unsigned char B;
    unsigned char G;
    unsigned char R;
} RGB24;

#define BLOCK_DIM 16

///////////////////////////////////////////////////////////////////////////////////////////////////
// Threshold filter with 24-bit input
///////////////////////////////////////////////////////////////////////////////////////////////////
// *idata    -> input/output bitmap pointer
// width     -> bitmap width in pixels
// height    -> bitmap height in pixels
// stride    -> bitmap row size in bytes
// threshold -> threshold value, normalized from 0 to 25500 = 255 * (30 + 59 + 11),
//              so we avoid using floats
// back      -> new color for pixels <= threshold
// object    -> new color for pixels > threshold
///////////////////////////////////////////////////////////////////////////////////////////////////
extern "C" __global__ void thresholdGreyscale(unsigned char *idata, unsigned int width, unsigned int height, unsigned int stride, unsigned int threshold, RGB24 back, RGB24 object)
{
    // Compute this thread's global X, Y from the block and thread indices
    unsigned int yIndex = (blockIdx.y * blockDim.y) + threadIdx.y;
    unsigned int xIndex = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Only process threads that fall inside the image
    if ((xIndex < width) && (yIndex < height))
    {
        // Byte offset of this pixel in the array (rows are stride bytes apart)
        unsigned int idx = (yIndex * stride) + (xIndex * 3);

        // Reinterpret the three bytes at idx as one pixel
        RGB24 *cor = (RGB24 *)&idata[idx];

        // Compare the weighted greyscale value against the threshold
        if ((cor->R * 30 + cor->G * 59 + cor->B * 11) > threshold)
        {
            // Write color for values above the threshold
            *cor = object;
        }
        else
        {
            // Write color for values at or below the threshold
            *cor = back;
        }
    }
}

#endif // _THRESHOLD_KERNEL_H_
*The INPUT is a byte array, not an RGB24 array, because in memory a 24-bit bitmap has a stride (the number of bytes per scanline): rows are padded to 4-byte boundaries while pixels take 3 bytes each.
**The INPUT buffer is also the OUTPUT buffer; the filter runs in place.
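For reference, here is a minimal sketch of how that stride is usually computed for a 24-bit bitmap, assuming the standard 4-byte scanline alignment (the variable names are just illustrative):

// Raw row size is 3 bytes per pixel; round up to a multiple of 4
unsigned int rowBytes = width * 3;
unsigned int stride = (rowBytes + 3) & ~3u; // e.g. width 5 -> rowBytes 15 -> stride 16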
This code is working, but it is 2x slower than the CPU (Core2Quad, Win7 64 bits, GeForce 8400 GS, C# Visual Studio 2010 + CUDA.NET 3.0). I tested on a GTX 285 too, but it was only a few ms faster than the CPU. I also tried using shared memory with BLOCK_DIM*BLOCK_DIM blocks, but the speed stayed the same.
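In plain C for CUDA, the launch would look roughly like the sketch below (my actual harness is CUDA.NET from C#, so runThreshold, d_img, h_img and the event timing here are illustrative assumptions, not my exact code); I include CUDA events so the kernel time can be separated from the host/device copy time:

#include <cuda_runtime.h>

// Hypothetical host-side harness, assuming it lives in the same .cu as the kernel above
void runThreshold(unsigned char *h_img, unsigned int width, unsigned int height,
                  unsigned int stride, unsigned int threshold, RGB24 back, RGB24 object)
{
    unsigned char *d_img;
    size_t imgBytes = (size_t)stride * height;
    cudaMalloc((void **)&d_img, imgBytes);
    cudaMemcpy(d_img, h_img, imgBytes, cudaMemcpyHostToDevice);

    // One thread per pixel, rounding the grid up to cover the whole image
    dim3 block(BLOCK_DIM, BLOCK_DIM);
    dim3 grid((width + BLOCK_DIM - 1) / BLOCK_DIM, (height + BLOCK_DIM - 1) / BLOCK_DIM);

    // Time only the kernel with CUDA events, keeping copy time separate
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    thresholdGreyscale<<<grid, block>>>(d_img, width, height, stride, threshold, back, object);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float kernelMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, start, stop);

    cudaMemcpy(h_img, d_img, imgBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_img);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}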
I know I’m doing something wrong, but I don’t know what it is. Any ideas?
Thanks,
Willian