kernel memory allocation tenets

Is there a set of general rules as to how to allocate memory for the kernel?

Right now, I am simply taking a buffer of floats (an RGB image with a given width and height), copying it to device memory, running an empty kernel on the buffer, and then deallocating. I am getting a cudaError when invoking the kernel. The memory setup looks like this:

extern "C" void runMemLoadandUnload(float* pdata, int imageWidth, int imageHeight)
{
    float* devPtr;

    CUT_DEVICE_INIT();

    CUDA_SAFE_CALL(cudaMalloc((void **)&devPtr, imageWidth*imageHeight*sizeof(float)*3));
    CUDA_SAFE_CALL(cudaMemcpy(devPtr, pdata, imageWidth*imageHeight*sizeof(float)*3, cudaMemcpyHostToDevice));

    const unsigned int num_threads = (imageWidth*imageHeight)/4;
    CUT_CONDITION(0 == ((imageWidth*imageHeight) % 4));

    dim3 grid(1, 1, 1);
    dim3 threads(num_threads, 1, 1);

    flipBitsKernel<<< grid, threads >>>((float *)devPtr);

    CUDA_SAFE_CALL(cudaMemcpy(pdata, devPtr, imageWidth*imageHeight*sizeof(float)*3, cudaMemcpyDeviceToHost));
    CUDA_SAFE_CALL(cudaFree(devPtr));
}

The kernel call portion yields the following output during runtime:

I am using cppIntegration from the SDK as my base code, since I am using C++ for the GDAL portion of the code. I think I am getting the error because I did not allocate the memory correctly, but I haven't found any resource that lays out the general rules for allocating kernel memory.

Any suggestions?

Do you know which call is throwing the error? Nothing in your code looks obviously wrong, unless your image is big enough that num_threads > 512, or you don’t have enough memory on your CUDA device to store your image data. (The latter is unlikely unless you happen to have two CUDA devices, one with very little memory, and CUT_DEVICE_INIT() is picking the wrong device.)

Hi seibert, The image is 2048x2048. I uncommented the lines sequentially and the only time I got the error was when I uncommented the following code.

flipBitsKernel<<< grid, threads >>>((float *)devPtr);

Should I be setting the number of threads to imageHeight * imageWidth * 3 (3 for each float value in an RGB pixel)?

Is the general rule for allocating memory to be one thread per value? So if you have three floats per pixel, number of threads should be one thread per float?

Hence I was wondering if there is a general set of rules for choosing the thread/block configuration.

There’s nothing wrong with your memory allocation. The problem is that you are requesting your kernel be run with only one block that has (2048*2048)/4 threads. The maximum number of threads per block is 512, and also, running only one block underutilizes the card. You need to use a grid configuration that has more blocks and fewer threads per block.

(Note that threads is setting the number of threads per block, and not the total number of threads.)

Thank you, I’ll go ahead and give that a try.

Hi seibert, tried it out today…works wonderfully. Thank you very much for explaining the concept.

For the curious, I found an earlier thread that attempted to address this. Sorry about the duplicate post…didn’t find this thread the first time around.

http://forums.nvidia.com/index.php?showtopic=42642