Hello all, this is my first post to the forum, and not likely my last =)
I have started trying to learn CUDA and GPU programming in general. My background is mostly in MPI and C programming.
I have been trying to figure out how to make what I thought would be a simple kernel, but I am having some issues getting my thought process straight on it.
According to my deviceQuery output, my GPU has 16 multiprocessors with 32 cores each, maximum block dimensions of 1024x1024x64, and a max of 1024 threads per block.
So, I am working on processing some large images, maybe 5000px x 3500px or something like that. One of my kernels takes an average of some values across all pixels in the image.
The existing code has the images stored as a 2D array [rows][cols]. So that kernel looks like you’d expect, with a loop over rows, a loop over cols, and the calculation in the middle.
So how do I set up the dimension calculation portion of this code? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn’t have any mention of how to set up the number of blocks and threads when you have something 2D.
I am thinking I’d actually need to set it up like so, and this is where I’d like someone to chime in and help:
[indent]num_threads = 1024;
blocksX = num_cols / sqrt(num_threads);
blocksY = num_rows / sqrt(num_threads);
num_blocks = (num_rows * num_cols) / (blocksX * blocksY);

dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);[/indent]
Does this seem to make sense?
How would I then access a particular row r and column c in the kernel? In the CUDA Programming Guide I found the following code:
[indent]// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}[/indent]
Which looks similar to how you’d use malloc in C to declare a 2D array, but it doesn’t say anything about accessing that array from your own kernel. I guess in my code I’d use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
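For my own notes, I’m guessing the host-to-device copy would use cudaMemcpy2D (since the device rows are pitched), and that in my own kernel each thread would compute its (r, c) from blockIdx/threadIdx instead of looping over the whole image like the guide’s example does. Totally unsure, but something like this sketch (h_img is my assumed host image pointer, stored row-major):

```cuda
// Sketch only -- my guess, not tested.
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < height && c < width)   // guard against the partial edge tiles
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        float element = row[c];
        // ... do the per-pixel calculation with element here ...
    }
}

// Host side, after cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height):
//   cudaMemcpy2D(devPtr, pitch,                       // dst and its pitch
//                h_img, width * sizeof(float),        // src and its pitch
//                width * sizeof(float), height,       // width in bytes, rows
//                cudaMemcpyHostToDevice);
```

Is that roughly how the pitch is supposed to be used?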
Any tips appreciated! Thanks!