trouble learning how to set block and max thread size

Hello all, this is my first post to the forum, and not likely my last =)

I have started trying to learn CUDA and GPU programming in general. My background is mostly in MPI and C programming.

I have been trying to figure out how to make what I thought would be a simple kernel, but I am having some issues getting my thought process straight on it.

According to my deviceQuery output, my GPU has 16 MPs with 32 cores/MP, the maximum block dimensions are 1024x1024x64, and the maximum number of threads/block is 1024.

So, I am working on processing some large images, maybe 5000 px x 3500 px or something like that. One of my kernels takes an average of some values across all pixels in the image.

The existing code has the images stored as a 2D array [rows][cols]. So that kernel looks like you'd expect: a loop over rows and a loop over cols, with the calculation in the middle.

So how do I set up the dimension calculation portion of this code? I have looked at the reduction code in the SDK, but that is for a single-dimension array. It doesn't mention how to set up the number of blocks and threads when you have something 2D.

I am thinking I’d actually need to set it up like so, and this is where I’d like someone to chime in and help:

num_threads=1024;
blocksX = num_cols/sqrt(num_threads);
blocksY = num_rows/sqrt(num_threads);
num_blocks = (num_rows*num_cols)/(blocksX*blocksY);

dim3 dimBlock(blocksX, blocksY, 1);
dim3 dimGrid(num_blocks, 1, 1);

Does this seem to make sense?

How would I then access a particular row r and column c in the kernel? In the CUDA programming guide I found the following code:
[indent]
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch, width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr, size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r)
    {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c)
        {
            float element = row[c];
        }
    }
}[/indent]

Which looks similar to how you'd use malloc in C to declare a 2D array, but it doesn't show how to index that array from the threads of my own kernel. I guess in my code, I will use that cudaMallocPitch call, and then perform a memcpy to get my data into the 2D array on the device?
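In other words, I am guessing the host-side setup would look something like the sketch below (completely untested, just to show what I mean; h_image, d_image, num_rows, and num_cols are my own placeholder names):

[indent]int num_rows = 3500, num_cols = 5000;                  // roughly my image size
float *h_image = (float *)malloc(num_rows * num_cols * sizeof(float)); // stand-in for my existing row-major image

float *d_image;
size_t pitch;
cudaMallocPitch((void **)&d_image, &pitch, num_cols * sizeof(float), num_rows);

// Is cudaMemcpy2D the right way to move a row-major host image into the pitched device array?
cudaMemcpy2D(d_image, pitch,                           // destination and its pitch in bytes
             h_image, num_cols * sizeof(float),        // source and its row stride in bytes
             num_cols * sizeof(float), num_rows,       // width in bytes, height in rows
             cudaMemcpyHostToDevice);[/indent]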

Any tips appreciated! Thanks!

crickets…no views, no replies at all?

Welcome to the forum! Sorry to let it go unanswered for so long. You are quite spot on anyway, which might explain why your post did not attract comments so far…

There are few hard rules; much depends on the individual problem at hand, and the optimal total number of threads per block is often found by trial and error. But a few common guidelines apply:

Unless your kernel has been carefully designed to exploit instruction-level parallelism, you want at least 192 threads on each MP for compute capability 1.x devices so that latency can be hidden by scheduling other threads. For 2.0 devices the number is approximately 576; for 2.1 devices it can be up to 864. Additional threads beyond these numbers bring less benefit, but they still help hide memory latency.

Having more than one block per SM gives a slight advantage, as it allows work from other blocks to be scheduled during __syncthreads() and helps even out memory accesses, which tend to come in bunches within each block. On the other hand, having more than one block per SM reduces the amount of shared memory and cache available to each block.

Keep the number of threads per block a multiple of 32 as you did, so that no threads in a warp are wasted. Keeping it a multiple of 64 is slightly advantageous as the scheduler seems to be optimized for that.

How you arrange these threads in x, y, and z is mostly for your convenience. Keeping dimBlock.x a multiple of 16 (32 for compute capability 2.x) often gives you good coalesced memory access patterns. Having dimBlock.y and dimBlock.z larger than 1 usually has no merit unless one of the following applies (see the sketch after the list for a typical 2D configuration):

    [*]You explicitly block data in shared memory that can be reused.

    [*]You do the same implicitly using the cache on 2.x devices.

    [*]You use 2d textures so that 2d locality will improve the texture cache hit rate.

    [*]You need them to achieve the desired block size while you want to keep dimBlock.x small for some other reason.
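
To make that concrete, here is a minimal sketch of a one-thread-per-pixel kernel and its launch configuration (untested; myKernel, d_image, width and height are placeholder names, and the 32x8 block shape is just one reasonable choice):

[indent]__global__ void myKernel(float *devPtr, size_t pitch, int width, int height)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // row

    if (c < width && r < height)                     // guard: the grid is rounded up past the image edge
    {
        float *row = (float *)((char *)devPtr + r * pitch);
        float element = row[c];
        // ... per-pixel work with element ...
    }
}

// Host code (assuming the pitched d_image allocation from your post)
dim3 dimBlock(32, 8, 1);                             // 256 threads; dimBlock.x is a multiple of 32
dim3 dimGrid((width  + dimBlock.x - 1) / dimBlock.x, // round up so every pixel is covered
             (height + dimBlock.y - 1) / dimBlock.y,
             1);
myKernel<<<dimGrid, dimBlock>>>(d_image, pitch, width, height);[/indent]

Each thread computes its own row r and column c from blockIdx, blockDim and threadIdx, which also answers your question about accessing a particular element inside the kernel.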

EDIT: (consider warp-wide memory access on Fermi)

Hello, and thank you very much for the insight. I do have a couple of follow-up questions.

Do you think, if I am doing something like a 2D reduction (an average of the values in the matrix, let's say), I am better off keeping that 2D array, or using some of the CUDA API to copy the 2D array into a 1D array so I can use something like the 1D array reduction found in the SDK?

Since I read that blog post, I have seen better examples that do not waste threads as the algorithm progresses. It does seem like reductions in more than one dimension are missing from the SDK, or perhaps I am missing them. Do you know of any resources with examples for dealing with data in 2D arrays that really spell it out?

You say I should probably use a dimBlock.x and keep the y and z components set to one. Why is that? Is it possible to use a block that has the y and z dimensions set to one when I am working on a 2D C array?

[font=“Courier New”]cudaMallocPitch()[/font] is just a helper function that appropriately rounds up the image width before calling [font=“Courier New”]cudaMalloc()[/font]. This is quite important on compute capability 1.x devices to achieve good memory access patterns. Your 2.0 device is much less susceptible to misaligned accesses, so I would not bother with that at the moment.
Then your array works exactly the same as in ordinary C, which comes in handy: you can reuse it, without copying, as input to a 1D reduction routine as laid out in the SDK.
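
To make that reuse-as-1D point concrete, here is a deliberately simple sketch of the first stage of a sum reduction over all width*height pixels of a plain (unpitched) device array; it is not the optimized SDK version, and partialSums, d_image and d_blockSums are placeholder names. You would copy the few per-block sums back, add them up on the CPU, and divide by width*height to get the average:

[indent]// One partial sum per block, via a tree reduction in shared memory.
// Assumes blockDim.x is a power of two (e.g. 256).
__global__ void partialSums(const float *data, int n, float *blockSums)
{
    extern __shared__ float sdata[];
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? data[i] : 0.0f;           // pad the last block with zeros
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}

// Host code: treat the width x height image as one flat array of n floats.
// int n       = width * height;
// int threads = 256;
// int blocks  = (n + threads - 1) / threads;
// cudaMalloc((void **)&d_blockSums, blocks * sizeof(float));
// partialSums<<<blocks, threads, threads * sizeof(float)>>>(d_image, n, d_blockSums);[/indent]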

I did not mean to say in my previous post that you should keep dimBlock.y and dimBlock.z at one. All I wanted to say is that it probably makes no difference performance-wise as long as the total number of threads per block remains the same. Actually, a dimBlock.y > 1 is good here, as it keeps dimBlock.x small so that the image width stays closer to a multiple of dimBlock.x and fewer threads go to waste at the edge of the image.
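
For example, with your roughly 5000-pixel-wide image and 256 threads per block (numbers just for illustration):

[indent]dim3 wide(256, 1, 1);    // gridDim.x = (5000 + 255) / 256 = 20  -> 20 * 256 = 5120 threads per row, 120 idle
dim3 narrow(32, 8, 1);   // gridDim.x = (5000 + 31)  / 32  = 157 -> 157 * 32 = 5024 threads per row,  24 idle[/indent]

Both blocks contain 256 threads, but the narrower one wastes far fewer threads at the right-hand edge of the image.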