Help with some CUDA programming

Hello everyone,

I am trying to write my first CUDA kernel and am having some problems. Before I even try to use shared memory and other expert tricks, I want to get it working using just global memory.

I have the following code in C++, which I would like to parallelize using CUDA.

[codebox]
for (int k = 0; k < sizeZ; ++k)
{
    indexZ = k * numZ;
    for (int j = 0; j < sizeY; ++j)
    {
        indexY = j * numY;
        for (int i = 0; i < sizeX; ++i)
        {
            indexX = i * numX;
            for (int z = indexZ; z < endZ; ++z)
            {
                for (int y = indexY; y < endY; ++y)
                {
                    for (int x = indexX; x < endX; ++x)
                    {
                        // Some calculations go here
                    }
                }
            }
        }
    }
}
[/codebox]

Can someone suggest the best approach to parallelizing this? How many kernels should I use, and how should I organize them?

Sorry for the newbie question. Hope someone can help.

Many thanks,

xargon

Hi!
You should launch the kernel in three dimensions: one for 0…sizeX, one for 0…sizeY and one for 0…sizeZ. Have a look in the manuals and do it like the matrix addition sample, just with three dimensions instead of two.
Then run the z, y, x loops (as in your CPU code) inside each thread and do your calculations there.
If you write some code and post it we can correct it here…
Philipp.
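To make that concrete, here is a rough sketch of the kernel structure Philipp describes: one thread per (i, j, k) of the three outer loops, with the inner z/y/x loops kept serial inside each thread. The parameter list is a guess (pass whatever your calculation actually needs), and the launch-configuration limits discussed further down the thread still apply:

```cuda
// Sketch only: one thread replaces one (i, j, k) iteration of the
// three outer CPU loops. Parameter names mirror the CPU code.
__global__ void MyKernel(int sizeX, int sizeY, int sizeZ,
                         int numX,  int numY,  int numZ,
                         int endX,  int endY,  int endZ)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;

    // Guard threads that fall outside the data space (needed whenever
    // the launch configuration is padded up to a multiple of the block).
    if (i >= sizeX || j >= sizeY || k >= sizeZ)
        return;

    int indexX = i * numX;
    int indexY = j * numY;
    int indexZ = k * numZ;

    // The inner loops stay serial inside each thread.
    for (int z = indexZ; z < endZ; ++z)
        for (int y = indexY; y < endY; ++y)
            for (int x = indexX; x < endX; ++x)
            {
                // Some calculations go here
            }
}
```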

Thanks Philipp. Really appreciate it. I will give it a shot today.

One quick question: in my code sizeX, sizeY and sizeZ can be quite large (512, 512, 512). I am a little bit confused as to how to organize the thread blocks and grid.

So, can I do something like:

[codebox]
dim3 dimBlock(512, 512, 512);
MyKernel<<<1, dimBlock>>>();
[/codebox]

[codebox]
__global__ void MyKernel()
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
}
[/codebox]

How do I organize my blocks, grids and threads to execute as much in parallel as possible? Also, what happens when the input data dimensions are larger than the number of blocks and threads that can be launched at once, for example (1024, 1024, 1024)?

Thanks,

xarg

No, you can’t. A block can have at most 512 threads, and it is often better (or necessary) to use fewer than that per block. Grids can only be two dimensional, but your 512x512x512 data space is easily flattened and contained within a 1024x1024 grid of blocks, each containing 128 threads, for example.

Thanks for the reply. Just another quick question then. So, as you pointed out, the maximum number of threads per block is 512. What about the number of blocks one can have? Is there any restriction on that?

Also, what if my data space is small (say 4 x 4 x 4)? What is the optimum way to create the threads then? I guess I could create one (8 x 8) block with 64 threads. Is that really naive, or is there a better way to decide this?

I would be working with data that could be as large as (512 x 512 x 512) or as small as (4 x 4 x 4) and am trying to figure out what considerations go into deciding on how to allocate the thread resources.

Many thanks,

xarg

The limits are laid out in Appendix A of the programming guide, but the current grid dimension limit is 65535x65535.

Blocks can be three dimensional if you want, with dimension limits of (512,512,64) but only a maximum of 512 total threads in a block. So you could easily create a single 4x4x4 block if you so desired.
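So the small case could be launched like this (a sketch, assuming a kernel named MyKernel that takes no arguments):

```cuda
dim3 block(4, 4, 4);        // 4*4*4 = 64 threads, one per data element
MyKernel<<<1, block>>>();   // a single block covers the whole 4x4x4 space
```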

I have found that the easiest way to do this is to choose a modest starting block size (for example 64 or 128 threads) and code for those block dimensions. Smaller cases can usually be handled simply by padding the input data to match the block size, and larger cases by padding the data space to conform to multiples of the block size.

Once you have a running kernel you can start experimenting with block size to see what will give the best throughput (the occupancy spreadsheet supplied with the SDK is useful for this). Block sizes which are multiples of 16/32 are usually preferable for memory coalescing and occupancy reasons.

NVIDIA now ship a rather useful “best practices” guide with the toolkit. I recommend taking the time to read it, because it contains a lot of useful information about selecting execution parameters and performance optimisation.

Maybe I misunderstand the programming guide and I never tried it myself, but shouldn’t it be 65k in each dimension? -> (65535 * 65535 * 65535)

Philipp.

It is 65k in each dimension, but grids are only two dimensional.

Ok thanks! I didn’t know that! Why did they do that? Not very user-friendly…