Grids and Threads question

Okay, everyone seems to be posting questions about threads and grids, but none of the answers leaves me feeling confident about setting the thread and grid parameters for <<<>>> and then using them correctly in the function on the GPU.

I’ve been managing so far by just using the MatrixMul example and modifying it for my needs. However, I now need to implement some much more complicated algorithms and really need a full understanding of how I should be using these parameters. Can someone give me a quick tutorial (preferably with some sample code) on how I should be setting the values for threads and grids, and how I use them to access data in the functions? My application at the moment runs on an FPGA and calls the one function about 1 million times, performing the same calculation on an array of data.

I know this is a totally newbie question but the programming guide isn’t that clear and a reply to this would probably speed up the development time for lots of other people just getting into CUDA.




I haven’t been into CUDA for a very long time, but I think I can (in a way) help you.

The size of the blocks affects performance, as do the number of registers and the amount of shared memory you are using.

To find those quantities, just compile your code with the -cubin option, which produces a .cubin file.

code  {
    name = timestep_KER
    lmem = 0
    smem = 56
    reg = 11
    bar = 0
    bincode  {
    //interesting things here  ^_^
    const  {
        segname = const
        segnum = 1
        offset = 0
        bytes = 8
        mem  {
            0x000003ff 0x7f800000




Here is a sample of mine. The file contains other info, but for now we only need these fields (and I don’t understand the rest of the file :rolleyes: ).

Name is the name of your kernel, of course.

Smem is the amount of shared memory (in bytes) required by your kernel. Take care not to declare too many shared variables, so that the GPU can run a high number of blocks concurrently on each multiprocessor.

Reg is the number of registers required by the kernel. The same attention should be paid to this value.

With those two values, you can use the CUDA Occupancy calculator (available here).

The calculator provides a little info on how it works. For me, it was enough. If you have questions, just post here again !
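If you would rather query these numbers from code than read the cubin file, newer CUDA runtimes expose them directly. This is only a sketch under assumptions: scale_kernel and report_occupancy are made-up names standing in for your own kernel and helper, and it needs a toolkit recent enough to have cudaOccupancyMaxActiveBlocksPerMultiprocessor. It prints the same reg/smem/lmem figures discussed above, plus how many blocks of a given size fit on one multiprocessor.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for your own: scales an array in place.
__global__ void scale_kernel(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

// Prints register and memory usage of scale_kernel, then the number of blocks
// of block_size threads that can be resident on one multiprocessor.
// Returns that block count, or -1 if a CUDA call fails.
int report_occupancy(int block_size)
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, scale_kernel) != cudaSuccess)
        return -1;
    printf("reg = %d, smem = %zu bytes, lmem = %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes, attr.localSizeBytes);

    int blocks_per_sm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, scale_kernel, block_size, 0) != cudaSuccess)
        return -1;
    printf("block size %d: %d resident blocks per multiprocessor\n",
           block_size, blocks_per_sm);
    return blocks_per_sm;
}
```

Calling report_occupancy(256) from main() is enough to try it. The last argument of the occupancy call is the dynamic shared memory per block, which is 0 here because the kernel uses none.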

To finish, a piece of advice from my experience: use 2D blocks instead of 1D blocks (e.g. 16x16 instead of 256x1). If you’re out of shared memory (or want to save it), use texture memory instead of global memory.
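For what it is worth, here is roughly what indexing with a 2D block looks like when the data itself is 2D. Treat it as a sketch: invert_image and invert_on_gpu are names I made up for illustration, the image is assumed to be stored row by row, and whether 2D blocks actually beat 1D blocks will depend on your access pattern.

```cuda
#include <cuda_runtime.h>

// Each thread handles one pixel of a width x height image stored row by row.
__global__ void invert_image(const float *in, float *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column of this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row of this thread
    if (x < width && y < height)                    // edge blocks may overhang
        out[y * width + x] = 1.0f - in[y * width + x];
}

// Hypothetical host wrapper: copies the image to the GPU, launches 16x16
// blocks, and copies the result back. Returns true on success.
bool invert_on_gpu(const float *h_in, float *h_out, int width, int height)
{
    size_t bytes = (size_t)width * height * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                             // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,      // round up in x
              (height + block.y - 1) / block.y);    // round up in y
    invert_image<<<grid, block>>>(d_in, d_out, width, height);

    // This copy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

The grid is rounded up in both directions so that images whose sides are not multiples of 16 are still fully covered, which is why the kernel needs the bounds check.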

Good luck !

How you set these parameters is entirely dependent on your algorithm. Forget about registers and block size choices, etc… at first. The first and foremost thing that determines how you set these parameters is the way you choose to break up your algorithm! The first step is to think about your algorithm and come up with ways to break it up into blocks that can run independently. There may be several ways to do this.

Here are a few common patterns that I use.

  1. Let’s say we have N array elements and each can be processed independently of the others. The simplest way to index into that is to have each thread handle a single element of the array.
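__global__ void process_array(float *in, float *out, int N)
{
    // global index of this thread across the whole grid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    // the last block may extend past the end of the array
    if (idx < N)
        out[idx] = f(in[idx]);  // f is whatever per-element calculation you need
}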


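This setup would be run with a grid of (N/BLOCK_SIZE+1,1,1) and a block size of (BLOCK_SIZE,1,1). The grid is rounded up so that every element is covered, and the if (idx < N) check keeps the extra threads in the last block from running off the end of the array. Choosing BLOCK_SIZE is complicated because the best value depends on the number of registers and other resources the kernel uses. I’ve found the easiest way is just to run some benchmarks with every possible block size, from 32 to as big as you can run, and pick the one with the fastest timing.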
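To make the launch and the benchmarking concrete, here is a minimal sketch. The body of f() is an arbitrary stand-in, best_block_size is a made-up helper name, and the upper limit of 512 is an assumption chosen because it is safe on old and new hardware alike (newer GPUs allow up to 1024 threads per block). It steps through multiples of 32, the warp size, and times each launch with CUDA events.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for f() in the kernel above: any per-element computation.
__device__ float f(float x) { return 2.0f * x + 1.0f; }

__global__ void process_array(const float *in, float *out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        out[idx] = f(in[idx]);
}

// Times process_array on device buffers d_in/d_out for block sizes
// 32, 64, ..., 512 and returns the block size with the shortest run time.
int best_block_size(const float *d_in, float *d_out, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int best = 0;
    float best_ms = 0.0f;
    for (int block_size = 32; block_size <= 512; block_size += 32) {
        int grid_size = (N + block_size - 1) / block_size;  // round up

        // Untimed warm-up launch so one-time startup cost is not measured.
        process_array<<<grid_size, block_size>>>(d_in, d_out, N);

        cudaEventRecord(start);
        process_array<<<grid_size, block_size>>>(d_in, d_out, N);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the timed kernel is done

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %3d: %.3f ms\n", block_size, ms);
        if (best == 0 || ms < best_ms) {
            best = block_size;
            best_ms = ms;
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return best;
}
```

Note that (N + block_size - 1) / block_size gives the same coverage as N/BLOCK_SIZE+1 but avoids launching one completely idle block when N is an exact multiple of the block size. For a real measurement you would want a large N and several timed runs per block size, since a single launch on a small array is mostly noise.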

  2. To illustrate my point about multiple ways to break up a computation: imagine that a LOT of processing goes into each array element i and that this processing can be done in parallel to some extent. Thus, it may (and I stress MAY) be advantageous to have each BLOCK calculate the value for a single element.
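__global__ void process_array(float *in, float *out, int N)
{
    // one block per array element
    int idx = blockIdx.x;

    // (pseudocode) initialize shared memory by reading in

    // (pseudocode) tmp = parallel processing in block operating on shared memory

    // only one thread per block writes the result
    if (threadIdx.x == 0)
        out[idx] = tmp;
}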


For this setup, the grid would be (N,1,1) and the block size would be (BLOCK_SIZE,1,1). Depending on the parallel algorithm you use to do the calculations, block size may not be a free parameter; it may be determined by the size of your data set or something similar.
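As one concrete instance of the block-per-element pattern, here is a sketch where the "LOT of processing" is assumed to be summing a row of an N x M matrix, so block i produces out[i]. The reduction is my choice of example and not something from the posts above; row_sums and row_sums_on_gpu are made-up names, and BLOCK_SIZE must be a power of two for this particular reduction loop to work.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // must be a power of two for the reduction below

// One block per output element: block i sums row i of a row-major matrix
// with M columns and writes the total to out[i].
__global__ void row_sums(const float *in, float *out, int M)
{
    __shared__ float partial[BLOCK_SIZE];
    int row = blockIdx.x;

    // Initialize shared memory: each thread adds up a strided slice of the row.
    float sum = 0.0f;
    for (int j = threadIdx.x; j < M; j += blockDim.x)
        sum += in[row * M + j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Parallel processing in the block: tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Only one thread per block writes the result.
    if (threadIdx.x == 0)
        out[row] = partial[0];
}

// Hypothetical host wrapper: grid is (N,1,1), block is (BLOCK_SIZE,1,1).
// Returns true on success.
bool row_sums_on_gpu(const float *h_in, float *h_out, int N, int M)
{
    size_t in_bytes = (size_t)N * M * sizeof(float);
    size_t out_bytes = (size_t)N * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, in_bytes) != cudaSuccess)
        return false;
    if (cudaMalloc(&d_out, out_bytes) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }
    cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice);

    row_sums<<<N, BLOCK_SIZE>>>(d_in, d_out, M);

    // This copy waits for the kernel to finish and reports any launch error.
    cudaError_t err = cudaMemcpy(h_out, d_out, out_bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return err == cudaSuccess;
}
```

This is also a case where block size is not entirely free: the shared array is sized by BLOCK_SIZE and the reduction halves the stride each step, so the kernel has to be launched with exactly BLOCK_SIZE threads per block.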

AFTER you have a few ideas and maybe some proof-of-concept code for different ways to arrange your computation into blocks, THEN it is time to start thinking about memory coalescing, divergent warps, shared memory bank conflicts, occupancy and all that other fun stuff. I’ve found that if you obsess about these things too much at first, you can close yourself off to trying an implementation that turns out to be faster (I’m speaking from experience).

If you have questions about how to break a particular type of computation up into blocks, just ask and we’ll be glad to give you some ideas.