Thread Batching and Memory Model

Here are the pages I want to show you:

[url="CUDA Toolkit Documentation"][/url]

If you download the file and look at pages 8 to 11, the guide discusses thread batching and the memory model. Page 9 of the guide shows thread batching, i.e., how the threads are actually packed into a block. This is something I don’t understand. What confuses me more is that, in the figure about the memory model, there are only two threads between the registers and the local memory. My questions are:

  1. From figure 2-1, how exactly are the threads packed into a single block? It seems to me that fifteen threads are mapped into a single block.

  2. When you look at figure 2-2, why are there only two blocks in its grid, while figure 2-1 shows six blocks in a grid? Am I possibly mixing something up?

I was trying to draw a big picture of how the threads and blocks are arranged in memory. Let me know.


These two figures are simple examples and are not connected to each other. Think of them as showing different kernels and different execution configurations.

Threads are not arranged in memory. A thread is just a piece of code that executes on the device.

A thread is just a piece of code that executes on the device

First of all, sorry for my bad wording in question 2.

Yes, we all know what threads are, so I think the important question now is:

  1. What is the maximum number of threads that can be executed in a single block of the grid?

  2. What is the maximum number of blocks that can be handled by a single grid?

  3. At that point we will know how many threads a single grid can handle. The next question is: can we define what work goes to a single thread, or does the hardware determine that?


I think it is 512 threads per block and 65535 blocks per grid in one dimension.

I do not think we can explicitly assign a thread to a certain multiprocessor, but it is not necessary either :)

This information is in the Programming Manual, Appendix A.1:

  1. Max 512 threads per block.
  2. Max grid size is 65535×65535 blocks.
  3. No control over thread scheduling.

Suppose I am about to run an n-body program. Let’s say you have 1024 objects orbiting the center of a system, say, the Earth.

You have stream processors to take care of the gravitational force calculation, and calculating the acceleration and position of each object should also be a piece of cake.

So if we work in 3D here, with x, y and z dimensions:

  1. Exactly how many threads and blocks will we have?

  2. I am also thinking about how to store the position data (this is what matters most for the output). How can we make the best use of locality when storing the data in memory? Which memory exactly would be best for the data? (Consider that global memory is a cause of the bottleneck of the entire execution…)

Let me know.


For n-body problem, you may be better off using 1D grid and 1D block. There isn’t much locality issue except for coalescing and texture cache.
If there are only 1024 bodies, you can just stuff them all into constant memory, then access it as global memory for coalesced read or write, and as constant memory for all-thread-from-same-location read.

The next SDK will include an N-body code.

If you are calculating forces between all N^2 particle pairs, then the best way is to use a sliding window type technique. Each 1D block handles M particles. Each block first loads the first M particles into shared memory, then every thread in the block loops over the shared memory and sums the force between its particle and the current particle in shmem, ignoring self-interactions of course (I can post a simple code demonstrating this if you need it). Then the block moves on to the next M particles. This method uses the shared memory quite effectively and you are only limited by the FLOPs needed for the forces.

Here, there is no need to do anything special for data locality, since every block is performing a fully coalesced read to get particle positions.

If your forces are short range and you only need to sum them within a cutoff, data locality becomes much more important and I use a Hilbert space-filling curve to resort the particles. All aspects of this problem have been solved in a recent paper I submitted.

The n-body sample in CUDA 1.1 SDK is also described in a GPU Gems 3 article (an early version of the sample is included on the book’s DVD).