Particles, threads and blocks

I’m working on a simple Smoothed Particle Hydrodynamics problem, treating one particle as one thread, initially with 400 particles/threads and eventually growing to tens of thousands.

I am wondering what happens when I declare my grid configuration as
#define NTOTAL 400

dim3 dimGrid(1,1);
dim3 dimBlock(NTOTAL,1,1);

//Launch the device computation
device_Calculate<<<dimGrid, dimBlock>>>(…)

as compared to

dim3 dimGrid(1,1);
dim3 dimBlock(32,13,1);

//Launch the device computation
device_Calculate<<<dimGrid, dimBlock>>>(…)

with
int threadid = blockIdx.x*blockDim.x + threadIdx.x;

if (threadid < NTOTAL)
{
    // per-particle work
}

///////
Questions I have are:

  1. What is the difference between the two configurations?

  2. Is there a more efficient configuration, and if so, why?

Thanks in advance.

There’s no efficiency difference between using a 1D vs. a 2D set of threads; it’s mostly a convenience. So what remains is the question of firing off 400 threads in one block, or firing off 32*13 = 416 threads and then using a test to do your work in just the first 400.

The answer is that it really doesn’t matter MUCH: if you run them you’ll get the same result, and there won’t be any noticeable runtime difference.

In the first case, the hardware runs the block in warps of 32 threads at a time, and for the last warp half the threads will be disabled/skipped.

In the second case, the exact same thing happens, except that your own explicit test in the code is what discards the last 16 unneeded threads.
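
To spell out the warp arithmetic (my own numbers, they just follow from the 32-thread warp size and the two launches above):

int warps_method1 = (400 + 31) / 32;  // = 13 warps; the 13th warp has 400 - 12*32 = 16 active threads
int warps_method2 = (32 * 13) / 32;   // = 13 warps as well; your if(threadid < NTOTAL) test idles the last 16 threads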

But the first method is still preferred for three reasons. First, your code is shorter and cleaner.

The second reason is potential future flexibility and efficiency. If later hardware allowed finer-grained warps of, say, 8 threads, the manual method #2 would waste scheduling work by creating and firing off extra warps that are immediately killed by your manual thread ID test, whereas method #1 (400 threads) would map onto the finer warps exactly.

The final, and most important reason to use method #1: the explicit manual test method is harder to maintain and understand because you have one thread size and one “effective” size. You have to keep them in sync with each other, and use your brain to think about their relationship. It’s unneeded complexity.

It’s actually a good question; you have to think a little about how the kernel threads are scheduled.

So the preferred method is #1: let the CUDA driver and hardware decide the scheduling, and don’t try to manually tweak it. It won’t really make much difference in practice, but that’s all the more reason not to manually muck with the threads.

That is the answer I was anticipating. Thanks.

I find that 1D grids are much simpler to map to particle systems (I’m doing MD, which isn’t so different from SPH):

dim3 dimGrid((NTOTAL + block_size - 1) / block_size, 1);   // integer round-up; ceil() of an integer division would truncate first

dim3 dimBlock(block_size, 1, 1);

...

__global__ void kernel()
{
    int pidx = blockIdx.x * blockDim.x + threadIdx.x;

    if (pidx < NTOTAL)   // needed when NTOTAL isn't a multiple of block_size
    {
        // perform calculations on particle pidx
    }
}

This setup nicely gives coalescing when you read array[pidx], assuming the array elements are 32-, 64-, or 128-bit types. It also makes block_size a variable. When you have a working kernel, benchmark it with block sizes from 32 up to the maximum you can run (due to register/shared-memory limitations), in multiples of 32. Many kernels change performance by 50% or more just from changing the block size.
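
A minimal sketch of that block-size sweep (my own code, not from the post above; device_Calculate and its argument list are stand-ins for your kernel), timed with CUDA events:

for (int block_size = 32; block_size <= 512; block_size += 32)   // stop earlier if registers/shmem limit the block size
{
    int grid_size = (NTOTAL + block_size - 1) / block_size;   // round up so every particle gets a thread

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    device_Calculate<<<grid_size, block_size>>>(/* kernel arguments */);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("block_size %3d : %.3f ms\n", block_size, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}

Do a warm-up launch before the loop so one-time initialization cost doesn’t skew the first measurement.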

Maybe you already found out, but in your second configuration you are using blockIdx.x while you only have 1 block, so blockIdx.x*blockDim.x is always 0 and threadid collapses to threadIdx.x: it will be between 0 and 31, and the 13 threads that share each threadIdx.x (but differ in threadIdx.y) all get the same calculated threadid.

You want dimGrid(13,1,1) and dimBlock(32,1,1) for the threadid calculation as in your example.
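
For completeness, a minimal sketch of that corrected launch combined with the guard from the original post (my paraphrase, not the poster’s exact code):

dim3 dimGrid(13, 1, 1);     // 13 blocks
dim3 dimBlock(32, 1, 1);    // of 32 threads each = 416 threads total

device_Calculate<<<dimGrid, dimBlock>>>(…);

// inside the kernel:
int threadid = blockIdx.x * blockDim.x + threadIdx.x;   // now ranges over 0..415
if (threadid < NTOTAL)
{
    // work on particle threadid; the last 16 threads do nothing
}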