I don't know what the problem is…


I’m writing a program to calculate forces between particles (molecular simulation). The kernel works fine for n < 256 (n = number of particles, a power of 2), compared against the same calculation done on the CPU, but for n >= 256 I see strange behavior: I have to run the program 2 or 3 times to get the same results between GPU and CPU, and when n is large (around 2048) I can never get the same result… in the end I only see zeros in the forces calculated by the GPU.

I checked the amount of memory, and the device (GeForce GTX 550 Ti) has plenty of global memory available. Does anyone know what the problem could be? Do I have a problem with the transfer and copying of memory between device and host? I’m reading the Programming Guide, but I haven’t found the answer. I use cudaMalloc() and cudaMemcpy() to allocate and copy data.

The kernel uses a grid of blocks, so the thread indices i = threadIdx.x and j = threadIdx.y run from 0 to n-1. An array of size n*n stores the pair forces (between particles i and j) at index k = i + j*n. First the kernel calculates all the pair forces, then synchronizes the threads, and finally performs a sum over j to obtain the total force on each particle i. The sum is done by reduction (which is why n is a power of 2).


It’s really difficult to say without seeing any code. I’ll take a shot in the dark, though, and remind you that __syncthreads() only synchronizes threads within a block, not between blocks.

Check the CUDA Programming Guide again… you’ll find a section that discusses the maximum thread block size. The point where your program starts generating zeros is almost certainly because you are either 1) requesting a dimension larger than the device is capable of, or 2) exhausting the register or shared memory resources available to each block. Either would be detected if you called cudaThreadSynchronize() after the kernel (for debug purposes) and then checked the value of cudaGetLastError().
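For example (the kernel name and launch configuration are placeholders; kernel launches are asynchronous, so a bad configuration or an execution error only surfaces after a synchronization point):

```cuda
myKernel<<<grid, block>>>(d_force, n);
cudaThreadSynchronize();              // wait for the kernel to finish
cudaError_t err = cudaGetLastError(); // also reports launch/config errors
if (err != cudaSuccess)
    printf("CUDA error: %s\n", cudaGetErrorString(err));
```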

Clearly, the problem is that the kernel exceeds the number of registers per block. Compiling with --ptxas-options=-v I got 25 registers per thread. The total number of threads in the kernel is n*n, and if n = 256 that is 65536. So the total number of registers needed by the kernel is 65536 * 25 = 1638400 (a lot :s)… My device supports 8 resident blocks per multiprocessor and it has 4 MPs. With 32 blocks max, the number of registers per block that the kernel needs is 51200. Since the maximum number of registers per block for my device is 49152… my kernel won’t work with n >= 256 particles.

I will work on decreasing the register usage of the kernel. Probably the grid is not a good option for my program. I will comment soon… :)

Thanks again.

The maximum number of registers given in the Programming Guide is per Multiprocessor (SM), not for the whole GPU. And if it is not sufficient to support the maximum number of blocks per SM, the number of concurrent blocks per SM will automatically be reduced. So for successful execution it is only necessary to have enough registers for at least one block.

As you haven’t given the block size you are using, I can only guess from the numbers you give that you are using a block size of 64 threads. In this case the total register usage per block would be 25 * 64 = 1600 (plus a few registers due to register bank alignment), well within the limits of your device.

Also, the fact that your kernel returns the correct results when launched a few times with the same parameters indicates that you do not have a problem with the launch configuration (in which case the kernel would never execute at all). Instead this seems to point towards a synchronization issue, like the one I mentioned in my previous post.

Finally, I’m surprised that for n particles you are launching n² threads. How do you sum the contributions to the force on one particle: are you using atomicAdd for that? This seems inefficient, and I would rather expect n threads, each looping over all other n-1 particles.
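Something along these lines (pair_force() stands in for your actual interaction; positions are reduced to one float per particle just for the sketch):

```cuda
// One thread per particle i; each thread loops over all other particles.
__global__ void forces(const float *x, float *f, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float fi = 0.0f;
    for (int j = 0; j < n; ++j)
        if (j != i)
            fi += pair_force(x[i], x[j]);  // placeholder interaction
    f[i] = fi;  // no atomics or inter-block reduction needed
}
```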

I read from the original post that a single block of n x n threads is being launched. In that case, you hit the thread limit of a block long before you hit register limits… the max is 1024 threads per block.

Also, the force evaluation for molecular simulation on GPUs is a solved problem - why not learn from others (such as me!): http://codeblue.umich.edu/hoomd-blue/about.html#paper - read General purpose molecular dynamics simulations fully implemented on graphics processing units; it’s old, but all of the algorithms except the neighbor list are still the fastest around today. HOOMD-blue is also open source, so you can download it, run it, or poke around and see how we do things.

Thanks for the information. I will read the paper for the details of the CUDA implementation. All this about registers is a little confusing…

I work on molecular simulations in my Ph.D. program. I have written molecular dynamics programs in Fortran, and I only heard about CUDA recently, so I am trying to avoid loops completely (hence the n * n threads) and to program in C instead of Fortran. I don’t have time right now to read the paper, but in a few days I will. At least I have downloaded it to my computer. :D

Thanks again!

I have a doubt… If I have a grid of blocks, I have two dimensions in the thread index: threadIdx.x and threadIdx.y. Does this mean that each block has (number of threads in x) * (number of threads in y) threads? From the Programming Guide I understand this is the situation when you use <<<dim3, dim3>>> variables in the kernel call.

Thanks !

Yes, that is true for a two-dimensional block.
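For example (kernel name and arguments are placeholders):

```cuda
dim3 block(16, 16);                       // 16 * 16 = 256 threads per block
dim3 grid((n + 15) / 16, (n + 15) / 16);  // enough blocks to cover n x n
myKernel<<<grid, block>>>(d_data, n);
// inside the kernel, the global indices are then:
//   int i = blockIdx.x * blockDim.x + threadIdx.x;
//   int j = blockIdx.y * blockDim.y + threadIdx.y;
```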

Is there any way to synchronize the blocks? I’m thinking of calling the kernel several times from the host to force the synchronization.

Yes, multiple kernel invocations are the way to go, because blocks can’t be synchronized.
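For example (kernel names and launch configurations are placeholders): splitting the computation into two launches in the same stream guarantees that no block of the second kernel starts before every block of the first has finished.

```cuda
computePairForces<<<grid, block>>>(d_pair, n);    // step 1: all pair forces
sumForces<<<grid2, block2>>>(d_pair, d_force, n); // step 2: reduce over j
cudaThreadSynchronize();  // only needed before reading results on the host
```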

It works !!! :D :D :D XD :D XD XD :). I’m very happy.

Thanks a lot!!!.

However I still have doubts:

1.- There is a maximum number of registers per MP. The device can have 8 resident blocks per MP, so the registers of the MP are shared between the 8 resident blocks. When I set the maximum threads for the kernel I have to consider both the register maximum and the shared memory, and at the moment I’m doing this assuming the 8 resident blocks. Is that excessive, or should I consider only one block when limiting the threads in the kernel?

2.- The maximum number of blocks for the device is 65536. When I have a grid, is this limit for each of blocksPerGrid.x and blocksPerGrid.y, or for the total (blocksPerGrid.x * blocksPerGrid.y)? The Programming Guide says:

“Maximum x- or y-dimension of a grid of thread blocks = 65536”

It’s probably easy, but I don’t understand very well what it means. (I know only a bit of English and I don’t understand it completely.)