I tracked down a bug in my code to a problem with the smallest, most gentle-looking piece of GPU kernel code you could ever imagine:
for (int j = 0; j < m; j++)
{
    ddfab = p*c_abcd - dfab*s_abcd; // not a problem
    dfab = p*s_abcd + dfab*c_abcd;  // PROBLEM LINE !!!
    p = ddfab;                      // again, no problem
}
Now, my only guess is that there might be a race condition (or something else I don’t quite understand) on the problem line “dfab = p*s_abcd + dfab*c_abcd;”, because the variable dfab needs its own value in order to update itself. If I take this line out, the kernel returns values; if I leave it as shown above, the kernel fails. By ‘failing’ I mean that every variable returned from the kernel is 0 in device mode, even though the algorithm works on the CPU in several languages and the GPU code itself works in emulation mode. When I take out that line, or take out any self-reference, the kernel at least returns values. I should perhaps mention that this is a “__global__” kernel, all variables inside the loop are declared as “float”, and “m” is declared as an int. Pulling the code out of the loop doesn’t help either: the kernel still fails on the first instance of the problem line.
So, to restate: no problem doing this on the CPU, and the CPU gives correct results. No problem running the code in emulation mode either, where it returns the same good values as the CPU. But when I run the code on the GPU in device mode, things go crazy and the kernel returns all zeros in place of the results.
Weird. There really shouldn’t be any problem with that code.
Since I know you are probably putting this in HOOMD, have you enabled the gpu_error_checking flag (gpu_error_checking = true, or --gpu_error_checking on the command line)? Or if this is outside HOOMD, call cudaThreadSynchronize() after the kernel and then check to see if there is an error returned by cudaGetLastError().
I’m suggesting checking the error values because my guess is that adding that innocent piece of code increases the register usage to the point where block_size * registers per thread is greater than the registers available on one multiprocessor, and CUDA throws the error “Too many resources requested for launch”. The solution would then be to decrease the block size.
It’s like you know what I want before I know I want it!!!
Yes, the code is for HOOMD. And yes, turning on that flag threw the “too many resources…” error. So, I simply decreased the block size by 1/2 and passed this to my kernel and magically it all seems to work. However, I assume the blocksize may change depending on the card, so is there a preferred way to set the blocksize without worrying about running out of registers? Is the proper block size only known empirically?
Thanks again, you have really saved me from my debugging nightmare!
OK, you have to read the mangled name carefully to figure out which function call it is, but the important line here is “reg = 22”. That is the register usage per thread. To launch, regs per thread * block size must not exceed the number of registers per MP (8192 on older hardware and 16384 on newer hardware, so yes, it is hardware dependent).
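To make that arithmetic concrete, here is a minimal host-side sketch of the register budget check. The function names, the 512-thread cap and the 32-thread warp size are my own assumptions for illustration, not anything from HOOMD or the CUDA API; on a real card you would take the limits from cudaGetDeviceProperties. Real hardware also allocates registers in chunks, so treat the result as an optimistic upper bound.

```cpp
#include <algorithm>
#include <cassert>

// Largest block size (rounded down to a whole warp) that the register file
// alone allows. regs_per_thread is the value reported by the compiler
// (22 in the post above); regs_per_mp is 8192 or 16384 per the post.
// max_threads_per_block and warp_size are ASSUMED defaults for illustration.
int max_block_size_for_registers(int regs_per_thread, int regs_per_mp,
                                 int max_threads_per_block = 512,
                                 int warp_size = 32) {
    assert(regs_per_thread > 0 && regs_per_mp > 0 && warp_size > 0);
    int by_registers = regs_per_mp / regs_per_thread;  // integer floor
    int limit = std::min(by_registers, max_threads_per_block);
    return (limit / warp_size) * warp_size;            // whole warps only
}

// True if block_size threads at regs_per_thread each fit in one multiprocessor.
bool launch_fits(int block_size, int regs_per_thread, int regs_per_mp) {
    return block_size * regs_per_thread <= regs_per_mp;
}
```

With 22 registers per thread, a 512-thread block needs 11264 registers, which does not fit in 8192 but does fit in 16384; a 256-thread block needs 5632 and fits on both.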
Now, what you may find for complicated kernels is that the performance also depends on the block size. The only way to determine the optimal performance is empirically by benchmarking every block size that makes sense and picking the fastest. In HOOMD, I just updated the automatic scripts for doing this last week. The following script will run a ~5 minute benchmark of all major kernels in HOOMD and print out the optimal block sizes.
from hoomd_script import *
Though, I haven’t written the code yet for loading the saved optimal block sizes and using them…
Anyways, this is getting more specific to HOOMD, so shoot me an e-mail if you have a question about how tune.find_optimal_block_sizes() works now or if you can’t figure out how to add your code to the list of things it benchmarks.
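For code outside HOOMD, the "benchmark every block size and pick the fastest" idea can be sketched roughly like this. This is not how tune.find_optimal_block_sizes() is implemented; fastest_block_size and the time_launch callback are hypothetical names, and the callback is assumed to launch your kernel with the given block size, synchronize, and return the elapsed time in seconds.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <limits>
#include <vector>

// Returns the candidate block size with the smallest measured time,
// or -1 if the candidate list is empty.
// time_launch is a HYPOTHETICAL user-supplied callback (see lead-in).
int fastest_block_size(const std::vector<int>& candidates,
                       const std::function<double(int)>& time_launch,
                       int repeats = 3) {
    assert(repeats > 0);
    int best = -1;
    double best_time = std::numeric_limits<double>::infinity();
    for (int block_size : candidates) {
        // Keep the minimum over several runs to damp timing noise.
        double t = std::numeric_limits<double>::infinity();
        for (int r = 0; r < repeats; ++r) {
            t = std::min(t, time_launch(block_size));
        }
        if (t < best_time) {
            best_time = t;
            best = block_size;
        }
    }
    return best;
}
```

Only pass in block sizes that actually fit the register budget of the card, otherwise the launch fails and the timing means nothing.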