strange problem with double precision and math functions

I have a kernel that uses some double precision math functions: sqrt(), log(), and exp(). It is compiled with -arch sm_13. When the total number of threads is 1024*50, the kernel launches successfully and the results are correct. But when the number of threads is more than that, say 1024*100, the results are wrong, and from the output I can see that the kernel is not launched at all.

I am using a 64-bit machine and a Tesla C1060.

Has anybody had this problem? Or can you give me some suggestions for tracking down the bug?

Thanks

You probably have a race condition somewhere.

Actually, if the kernel is not running at all, it sounds more like you are running out of registers for your block. Can you be clearer about your grid and block size? Are you running 1024 blocks with 50 or 100 threads each? Also, can you compile with nvcc --ptxas-options=-v? That will show how many registers per thread your kernel requires.
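For example (kernel.cu stands in for your actual source file):

nvcc -arch sm_13 --ptxas-options=-v kernel.cu

You can also ask the runtime directly whether the launch failed and why. A minimal sketch, with a placeholder kernel and sizes standing in for yours:

#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; substitute your own here.
__global__ void mykernel(double *out)
{
    out[blockIdx.x * blockDim.x + threadIdx.x] = 0.0;
}

int main()
{
    int nThreads = 1024 * 100;
    double *d_out;
    cudaMalloc((void **)&d_out, nThreads * sizeof(double));

    mykernel<<<nThreads / 256, 256, 256 * sizeof(double)>>>(d_out);

    cudaError_t err = cudaGetLastError();   // did the launch itself fail?
    if (err != cudaSuccess)
        printf("launch error: %s\n", cudaGetErrorString(err));

    err = cudaThreadSynchronize();          // did the kernel fail while running?
    if (err != cudaSuccess)
        printf("execution error: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}

If the problem is register (or shared memory) pressure, the first check will print "too many resources requested for launch".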

Thank you for your help.

The number of threads per block is 256, and the number of blocks is 1024*50 (or 1024*100) divided by 256.

I also use shared memory; its size is 256*sizeof(double) bytes.

All these numbers worked fine when I used single precision, and I didn't change them when I switched to double precision.

When I replace the double precision math functions in the kernel with simple operations such as addition and multiplication, the kernel always launches, no matter how many threads there are.

When you switch from single to double precision, the number of registers required per thread will increase (by up to a factor of two), because a double precision value occupies two registers rather than the single register a float needs. This is why compiling with --ptxas-options=-v is important: it tells you how many registers each thread needs. If that number times 256 (your threads per block) is greater than 16384, then your kernel won't even run.
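To put numbers on it: at 64 registers per thread, a block of 256 threads needs 64 × 256 = 16384 registers, exactly the size of the register file on one sm_13 multiprocessor; at 65 registers per thread the block no longer fits and the launch fails. If you want to check the hardware limits at runtime, a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("registers per block:  %d\n", prop.regsPerBlock);
    printf("shared mem per block: %d bytes\n", (int)prop.sharedMemPerBlock);
    return 0;
}

On a Tesla C1060 this reports 16384 registers and 16384 bytes of shared memory.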

I compiled with the -Xptxas -v option and got:

ptxas info : Compiling entry function '_Z5mcgpuPdS_dddj'
ptxas info : Used 22 registers, 64+64 bytes smem, 36 bytes cmem[1]

But I don't quite understand the ptxas info; I cannot see how it relates to what I am using: 256 threads per block, 1024*50 (or 1024*100) divided by 256 blocks, and 256*sizeof(double) bytes of shared memory.
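(From c++filt, _Z5mcgpuPdS_dddj is just the mangled name of my kernel, mcgpu(double*, double*, double, double, double, unsigned int). If the 22 registers reported are per thread, then 22 × 256 = 5632 registers per block, which is well below 16384, so I do not see why the launch would fail.)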

I found some clues in section 5.2 of the CUDA programming guide. I tried -maxrregcount=512 to raise the register limit, and it really does solve the problem. Is this the proper way to solve it?

Is there a more detailed introduction anywhere?

Thank you again for your kind help.

When I increase the number of threads further, it fails again.