strange problem with double precision and math functions

I have a kernel that uses some double-precision math functions: sqrt(), log(), and exp(). Everything is in double precision, compiled with -arch sm_13. When the number of threads is 1024*50, the kernel launches successfully and the results are correct. But when the number of threads is more than 1024*50, say 1024*100, the results are wrong. From the output, I can tell that the kernel is not launched at all.

I am using 64-bit Ubuntu 8.04 and a Tesla C1060.

Has anybody had this problem? Or can you give me some suggestions for tracking down the bug?

Thanks

Did you check the returned error code? I guess you are out of resources (shared memory or registers…).
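For example, right after the launch (a minimal sketch; myKernel, its argument, and the launch configuration are placeholders for your own):

    #include <stdio.h>

    myKernel<<<grid, block, smemBytes>>>(d_data);

    // launch errors, e.g. "too many resources requested for launch"
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    // errors during execution only surface once you synchronize
    err = cudaThreadSynchronize();
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));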

Thank you for your help

The number of threads per block is 256. The number of blocks is 1024*50/256 (or 1024*100/256 in the failing case).

I also use shared memory; the shared memory size is 256.
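Putting that together, the launch looks more or less like this (the kernel name and argument are placeholders, and I am assuming here that the 256 shared elements are doubles):

    dim3 block(256);                       // 256 threads per block
    dim3 grid(1024 * 50 / 256);            // 200 blocks; 400 in the failing 1024*100 case
    size_t smem = 256 * sizeof(double);    // dynamic shared memory per block
    myKernel<<<grid, block, smem>>>(d_data);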

All these numbers worked fine when I used single precision. I didn't change them when I switched to double precision.

When I tried using simple operations in the kernel, such as addition and multiplication, instead of the double-precision math functions, the kernel always launched, no matter how many threads there were.

It seems to be out of registers. When you switch from float to double, the required number of registers roughly doubles.
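You can check the actual usage by compiling with verbose ptxas output, e.g. "nvcc -arch sm_13 --ptxas-options=-v yourkernel.cu"; it prints something like "ptxas info: Used NN registers" for each kernel. As a rough budget: an sm_13 multiprocessor has 16384 registers, so with 256 threads per block, anything over 16384 / 256 = 64 registers per thread means the block can no longer launch.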

I found some clues in section 5.2 of the CUDA Programming Guide. I tried -maxrregcount=512 to raise the register limit, and this really did solve the problem. Is this the proper way to solve it?

Thank you for your kind help.

But when I increase the number of threads further, it fails again.

You’ve got the experience now. Try to minimize the number of registers per thread, because it affects the maximum possible number of threads per block. You should also think about increasing the number of blocks per kernel; this usually improves concurrency.
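For example, halving the block size halves the per-block register demand while the grid still covers the same total number of threads (an illustrative sketch; myKernel and its argument are placeholders):

    dim3 block(128);                        // fewer threads per block -> smaller register footprint per block
    dim3 grid(1024 * 100 / 128);            // 800 blocks still cover the same 102400 threads
    myKernel<<<grid, block, smem>>>(d_data);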