I have a kernel that uses some math functions: sqrt(), log() and exp(). All are in double precision, compiled with -arch sm_13. When the number of threads is 1024*50, the kernel is loaded successfully and the results are correct. But when the number of threads is larger than that, say 1024*100, the results are not correct. From the output, I can see that the kernel is not loaded at all.

I am using 64-bit Ubuntu 8.04 and a Tesla 1060.

Has anybody had this problem? Or can you give me some suggestions on how to look for the bug?
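One way to look for the bug is to check the error status right after the launch, since a failed launch is otherwise silent. Below is a minimal sketch, assuming a hypothetical kernel `mathKernel` and made-up buffer names in place of the real code. On older toolkits, `cudaThreadSynchronize()` takes the place of `cudaDeviceSynchronize()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel (not the original code).
__global__ void mathKernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrt(in[i]) + log(in[i]) + exp(in[i]);
}

// Print a message and return false if a CUDA call failed.
static bool ok(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int n = 1024 * 100;              // total number of threads
    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up

    double *in = NULL, *out = NULL;
    if (!ok(cudaMalloc((void **)&in, n * sizeof(double)), "cudaMalloc in")) return 1;
    if (!ok(cudaMalloc((void **)&out, n * sizeof(double)), "cudaMalloc out")) return 1;
    if (!ok(cudaMemset(in, 0, n * sizeof(double)), "cudaMemset")) return 1;

    mathKernel<<<blocks, threadsPerBlock>>>(in, out, n);

    // Launch-time failures (bad configuration, too many resources requested).
    bool launched = ok(cudaGetLastError(), "kernel launch");
    // Failures that only show up while the kernel is running.
    bool finished = launched && ok(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(in);
    cudaFree(out);
    return finished ? 0 : 1;
}
```

The error string printed here usually narrows things down a lot, because it distinguishes a launch that never started from a kernel that started and then failed.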

The number of threads per block is 256. The number of blocks is 1024*50/256 (or 1024*100/256 in the failing case).

I also use shared memory. The size of the shared memory is 256.

All these numbers worked well when I used float precision. I didn't change them when I switched to double precision.

When I use simple operations, such as addition and multiplication, in the kernel instead of the double precision math functions, the kernel is always loaded, no matter how many threads there are.

I found some clues in section 5.2 of the CUDA Programming Guide. I tried -maxrregcount -512 to increase the number of registers. This really does solve the problem. Is this the proper way to solve it?

You’ve got the experience. Try to minimize the number of registers per thread, because it affects the maximum possible number of threads per block. You should also think about increasing the number of blocks per kernel; this usually improves concurrency.
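To see how many registers a kernel really uses, and what block size that allows, the runtime can be queried with `cudaFuncGetAttributes`. Below is a minimal sketch with a hypothetical `mathKernel` in place of the real kernel. Compiling with `--ptxas-options=-v` prints similar numbers at build time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel (not the original code).
__global__ void mathKernel(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sqrt(in[i]) + log(in[i]) + exp(in[i]);
}

int main()
{
    // Ask the runtime what resources the compiled kernel needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, mathKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread      : %d\n", attr.numRegs);
    printf("static shared mem (bytes) : %lu\n", (unsigned long)attr.sharedSizeBytes);
    printf("max threads per block     : %d\n", attr.maxThreadsPerBlock);

    // Block size taken from the question above.
    const int threadsPerBlock = 256;
    if (threadsPerBlock > attr.maxThreadsPerBlock)
        printf("a block of %d threads is too large for this kernel\n", threadsPerBlock);
    else
        printf("a block of %d threads fits\n", threadsPerBlock);
    return 0;
}
```

If the chosen block size is larger than the reported maximum, the launch fails before any thread runs, so comparing the two numbers is a quick sanity check.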