emu vs debug, different values

No. With 8192 registers per multiprocessor and 60 registers per thread, 8192 / 60 gives at most 136 threads per block, so a block size of 128 should still work. If it does not, you are probably using too much shared memory when going to a larger block size (if your shared memory arrays depend on your block size).

I’m not using shared memory at all. Every thread needs 60 registers. So how many blocks of 128 threads can I run per grid? My math is:

8192 / (60 * 128) ≈ 1

No, that is not how the calculation works (you can check it in the occupancy calculator).

You can run a grid of up to 65535 x 65535 blocks. A block runs on a single multiprocessor, and that multiprocessor has 8192 registers. So as long as 1 block of 128 threads uses no more than 8192 registers, your code will run just fine, no matter how many blocks are in the grid. If a block uses no more than 4096 registers, 2 blocks can run at the same time on 1 multiprocessor.
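
The register arithmetic in this thread can be written out as a small host-side C++ sketch. The function names and the constant are made up for illustration, and the model only looks at registers: it ignores register allocation granularity, shared memory, and the hardware limits on resident threads and blocks, all of which the occupancy calculator takes into account.

```cpp
#include <cassert>

// Register file size per multiprocessor, as quoted in this thread (assumption
// for this hardware generation; other GPUs differ).
const int kRegistersPerMultiprocessor = 8192;

// Largest block size whose threads fit in one multiprocessor's register file.
inline int max_threads_per_block(int regs_per_thread,
                                 int regs_per_mp = kRegistersPerMultiprocessor) {
    assert(regs_per_thread > 0);
    return regs_per_mp / regs_per_thread;  // integer division rounds down
}

// Number of blocks that can be resident on one multiprocessor at the same time,
// counting registers only. A result of 0 means a single block does not fit,
// so the kernel cannot launch with that block size.
inline int resident_blocks(int regs_per_thread, int threads_per_block,
                           int regs_per_mp = kRegistersPerMultiprocessor) {
    assert(regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_mp / (regs_per_thread * threads_per_block);
}
```

With 60 registers per thread, `max_threads_per_block(60)` is 136 and `resident_blocks(60, 128)` is 1: one block per multiprocessor at a time, while the grid itself can still contain any number of blocks. At 32 registers per thread, `resident_blocks(32, 128)` is 2.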

Found my problem, thank you. One more question: where can I read what the “maxrregcount” compiler parameter really does, and how can it help me increase performance? :)

I think it is in the NVCC documentation. The option tries to limit the number of registers per thread to the specified value. The downside is that values which no longer fit in registers are spilled to (slow) local memory, so it often reduces performance instead of improving it.

A better option is often to use shared memory to offload some of your variables, as shared memory is much, much faster. You just have to make sure that each thread gets its own element in the shared memory array.
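
One way to picture "its own element per thread" is the indexing below. This is only a host-side C++ sketch of the index calculation, with a hypothetical helper name; in a real kernel the array would be declared `__shared__`, `tid` would be `threadIdx.x`, and `threads_per_block` would be `blockDim.x`.

```cpp
#include <cassert>
#include <cstddef>

// Slot in a flat array that holds `threads_per_block` elements for each
// offloaded variable. Variable `var` of thread `tid` lives at
// var * threads_per_block + tid, so no two threads ever share a slot.
inline std::size_t slot_index(std::size_t var, std::size_t tid,
                              std::size_t threads_per_block) {
    assert(tid < threads_per_block);
    return var * threads_per_block + tid;
}

// Total number of elements the array needs for `num_vars` offloaded variables.
inline std::size_t slots_needed(std::size_t num_vars,
                                std::size_t threads_per_block) {
    return num_vars * threads_per_block;
}
```

With this layout, neighbouring threads access neighbouring elements for the same variable. The array grows with the block size, so a larger block needs more shared memory.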

Thank you again!

kyprizel,

The code I was adapting also required 60 registers per thread. I used the maxrregcount compile option and set it to 15. I actually saw a nice performance increase because I was able to have more blocks per SM.

Then you probably only needed a small amount of local memory to hold the spilled registers. If you can find out from the generated PTX which variables get offloaded to local memory, you can try to use shared memory for them. That might get you another nice speedup.

I agree with you: double precision is not supported on the 8800GTS. I learned that the hard way, but I am happy that I am getting what I want. I changed all the doubles to floats and it works now.