Thread and block partition


 I have been programming CUDA applications for 2 months, and I have a question about how CUDA distributes blocks and threads to multiprocessors.

Say I have a global function:

     __global__ void myfunction()
     {
         float ftemp[9];   // per-thread scratch array
         // ...
     }

     myfunction<<<8, 112>>>();   // 8 blocks of 112 threads each

 For performance, I want the ftemp[9] array to reside in registers.
 I think the best configuration is therefore:

 On the 8600 GTS, each multiprocessor takes 2 blocks, and each block contains 112 threads, so each thread can use up to 8192 / (2 * 112) = about 36 registers, which is enough for ftemp[9] to be kept in registers.
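
The arithmetic above can be sketched as a small host-side helper. This is only a rough upper bound: the function name and parameters are made up for illustration, the 8192 figure is the one quoted in this post, and register allocation granularity is ignored, so the real limit may be somewhat lower.

```cuda
// Rough upper bound on the registers each thread can use when
// `blocksPerSM` blocks of `threadsPerBlock` threads share one
// multiprocessor that has `regsPerSM` registers in total.
// Hypothetical helper: it ignores allocation granularity.
int regsPerThreadBudget(int regsPerSM, int blocksPerSM, int threadsPerBlock)
{
    return regsPerSM / (blocksPerSM * threadsPerBlock);
}
```

With integer division, regsPerThreadBudget(8192, 2, 112) gives 36, while regsPerThreadBudget(8192, 4, 112) gives 18, which is why the number of blocks per multiprocessor matters here.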

But how can I make sure that CUDA is clever enough to do this, rather than letting each multiprocessor take 4 blocks and allocating ftemp[9] in slow local memory?

Please help me, thanks.

As far as I have read so far, shared memory is as fast as registers, so declare your array as shared and avoid bank conflicts. That should give you the same speed as registers.

Thank you for the reply, but I think accessing shared memory needs additional address calculation, and shared memory is meant for inter-thread cooperation.

Run nvcc with the --keep option and check the resulting .cubin file. It will show you how many registers and how much shared and local memory your kernel uses.

If you access ftemp with non-constant indexes, it will most likely be placed in local memory.
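
To illustrate the difference, here is a minimal sketch with two kernels. The kernel names, the pick[] array and the host helper are hypothetical and not from the original post. Whether ftemp really ends up in registers is still up to the compiler, so check the .cubin as described above.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define N_TEMP 9  // size of the per-thread array from the original post

// Once the loops are unrolled, every index into ftemp is a compile-time
// constant, so the compiler is free to keep the array in registers.
__global__ void sumStaticIndex(const float *in, float *out)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ftemp[N_TEMP];

    #pragma unroll
    for (int i = 0; i < N_TEMP; ++i)
        ftemp[i] = in[tid * N_TEMP + i];

    float acc = 0.0f;
    #pragma unroll
    for (int i = 0; i < N_TEMP; ++i)
        acc += ftemp[i];

    out[tid] = acc;
}

// The index read from pick[] is only known at run time, so ftemp will
// most likely be placed in local memory.
__global__ void pickDynamicIndex(const float *in, const unsigned int *pick, float *out)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float ftemp[N_TEMP];

    #pragma unroll
    for (int i = 0; i < N_TEMP; ++i)
        ftemp[i] = in[tid * N_TEMP + i];

    out[tid] = ftemp[pick[tid] % N_TEMP];
}

// Hypothetical host helper: fills the input with 1.0f, runs sumStaticIndex
// with the 8 x 112 launch from the post, and returns the result of thread 0.
// Error checking is omitted for brevity.
float runSumStaticIndex()
{
    const int blocks = 8, threads = 112, n = blocks * threads;
    std::vector<float> hIn(n * N_TEMP, 1.0f), hOut(n, 0.0f);
    float *dIn = 0, *dOut = 0;

    cudaMalloc((void **)&dIn, hIn.size() * sizeof(float));
    cudaMalloc((void **)&dOut, hOut.size() * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), hIn.size() * sizeof(float), cudaMemcpyHostToDevice);

    sumStaticIndex<<<blocks, threads>>>(dIn, dOut);

    cudaMemcpy(hOut.data(), dOut, hOut.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut[0];
}
```

Compiling both kernels with --keep and comparing the local memory reported for each should show whether the compiler treated them differently on your setup.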