Hi,
I have been programming CUDA applications for 2 months, and I have a question about how CUDA distributes blocks and threads to multiprocessors.
Say I have a global function:
__global__ void myfunction()
{
...........
float ftemp[9];
...........
}
int main()
{
........
myfunction<<<8, 112>>> ();
.........
}
For performance, I want the ftemp[9] array to reside in registers.
I think the best configuration is this:
On the 8600 GTS, each multiprocessor takes 2 blocks, and each block contains 112 threads, so each thread can use up to 8192 / (2 * 112) = 36.57, i.e. 36 registers, which should be enough for ftemp[9] to be kept in registers.
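To spell out the arithmetic, here is a small host-side sketch in plain C++ with no GPU calls. The constants are just the figures I am assuming in this post (8192 registers per multiprocessor, 112 threads per block), the function name is made up for illustration, and it ignores any register allocation granularity the hardware may apply, so treat the result as an upper bound only.

```cpp
// Figures assumed in the post, not queried from the device.
constexpr int kRegistersPerMultiprocessor = 8192;
constexpr int kThreadsPerBlock = 112;

// Hypothetical helper: upper bound on registers per thread when
// `blocksPerMultiprocessor` blocks are resident on one multiprocessor.
// Integer division rounds down, since a thread cannot use part of a register.
constexpr int registerBudgetPerThread(int blocksPerMultiprocessor) {
    return kRegistersPerMultiprocessor /
           (blocksPerMultiprocessor * kThreadsPerBlock);
}

// Compile-time checks of the two cases discussed in the post.
static_assert(registerBudgetPerThread(2) == 36, "2 blocks: 8192 / 224");
static_assert(registerBudgetPerThread(4) == 18, "4 blocks: 8192 / 448");
```

So with 2 blocks per multiprocessor the budget is 36 registers per thread, and with 4 blocks it drops to 18.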
But how can I make sure that CUDA is clever enough to do this, rather than letting each multiprocessor take 4 blocks and allocating ftemp[9] in slow local memory?
Please help me, thanks.