Thread and block partition

Hi,

I have been programming CUDA applications for 2 months, and I have a question about how CUDA distributes blocks and threads to multiprocessors.

Say I have a global function:

     __global__ void myfunction()
     {
         ...........
         float ftemp[9];
         ...........
     }

    
     int main()
     {
         ........
         myfunction<<<8, 112>>>();
         .........
     }


For performance reasons, I want the ftemp[9] array to reside in registers, so the best configuration is:

For the 8600 GTS, each multiprocessor takes 2 blocks, and each block contains 112 threads, so the register budget per thread is 8192 / (2 * 112), roughly 36 registers, and ftemp[9] can be allocated in registers.

But how can I be sure that CUDA will be clever enough to do this, rather than letting each multiprocessor take 4 blocks and allocating ftemp[9] in slow local memory?

Please help me, thanks.

As far as I have read so far, shared memory is as fast as registers, so declare your array as shared and avoid bank conflicts. That should give you the same speed as registers.
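
For example, something like this (an untested sketch; THREADS_PER_BLOCK is a name I made up to match the <<<8, 112>>> launch above). Laying the array out as [element][thread] means consecutive threads read consecutive 32-bit words, which keeps a half-warp on 16 distinct banks:

     #define THREADS_PER_BLOCK 112

     __global__ void myfunction()
     {
         // one 9-element column per thread; the [element][thread] layout
         // avoids bank conflicts because consecutive threads access
         // consecutive 32-bit words
         __shared__ float ftemp[9][THREADS_PER_BLOCK];
         int tid = threadIdx.x;

         for (int i = 0; i < 9; ++i)
             ftemp[i][tid] = 0.0f;   // placeholder work on this thread's column
         // ...
     }

At 9 * 112 * 4 = 4032 bytes per block, two resident blocks use about 8 KB of the 16 KB of shared memory per multiprocessor, so this should not hurt your 2-blocks-per-multiprocessor plan.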

Thank you for the reply, but I think accessing shared memory needs additional address calculation, and shared memory is meant for inter-thread cooperation.

Run nvcc with the --keep option and check the resulting .cubin file. It will show you how many registers, how much shared memory, and how much local memory your kernel uses.
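
For example (the numbers below are made up, and the exact fields may vary with the CUDA version):

     nvcc --keep myfile.cu

Then look for your kernel's section in myfile.cubin:

     code {
         name = _Z10myfunctionv
         lmem = 36     <-- 9 floats * 4 bytes spilled to local memory
         smem = 28
         reg  = 10
     }

lmem = 0 is what you want to see; something like lmem = 36 would mean ftemp was spilled.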

If you access ftemp with non-constant indices, it will most likely be placed in local memory.
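
A minimal sketch of the difference (assuming the compiler fully unrolls the loop; check the .cubin as described above to confirm):

     __global__ void myfunction()
     {
         float ftemp[9];

         #pragma unroll                 // trip count is a compile-time constant,
         for (int i = 0; i < 9; ++i)    // so after unrolling every index is
             ftemp[i] = 2.0f * i;       // constant and ftemp can stay in registers

         // float v = ftemp[threadIdx.x % 9];  // a runtime index like this
         //                                    // forces ftemp into local memory
     }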