shared memory and CUDA calculator

tomasito · October 25, 2008, 8:16pm

Hi all,

I have a question about the shared memory and CUDA occupancy calculator.

I have a kernel that uses 40 registers per thread and 204+196 bytes of shared memory (smem).

The device has 8192 registers per multiprocessor (MP), so I could launch at most 192 threads in parallel if I stick to a multiple of 32. According to the CUDA occupancy calculator, I reach 25% occupancy with a block size of 64: because of the register usage, only 3 blocks can run on an MP at a time.

But what about shared memory? If I can use 16 kB per block and I start 64 threads, each using 400 bytes, I would exceed the available memory!?

What is wrong with my calculation? Or do I have to choose a maximum of 32 threads per block so as not to exceed the memory? And if so, why is that not mentioned in the calculator?!

E.D_Riedijk · October 25, 2008, 8:32pm

Shared memory is reported as the amount per block, so you do not multiply it by the number of threads.
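
To make the per-block accounting concrete, here is a rough sketch of how the number of resident blocks per multiprocessor could be estimated for the kernel in the question. The names `SmLimits`, `blocksPerSm` and `occupancyPercent` are made up for illustration. The limits (8192 registers, 16 KB shared memory, 768 threads and 8 blocks per multiprocessor) are assumed for a compute capability 1.0/1.1 device. Allocation granularity, which the real occupancy calculator takes into account, is ignored.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-multiprocessor limits, assumed for a compute 1.0/1.1 device.
struct SmLimits {
    int registers      = 8192;   // 32-bit registers per multiprocessor
    int sharedMemBytes = 16384;  // shared memory per multiprocessor
    int maxThreads     = 768;    // resident threads per multiprocessor
    int maxBlocks      = 8;      // resident blocks per multiprocessor
};

// Simplified estimate that ignores register and shared memory allocation granularity.
// regsPerThread is counted per thread; smemPerBlock is counted per block (NOT per thread).
int blocksPerSm(const SmLimits& sm, int threadsPerBlock, int regsPerThread, int smemPerBlock) {
    assert(threadsPerBlock > 0 && regsPerThread > 0 && smemPerBlock >= 0);
    int byRegs    = sm.registers / (regsPerThread * threadsPerBlock);
    int byThreads = sm.maxThreads / threadsPerBlock;
    int bySmem    = smemPerBlock > 0 ? sm.sharedMemBytes / smemPerBlock : sm.maxBlocks;
    return std::min({byRegs, byThreads, bySmem, sm.maxBlocks});
}

// Occupancy in percent: resident threads divided by the maximum resident threads.
int occupancyPercent(const SmLimits& sm, int threadsPerBlock, int blocks) {
    return 100 * blocks * threadsPerBlock / sm.maxThreads;
}
```

With 64 threads per block, 40 registers per thread and 400 bytes of shared memory per block, the register limit allows 3 blocks while shared memory alone would allow far more, so this sketch arrives at the same 25% the calculator reports.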

tomasito · October 25, 2008, 8:41pm

Just so I get this right: does that mean the amount of shared memory reported by the compiler is the memory used by the whole block, not by one thread?

E.D_Riedijk · October 26, 2008, 5:13am

Yes. It comes from the fact that the statically allocated shared memory per block is decided at compile time.
If you also allocate shared memory dynamically by means of the third parameter in the kernel call, you have to add the amount you request at runtime to the reported amount to get the total smem used by a block.
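
As a small illustration of that sum, the sketch below adds the static and dynamic amounts and compares the total with a per-block limit. The function names are hypothetical, and the limit is passed in as a plain number; on a real device it would typically be read from `cudaDeviceProp::sharedMemPerBlock`.

```cpp
#include <cassert>
#include <cstddef>

// Total shared memory one block needs: the static amount reported by the
// compiler plus the dynamic amount requested at launch (third <<<...>>> parameter).
std::size_t totalSmemPerBlock(std::size_t staticBytes, std::size_t dynamicBytes) {
    return staticBytes + dynamicBytes;
}

// True if a block with this footprint fits within the per-block limit.
// limitBytes is supplied by the caller (an assumption of this sketch).
bool fitsInSharedMem(std::size_t staticBytes, std::size_t dynamicBytes, std::size_t limitBytes) {
    assert(limitBytes > 0);
    return totalSmemPerBlock(staticBytes, dynamicBytes) <= limitBytes;
}
```

For example, 400 bytes of static shared memory plus a dynamic buffer of 64 floats comes to 656 bytes per block, well within a 16 KB limit.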

Romant · October 26, 2008, 8:48am

In general, the maximum number of threads per block you can run is:

min(MaxThreadsLimitedByRegs(), MaxThreadsLimitedBySharedMem()).

In my code, I solve it in this manner:

#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage)\
	min(DeviceProps->maxThreadsPerBlock,\
	((DeviceProps->regsPerBlock / (16 * RegUsage)) & ~3) * 16)

#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread)\
	(((((int)DeviceProps->sharedMemPerBlock - SharedMemUsage) / SharedMemPerThread) /\
	DeviceProps->warpSize) * DeviceProps->warpSize)

DeviceProps - pointer to the CUDA structure describing the device (cudaDeviceProp);

RegUsage - number of registers your kernel uses per thread;

SharedMemUsage - number of bytes of shared memory occupied by the kernel (this value comes from the cubin file);

SharedMemPerThread - number of bytes of shared memory each thread requires.
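
For readers who want to try the arithmetic without a GPU, here is a sketch of the same two limits written as plain functions. `DeviceLimits` is a hypothetical stand-in for the four `cudaDeviceProp` fields the macros read, and the function names are made up; the formulas themselves are copied from the macros above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for the cudaDeviceProp fields used by the macros.
struct DeviceLimits {
    int maxThreadsPerBlock;
    int regsPerBlock;
    int sharedMemPerBlock;  // bytes
    int warpSize;
};

// Same arithmetic as MAX_THREADS_PER_BLOCK_REGS: the result is a multiple of 64.
int maxThreadsByRegs(const DeviceLimits& d, int regUsage) {
    assert(regUsage > 0);
    return std::min(d.maxThreadsPerBlock,
                    ((d.regsPerBlock / (16 * regUsage)) & ~3) * 16);
}

// Same arithmetic as MAX_THREADS_PER_BLOCK_SHMEM: rounded down to whole warps.
int maxThreadsBySharedMem(const DeviceLimits& d, int sharedMemUsage, int sharedMemPerThread) {
    assert(sharedMemPerThread > 0);
    return ((d.sharedMemPerBlock - sharedMemUsage) / sharedMemPerThread)
           / d.warpSize * d.warpSize;
}

// The block size limit is the smaller of the two.
int maxThreadsPerBlock(const DeviceLimits& d, int regUsage,
                       int sharedMemUsage, int sharedMemPerThread) {
    return std::min(maxThreadsByRegs(d, regUsage),
                    maxThreadsBySharedMem(d, sharedMemUsage, sharedMemPerThread));
}
```

Assuming a device with 512 threads per block, 8192 registers, 16384 bytes of shared memory and a warp size of 32, a kernel using 40 registers is limited to 192 threads by registers. If it also used 400 bytes of fixed shared memory plus 64 bytes per thread (an illustrative figure, not from the thread), shared memory would allow 224 threads, so registers remain the tighter limit.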

tomasito · October 26, 2008, 11:26am

First of all, thanks for the explanation; things are getting clearer. But I have one last question: since the compiler does not check the dynamic memory allocation, what happens if I allocate more than the maximum shared memory size? Is it spilled to global memory, like local memory, or do I get a launch failure?

Romant · October 26, 2008, 4:38pm

Shared memory will not be swapped to global memory. I believe you will get a launch failure (it is easy to check, though).
