Shared memory and the CUDA occupancy calculator

Hi all,

I have a question about shared memory and the CUDA occupancy calculator.

I have a kernel that uses 40 registers and 204+196 bytes of smem.

I have 8192 registers, so I could launch 192 threads in parallel if I use a multiple of 32. According to the CUDA occupancy calculator, I reach 25% multiprocessor occupancy with a block size of 64. Because of the register usage, only 3 blocks would run on an MP at once.

But what about my shared memory? If I can use 16 kB per block and I start 64 threads, each using 404 bytes, I would exceed the memory space!?

What is wrong with my calculation? Or do I have to choose a maximum of 32 threads per block so as not to exceed the memory? But then why is that not mentioned in the calculator?!
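The register arithmetic in the question can be checked with a short host-side sketch in plain C. The limits passed in (8192 registers and 768 resident threads per multiprocessor) are assumptions chosen to match the figures quoted above, the function names are hypothetical, and the model ignores any register allocation granularity the hardware may apply.

```c
/* Largest thread count that fits in the register file, rounded down
 * to a whole number of warps (e.g. 8192 / 40 = 204 -> 192). */
int threads_limited_by_regs(int regs_per_mp, int regs_per_thread,
                            int warp_size)
{
    int threads = regs_per_mp / regs_per_thread;
    return (threads / warp_size) * warp_size;
}

/* Resident blocks per multiprocessor when registers are the only limit
 * (e.g. 8192 / (40 * 64) = 3). */
int blocks_limited_by_regs(int regs_per_mp, int regs_per_thread,
                           int block_size)
{
    return regs_per_mp / (regs_per_thread * block_size);
}

/* Occupancy in percent: resident threads over the assumed hardware
 * maximum of resident threads per multiprocessor. */
int occupancy_percent(int blocks, int block_size, int max_threads_per_mp)
{
    return 100 * blocks * block_size / max_threads_per_mp;
}
```

With 40 registers per thread and 64-thread blocks, these give 192 threads, 3 blocks, and 25% occupancy, which agrees with the calculator result quoted above.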

The reported shared memory is the amount per block, so you do not multiply it by the number of threads.

Just so I get it right: does that mean that the amount of shared memory reported by the compiler is the memory used by the whole block, not by one thread?

Yes. That follows from the fact that the static shared memory per block is fixed at compile time.
If you also allocate shared memory dynamically through the third parameter of the kernel launch configuration, you have to add the amount you request at runtime to the reported amount to get the total smem used by a block.
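As a sketch of that bookkeeping in plain C: the per-block total is the compiler-reported static amount plus whatever is requested at launch, and it is this total, not a per-thread product, that limits how many blocks fit. The function names are hypothetical, the 16384-byte capacity used in the note below is the 16 kB figure from the question, and the model ignores any allocation granularity the hardware may apply.

```c
#include <limits.h>

/* Shared memory one block needs: the static amount reported by the
 * compiler plus the dynamic amount requested at launch. The number of
 * threads in the block does not enter into it. */
int smem_per_block(int static_bytes, int dynamic_bytes)
{
    return static_bytes + dynamic_bytes;
}

/* Number of blocks that fit by shared memory alone. A result of 0 means
 * even a single block does not fit. */
int blocks_limited_by_smem(int smem_capacity, int per_block)
{
    if (per_block <= 0)
        return INT_MAX;  /* no shared memory used, so no limit from it */
    return smem_capacity / per_block;
}
```

With the 404 bytes from the question and no dynamic allocation, `blocks_limited_by_smem(16384, 404)` gives 40, so the three-block register limit is the binding one. Requesting a further 16000 bytes dynamically pushes the per-block total to 16404 bytes and the result drops to 0.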

In general, the maximum number of threads you can run is:

min(MaxThreadsLimitedByRegs(), MaxThreadsLimitedBySharedMem()).

In my code, I solve it in this manner:

#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage)\
	min((DeviceProps)->maxThreadsPerBlock,\
	(((DeviceProps)->regsPerBlock / (16 * (RegUsage))) & ~3) * 16)

#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread)\
	(((((int)(DeviceProps)->sharedMemPerBlock - (SharedMemUsage)) / (SharedMemPerThread)) /\
	(DeviceProps)->warpSize) * (DeviceProps)->warpSize)

DeviceProps - pointer to the CUDA structure describing the device;

RegUsage - number of registers your kernel eats;

SharedMemUsage - number of bytes of shared mem that are occupied by the kernel (this value is from the cubin file);

SharedMemPerThread - number of bytes of shared mem each thread requires.
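A worked example may help here. The sketch below copies the two macros (with parenthesized arguments) into plain C and applies them to a stand-in struct. `FakeDeviceProps`, `MIN_INT`, and `max_threads_per_block` are hypothetical names; the struct only mirrors the four `cudaDeviceProp` fields the macros read, so the arithmetic can be tried without a GPU.

```c
#include <stddef.h>

/* Plain-C replacement for min(), used only in this sketch. */
#define MIN_INT(a, b) ((a) < (b) ? (a) : (b))

/* Hypothetical stand-in for the cudaDeviceProp fields used below. */
typedef struct {
    int    maxThreadsPerBlock;
    int    regsPerBlock;
    size_t sharedMemPerBlock;
    int    warpSize;
} FakeDeviceProps;

/* Register limit, rounded down to a multiple of 64 threads. */
#define MAX_THREADS_PER_BLOCK_REGS(DeviceProps, RegUsage) \
    MIN_INT((DeviceProps)->maxThreadsPerBlock, \
            (((DeviceProps)->regsPerBlock / (16 * (RegUsage))) & ~3) * 16)

/* Shared memory limit, rounded down to a whole number of warps. */
#define MAX_THREADS_PER_BLOCK_SHMEM(DeviceProps, SharedMemUsage, SharedMemPerThread) \
    (((((int)(DeviceProps)->sharedMemPerBlock - (SharedMemUsage)) / (SharedMemPerThread)) / \
      (DeviceProps)->warpSize) * (DeviceProps)->warpSize)

/* Largest block size allowed by both registers and shared memory. */
int max_threads_per_block(const FakeDeviceProps *p, int reg_usage,
                          int smem_usage, int smem_per_thread)
{
    int by_regs  = MAX_THREADS_PER_BLOCK_REGS(p, reg_usage);
    int by_shmem = MAX_THREADS_PER_BLOCK_SHMEM(p, smem_usage, smem_per_thread);
    return MIN_INT(by_regs, by_shmem);
}
```

Assuming a device with 8192 registers, 16384 bytes of shared memory, a warp size of 32, and at most 512 threads per block, a kernel with 40 registers, 404 bytes of fixed shared memory, and an assumed 64 bytes per thread gets 192 threads from the register limit and 224 from the shared memory limit, so the result is 192. Raising the per-thread amount to 128 bytes makes shared memory the limit, at 96 threads.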

First of all, thanks for the explanation; things are getting clearer. But I have one last question: since the compiler does not check the dynamic memory allocation, what happens if I allocate more than the maximum shared memory size? Is it spilled to global memory, like local memory, or do I get a launch failure?

Shared memory will not be swapped to global memory. I believe you will get a launch failure (it is easy to check, though).