Usage of shared memory

In the NVIDIA Ampere GPU Architecture Tuning Guide (Ampere Tuning Guide 12.8 documentation), I noticed the following statement:
“CUDA reserves 1 KB of shared memory per thread block.”
Does that mean that, if I choose the 132KB shared memory carve-out per SM and I want 4 thread blocks on an SM, I should use 32KB instead of 33KB of shared memory per thread block?

Yes, that is what it means.
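If it helps, here is roughly how that configuration would look in code (a sketch of my own, not from the tuning guide; the kernel name, launch dimensions and percentages are made-up assumptions):

#include <cuda_runtime.h>

extern __shared__ unsigned char smem[];          // dynamic shared memory

__global__ void my_kernel() { /* ... uses smem ... */ }

void configure_and_launch()
{
    // Hint that we prefer a large shared-memory carve-out on this SM.
    // The attribute is a percentage of the maximum (164 KB on A100) and the
    // driver rounds to a supported size, so ~81% should map to 132 KB.
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributePreferredSharedMemoryCarveout, 81);

    // Only strictly needed for more than 48 KB of dynamic shared memory per
    // block, but this is the same attribute you would raise for larger
    // per-block sizes (e.g. 65 KB with 2 blocks per SM).
    cudaFuncSetAttribute(my_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 32 * 1024);

    // 32 KB per block: with the 1 KB/block reservation, 4 * (32 + 1) = 132 KB.
    my_kernel<<<dim3(432), dim3(256), 32 * 1024>>>();   // illustrative grid/block
}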

So does that mean that if I need 32 warps on an SM, I should try to use 2 thread blocks (each with 16 warps) instead of 4 thread blocks * 8 warps/thread block, or 8 thread blocks * 4 warps/thread block?

It’s not obvious to me that there should be a difference. Effectively we could imagine that the shared memory usage is per warp. So if you planned for 128KB total shared usage, and you want 32 warps, then that would be 4KB per warp.

I don’t think it matters if you put those 32 warps into two threadblocks (16 warps per threadblock, 64KB shared used per threadblock) or four threadblocks (8 warps per threadblock, 32KB shared used per threadblock).

I normally don’t worry too much about optimizing below 512 threads per threadblock (16 warps). It tends to work pretty well whether on a GPU with 2048 threads per SM or 1536 threads per SM. But there might be specific code designs that could possibly work better with a smaller number of warps per threadblock. An example might be a code that had large variation in work from thread to thread.

However, based on the information you have discussed so far, I don’t think there should be a difference between 16 warps per threadblock/2 blocks, or 8 warps per threadblock/4 blocks, or 4 warps per threadblock/8 blocks. Basically we are talking about occupancy limited by shared memory usage, and there should be no difference in the achievable occupancy in those 3 cases.
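If you want to confirm that, the occupancy API should report the same number of resident warps for all three partitionings. A rough sketch of my own (placeholder kernel, 4 KB of shared memory per warp assumed; exact numbers also depend on the carve-out you selected):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy_kernel() { }   // placeholder for the real kernel

int main()
{
    // Allow up to 64 KB dynamic shared memory per block (needed for the 16-warp case).
    cudaFuncSetAttribute(dummy_kernel,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);

    int warps_per_block[] = {16, 8, 4};
    for (int w : warps_per_block) {
        int    threads = w * 32;
        size_t smem    = (size_t)w * 4 * 1024;          // 4 KB per warp
        int    blocks  = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks, dummy_kernel,
                                                      threads, smem);
        printf("%2d warps/block, %zu KB smem/block -> %d blocks/SM, %d warps/SM\n",
               w, smem / 1024, blocks, blocks * w);
    }
    return 0;
}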

I mean that, if I use 2 thread blocks, I have 130KB of shared memory, but with 4 thread blocks I only have 128KB. 8 thread blocks would be worse, because I would only have 124KB of shared memory.
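To spell out the arithmetic (my own quick check; the only assumption is the 1 KB reserved per resident block):

// With the 132 KB carve-out on A100, usable shared memory is:
//   2 blocks: 132 - 2*1 = 130 KB  (65   KB per block)
//   4 blocks: 132 - 4*1 = 128 KB  (32   KB per block)
//   8 blocks: 132 - 8*1 = 124 KB  (15.5 KB per block)
constexpr int usable_smem_kb(int carveout_kb, int blocks_per_sm)
{
    return carveout_kb - blocks_per_sm * 1;   // 1 KB reserved per resident block
}
static_assert(usable_smem_kb(132, 2) == 130, "2 blocks leave 130 KB");
static_assert(usable_smem_kb(132, 4) == 128, "4 blocks leave 128 KB");
static_assert(usable_smem_kb(132, 8) == 124, "8 blocks leave 124 KB");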

I have never tested it in detail, but is this related to the non-power-of-2 shared memory carve-out sizes?

E.g. for cc 10.0:

0 / 8 / 16 / 32 / 64 / 100 / 132 / 164 / 196 / 228 KiB (of 256 KiB)

The smaller numbers are powers of 2; the larger numbers are 4 KiB higher than a multiple of 32 KiB (96, 128, 160, 192, 224).

So (only) the larger shared memory sizes (perhaps for ‘performant compatibility’ or for technical reasons) are optimized for 4 thread blocks, giving a nice number per block?

So the shared memory size of previous generations with a single thread block can be replicated with 4 thread blocks now?

0 / 1 / 3 / 7 / 15 KiB per block (ignore)
24 / 32 / 40 / 48 / 56 KiB per block (compatible sizes)

And with more than 4 thread blocks one gets odd maximum numbers again; there one is on one's own.
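To make that concrete (a small sketch of my own, again assuming the 1 KiB per block reservation): with 4 resident blocks each carve-out maps to a familiar per-block size.

// Per-block dynamic shared memory for 4 resident blocks, cc 10.0 carve-outs:
//   carve-out:  8  16  32  64  100  132  164  196  228   KiB
//   per block:  1   3   7  15   24   32   40   48   56   KiB
constexpr int per_block_kib(int carveout_kib, int blocks = 4)
{
    return (carveout_kib - blocks * 1) / blocks;   // subtract 1 KiB reservation per block
}
static_assert(per_block_kib(100) == 24, "100 KiB carve-out -> 24 KiB per block");
static_assert(per_block_kib(132) == 32, "132 KiB carve-out -> 32 KiB per block");
static_assert(per_block_kib(228) == 56, "228 KiB carve-out -> 56 KiB per block");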

There is no fundamental reason why one cannot merge several thread blocks into one, if shared memory size is critical. There are (with PTX assembly) fast synchronization barriers for parts of a thread block, and one can index shared memory with thread IDs.
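A minimal sketch of what I mean (my own illustration; the kernel, sizes and barrier IDs are arbitrary): two logical blocks of 512 threads merged into one 1024-thread block, each half synchronizing with a named bar.sync instead of __syncthreads() and indexing its own slice of the shared allocation.

extern __shared__ float smem[];                 // one allocation shared by both halves

__device__ void sync_half(int barrier_id)
{
    // Named barrier: only the 512 threads of one logical block participate,
    // so the two halves do not wait on each other (unlike __syncthreads()).
    asm volatile("bar.sync %0, 512;" :: "r"(barrier_id) : "memory");
}

__global__ void merged_kernel(const float* in, float* out, int elems_per_half)
{
    int half = threadIdx.x / 512;               // which logical block this thread is in
    int tid  = threadIdx.x % 512;               // thread id within the logical block
    float* my = smem + half * elems_per_half;   // index shared memory by logical block

    int base = (blockIdx.x * 2 + half) * elems_per_half;
    for (int i = tid; i < elems_per_half; i += 512)
        my[i] = in[base + i];

    sync_half(half + 1);                        // barrier 1 or 2; barrier 0 is used by __syncthreads()

    // Reverse within the half: reads data written by other threads of the
    // same half, so the named barrier above is required.
    for (int i = tid; i < elems_per_half; i += 512)
        out[base + i] = my[(elems_per_half - 1) - i];
}

// Launch e.g. as: merged_kernel<<<grid, 1024, 2 * elems_per_half * sizeof(float)>>>(in, out, elems_per_half);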

The only disadvantages are granularity when fitting onto different architectures, e.g. 1024 threads/block does not fit well with a 1536-thread maximum per SM, and that it becomes harder for several kernels to run in parallel (multiple streams).

I run my experiments on an A100. So I only have 0, 8, 16, 32, 64, 100, 132 or 164 KB per SM to choose from. I wonder whether it would be better to use 2 thread blocks instead of 4 to have an extra 2KB of shared memory.

Depends on your algorithm; see the comment edited in at the end of my last reply.

Have you tested whether the 1KB is a current practical limit, or whether it is just reserved by Nvidia and currently still available even with more thread blocks?

I believe the reduction of 1KB per threadblock is at least partly due to declarations made by cooperative groups. I haven’t checked to see if you could use it for yourself if you are not using cooperative groups.

And yes, your observation is correct. If you are using the maximal amount of shared memory, then fewer threadblocks will be better, if your algorithm can actually benefit from having e.g. 4.0625KB per warp (130KB for two blocks/32 warps) as opposed to 4KB per warp (128KB for four blocks/32 warps).

Thanks! I will try it.

Thanks, I got it.

It seems to be undefined behavior to use the reserved portion of the shared memory.
Quoting from the PTX ISA 8.7 documentation:

.sreg .b32 %reserved_smem_offset_begin;
.sreg .b32 %reserved_smem_offset_end;
.sreg .b32 %reserved_smem_offset_cap;
.sreg .b32 %reserved_smem_offset_<2>;

These are predefined, read-only special registers containing information about the shared memory region which is reserved for the NVIDIA system software use. This region of shared memory is not available to users, and accessing this region from user code results in undefined behavior.
