I’m sure I’m misunderstanding the language here. But the CUDA device query sample tells me my P100 has 64 KB of “constant memory” and 48 KB of “shared memory per block.” Block diagrams of the P100 SM indicate that it has 64 KB of shared memory and a 24 KB L1 cache. Can someone explain how this is consistent?
One SM can run multiple thread blocks simultaneously, and those 64 KB are split between those blocks. 48 KB is the maximum amount of shared memory that a single thread block can use, independent of the SM and the CUDA version, i.e. it’s a software limit that ensures a CUDA program can run on any hardware and with any CUDA version.
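To make that split concrete, here is a minimal sketch (the kernel names and bodies are made up for illustration): a kernel that statically allocates the full 48 KB can only have one block resident per P100 SM, while one that allocates 32 KB can have two.

```cuda
// kernel_a requests the 48 KB per-block maximum. Since 2 * 48 KB > 64 KB,
// at most one of its blocks can be resident on a P100 SM at a time.
__global__ void kernel_a(float *out)
{
    __shared__ float buf[12 * 1024];   // 12K floats = 48 KB
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

// kernel_b requests only 32 KB, so two of its blocks fit within the
// SM's 64 KB of shared memory and can run concurrently.
__global__ void kernel_b(float *out)
{
    __shared__ float buf[8 * 1024];    // 8K floats = 32 KB
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}
```

Shared memory is only one of several occupancy limiters (registers and thread count matter too), but it is the one relevant to your question.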
Constant memory also has nothing in common with the L1 cache. It is a separate 64 KB region of device memory, and reads from it go through a dedicated read-only constant cache rather than through L1.
Thank you for the reply, that makes sense. So if I don’t have two blocks per SM I’m “wasting” 16 KB of shared memory. Bummer.
Would you mind differentiating constant memory and the L1 cache? Which specifiers do they get?
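In short: constant memory gets the `__constant__` specifier and is filled from the host with `cudaMemcpyToSymbol`, while the L1 cache has no specifier at all — ordinary global loads may be cached in L1/L2 automatically. A minimal sketch (the kernel and variable names are made up for illustration):

```cuda
#include <cstdio>

// __constant__ places coeffs in the 64 KB constant memory space;
// reads of it go through the dedicated constant cache, not L1.
__constant__ float coeffs[4];

__global__ void poly(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // x[i] is an ordinary global-memory load: the hardware may
        // cache it in L1/L2 with no specifier needed.
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

int main()
{
    const float h_coeffs[4] = {1.f, 2.f, 3.f, 4.f};
    // The host writes constant memory via cudaMemcpyToSymbol.
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));
    // ... allocate x/y, launch poly<<<blocks, threads>>>(...), etc.
    return 0;
}
```

Constant memory works best when all threads in a warp read the same address (as with `coeffs` here); divergent addresses serialize the constant-cache accesses.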
As a general rule for the efficient use of GPUs, one should always strive to have at least two thread blocks running per SM.
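You don’t have to work this out by hand: the CUDA runtime can report how many blocks of a given kernel fit per SM. A sketch using the real `cudaOccupancyMaxActiveBlocksPerMultiprocessor` API (the kernel itself is a made-up example):

```cuda
#include <cstdio>

// Example kernel using 32 KB of shared memory per block.
__global__ void my_kernel(float *out)
{
    __shared__ float buf[8 * 1024];   // 32 KB
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = buf[threadIdx.x];
}

int main()
{
    int blocksPerSM = 0;
    // Ask the runtime how many blocks of my_kernel can be resident per SM,
    // assuming 256-thread blocks and no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, my_kernel, /*blockSize=*/256, /*dynamicSMemSize=*/0);
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

If this reports 1 for your kernel, reducing its shared-memory footprint (or register usage) is the usual way to get a second block resident.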