Shared/cache memory management for HPC with large data required per thread

On GP100, I’m trying to figure out how best to distribute data between shared memory and (possibly) the L1/texture cache - if anyone can tell me how large that cache is on the Tesla P100, I’d appreciate it! The shared memory size for GP100 devices is 64 KB per SM.

Here are my constraints: I need 0.2 KB of data per thread (double precision) to go into shared memory, so a 4^3 block (the largest that can fit together with its halo) will need ~13 KB in total. On top of that, I need read-only data from the six bordering “walls” of the block, i.e. the outer shell of the 6x6x6 region enclosing the 4x4x4 block (6^3 - 4^3 = 152 sites), which comes to ~31 KB of read-only data per block. I am certain the required amount of memory cannot be reduced any further.
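
To make the arithmetic explicit (a minimal sketch; the constant names are mine, and I’m reading “0.2 KB per site” as 25 doubles):

```
#include <cstddef>

// Budget check for the numbers above; 0.2 KB per site = 25 doubles (25 * 8 B).
constexpr int    DOUBLES_PER_SITE = 25;
constexpr int    INTERIOR_SITES   = 4 * 4 * 4;              // 64 sites
constexpr int    SHELL_SITES      = 6 * 6 * 6 - 4 * 4 * 4;  // 152 halo sites
constexpr size_t INTERIOR_BYTES   = INTERIOR_SITES * DOUBLES_PER_SITE * sizeof(double); // ~12.8 KB
constexpr size_t SHELL_BYTES      = SHELL_SITES * DOUBLES_PER_SITE * sizeof(double);    // ~30.4 KB

static_assert(INTERIOR_BYTES + SHELL_BYTES <= 64 * 1024,      "one full block fits in 64 KB");   // ~43.2 KB
static_assert(2 * (INTERIOR_BYTES + SHELL_BYTES) > 64 * 1024, "two full blocks do not fit");     // ~86.4 KB
static_assert(2 * INTERIOR_BYTES <= 64 * 1024,                "two interior-only blocks fit");   // ~25.6 KB
```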

The options which I am aware of are as follows:

  1. Load all 43 KB into shared memory (see the staging sketch after this list). My understanding is that this would prevent me from having two blocks resident on an SM at once, since 43 * 2 > 64.

  2. Leave the read-only data to (transparently, yes?) reside in the L2 cache, which I think has higher latency than the shared memory banks, but in principle the device should then be able to put two blocks on an SM at once, since 13 * 2 < 64.

  3. Somehow work with the L1 cache - for which I have no concrete ideas, since I have no idea how large it is or how to target it.
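
For concreteness, here is roughly what I mean by option 1 - a minimal sketch where one block cooperatively stages its interior and halo into shared memory (the index helpers are hypothetical stand-ins for whatever global-grid mapping the application uses):

```
// Option 1 sketch: stage interior + halo in shared memory (~43.2 KB/block).
// gridIndexOf/haloIndexOf are placeholders for the real index mapping.
__device__ size_t gridIndexOf(dim3 block, int localSite);  // placeholder
__device__ size_t haloIndexOf(dim3 block, int shellSite);  // placeholder

__global__ void stencilStep(const double* __restrict__ siteData, double* out)
{
    __shared__ double interior[64 * 25];  // 4x4x4 sites, 25 doubles each: ~12.8 KB
    __shared__ double halo[152 * 25];     // shell of the enclosing 6x6x6: ~30.4 KB

    const int t = threadIdx.x;            // 64 threads per block

    // Each thread loads its own site's 25 doubles...
    for (int d = 0; d < 25; ++d)
        interior[t * 25 + d] = siteData[gridIndexOf(blockIdx, t) * 25 + d];

    // ...and a share of the 152 halo sites (2-3 per thread).
    for (int s = t; s < 152; s += 64)
        for (int d = 0; d < 25; ++d)
            halo[s * 25 + d] = siteData[haloIndexOf(blockIdx, s) * 25 + d];

    __syncthreads();
    // ... stencil computation using interior[] and halo[] ...
}
```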

It is unfortunate that I max out at 64 threads per block, which is a small ratio of threads to FP64 cores (64 threads against the SM’s 32 FP64 cores, i.e. 2:1), though each thread does have a relatively large number of computations to perform, which helps. This is why having two blocks per SM sounds attractive. My real question is: would option 2 be optimal with respect to balancing occupancy and memory latency, would I be better off with option 1, is there an option I’m not thinking of, OR am I wrong about anything?
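
In case it matters for the answer: if the two-blocks-per-SM route is the right one, my understanding is that __launch_bounds__ is how to ask the compiler for it (a sketch; the kernel body is just a stand-in):

```
// Hint: 64 threads per block, and aim for at least 2 resident blocks per SM.
// The second argument caps register usage accordingly; it is a hint, not a
// guarantee that two blocks will actually be resident.
__global__ void __launch_bounds__(64, 2)
relaxationKernel(const double* __restrict__ siteData, double* out)
{
    __shared__ double interior[64 * 25];  // ~12.8 KB: two blocks fit in 64 KB
    // ... halo reads go through L2 / the read-only path instead ...
}
```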

Thank you for the input!

L1 cache is 24 KB with 128-byte cache lines

Do you need indexed access to these data? Otherwise, you can put everything in registers - there is 256 KB of register file per SM. You can even do a limited form of indexing via shuffle instructions and conditionals, i.e. (index ? a1 : a0).
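
A minimal sketch of what that conditional/shuffle indexing might look like (my example; CUDA 9+ *_sync shuffle intrinsics assumed):

```
// Keep per-thread data in plain register variables; "index" with a ternary,
// and read a neighboring lane's register with a warp shuffle.
__global__ void registerSelect(const double* in, double* out)
{
    const int lane = threadIdx.x & 31;
    double a0 = in[2 * threadIdx.x + 0];  // two doubles kept in registers
    double a1 = in[2 * threadIdx.x + 1];

    int index = lane & 1;                 // any runtime value works here
    double v = index ? a1 : a0;           // select without touching memory

    // Fetch a0 from the next lane over (within the warp):
    double neighbor = __shfl_down_sync(0xffffffffu, a0, 1);

    out[threadIdx.x] = v + neighbor;
}
```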

Thank you for the response! 24 KB is not much…

I’m a bit confused by what I’ve googled about registers - it seems the compiler handles allocation to registers itself? I need a bunch of doubles at each thread’s index, as well as the doubles from that thread’s neighbors in the block/grid (meaning that if the neighbor isn’t in the block, I need to load that grid site’s data into shared memory - or something). I don’t know much (anything) about registers - their usage doesn’t come up much in the CUDA toolkit documentation - so I’d greatly appreciate any further explanation, or a reference, to understand what you’re suggesting!

Regarding register arrays, see e.g. . The important thing there (as mentioned in the blog posting) is that the array indices used to access the ‘register array’ must be compile-time constants. So in a statement ‘double v = myArr[i]’, the index ‘i’ should be resolvable to a constant at compilation.
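
A minimal illustration of that rule (my sketch): with a fully unrolled loop, each access index is a compile-time constant and the array can live in registers; a genuinely runtime index would force a spill to local memory.

```
// Per-thread array that can stay in registers, because #pragma unroll
// turns every myArr[i] into an access with a compile-time-constant index.
__global__ void registerArray(const double* in, double* out)
{
    double myArr[25];                        // 25 doubles = 50 registers/thread

#pragma unroll
    for (int i = 0; i < 25; ++i)             // i is a constant in each copy
        myArr[i] = in[25 * threadIdx.x + i];

    double sum = 0.0;
#pragma unroll
    for (int i = 0; i < 25; ++i)
        sum += myArr[i] * myArr[i];

    out[threadIdx.x] = sum;
}
```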

I would load the read-only data via the ‘__ldg’ intrinsic (which makes use of the texture cache) into register (arrays). There is a lot of information regarding memory access best practices in the various ‘memory bootcamp’ presentations from 2015/2016 in the GTC archive (e.g. ). See also . For GP100, the size of the combined L1/texture cache seems to be 48 KB per SM according to table 2.2 at
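
A minimal sketch of the __ldg route (the halo index array is a hypothetical stand-in for the real neighbor mapping):

```
// Pull read-only halo values through the texture/L1 read-only path.
// Marking pointers const __restrict__ lets the compiler use that path on
// its own; the explicit __ldg() forces it.
__global__ void ldgHalo(const double* __restrict__ halo,
                        const int*    __restrict__ haloIdx,  // hypothetical
                        double* out)
{
    double h = __ldg(&halo[haloIdx[threadIdx.x]]);  // cached read-only load
    out[threadIdx.x] = h;                           // keep it in a register
}
```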

L1$=48 KB on SM 6.1, 24 KB on SM 6.0:

(from )

Volta will be nice for our kernels, which rely mostly on the texture path and don’t use much shared memory…

Quote from :
"Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1.

Too bad I just bought a P100 >.< Actually, not really: the V100 doesn’t look like it’d be worth the wait or the extra cost.

Texture cache = L1 in both cases, yes? This is the only read-only memory on the SM, I think? Either way, it isn’t a big help - once we’ve got 256-512 KB of shared or L1 memory per SM, I’ll be interested.