On GP100, I’m trying to figure out how best to distribute data between shared memory and possibly the L1/texture cache (if anyone can tell me how large this cache is on the Tesla P100, I’d appreciate it!). Shared memory on GP100 devices is 64 KB per SM.
Here are my constraints: each thread needs 0.2 KB of double-precision data (25 doubles) in shared memory, so a 4^3 thread block (the largest that can fit) needs about 13 KB in total. On top of that, each block needs read-only data from the six bordering “walls”, i.e. the outer shell of the 6x6x6 region enclosing the 4x4x4 block: 152 cells, or roughly 31 KB of read-only data per block. I am certain the required amount of memory cannot be reduced.
The options I am aware of are as follows:
1. Load all ~43 KB into shared memory. My understanding is that this prevents two blocks from being resident on an SM at once, since 43 × 2 > 64 (a single block still fits under the 48 KB per-block shared memory limit of compute capability 6.0).
2. Leave the read-only data in global memory, where it will (transparently, yes?) be cached in L2. I think L2 has somewhat higher latency than the shared memory banks, but in principle the device could then keep two blocks per SM, since 13 × 2 < 64.
3. Somehow make use of the L1 cache, for which I have no concrete ideas, since I don’t know how large it is or how to target it.
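For what it’s worth, one way to combine options 2 and 3: on Pascal, read-only global loads marked with `const __restrict__` (or read explicitly via `__ldg()`) can be served from the unified L1/texture cache rather than only from L2. A minimal sketch of what I have in mind — array names, sizes, and indexing are placeholders, not my actual stencil:

```cuda
// Sketch only (assumes compute capability 6.0; the stencil itself is elided).
__global__ void stencilStep(const double* __restrict__ halo,   // 152 cells/block
                            double*                    cells)  // 64*25 doubles/block
{
    // 0.2 KB (25 doubles) of per-thread state: 64 * 25 * 8 B = 12.8 KB per block
    __shared__ double tile[64][25];

    int t = threadIdx.x;  // 0..63, flattened 4x4x4 thread index
    for (int i = 0; i < 25; ++i)
        tile[t][i] = cells[(blockIdx.x * 64 + t) * 25 + i];
    __syncthreads();

    // Halo values stay in global memory; const __restrict__ (or an explicit
    // __ldg) lets Pascal route these reads through the unified L1/texture
    // cache instead of going to L2 on every access.
    double wall = __ldg(&halo[blockIdx.x * 152 + t]);  // placeholder indexing

    // ... real stencil work on tile[] and wall goes here ...
    cells[(blockIdx.x * 64 + t) * 25] = tile[t][0] + wall;  // placeholder write
}
```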
It is unfortunate that I max out at 64 threads per block, which is a small ratio of threads to FP64 cores (64 threads vs. 32 FP64 cores per SM, i.e. 2:1), though each thread does have a relatively large amount of computation to perform, which helps. This is why having two blocks per SM sounds attractive. My real question: would option 2 be optimal with respect to balancing occupancy and memory latency, would I be better off with option 1, is there an option I’m not thinking of, or am I wrong about anything?
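Rather than reasoning about residency by hand, I gather the occupancy API can report how many blocks actually fit per SM for a given shared-memory footprint. A host-side sketch (the kernel here is a hypothetical stand-in with only the 13 KB interior footprint; error checking omitted):

```cuda
#include <cstdio>

// Hypothetical kernel carrying only the ~12.8 KB interior shared footprint
__global__ void interiorOnlyKernel()
{
    __shared__ double tile[64][25];       // 64 * 25 * 8 B = 12.8 KB
    tile[threadIdx.x][0] = threadIdx.x;   // keep the array from being elided
}

int main()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, interiorOnlyKernel,
        64,   // threads per block (4x4x4)
        0);   // extra dynamic shared memory per block, in bytes
    printf("resident blocks per SM: %d\n", blocksPerSM);
    return 0;
}
```

Note this accounts for register pressure and other limits too, not just shared memory, so it should settle whether two (or more) blocks really co-reside.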
Thank you for the input!