Shared/cache memory management for HPC with large data required per thread

On GP100, I’m trying to figure out how best to distribute data to the shared memory (and possibly the L1/texture cache; if anyone can tell me how large this cache is on the Tesla P100, I’d appreciate it!). The shared memory size for GP100 devices is 64 KB.

Here are my constraints: I need 0.2 KB of double-precision data per thread in shared memory, so a 4^3 block (the largest that can fit) needs ~13 KB in total. On top of that, I need read-only data from the 6 “walls” bordering the block, i.e. the outer shell of the 6x6x6 block enclosing the 4x4x4 block, which is ~31 KB of read-only data per block. I am certain that the required amount of memory cannot be reduced.

The options which I am aware of are as follows:

  1. Load all ~43 KB into shared memory. My understanding is that this would prevent two blocks from being resident on an SM at once, since 43*2 > 64.

  2. Leave the read-only data to reside (transparently, yes?) in the L2 cache, which I think has slightly higher latency than the shared memory banks; in principle the device should then be able to put two blocks on an SM at once, since 13*2 < 64.

  3. Somehow work with the L1 cache - which I have no concrete ideas for, since I have no idea how large it is, or how to work with it.
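
To make the arithmetic behind options 1 and 2 concrete, here is a minimal sketch of the shared-memory budget. It assumes 25 doubles (200 bytes) per grid site as a stand-in for the "0.2 KB" above and 64 KB of shared memory per SM; all names are made up for illustration, and the results differ slightly from the rounded figures in the post.

```cuda
#include <cstddef>

// Assumption: "0.2 KB per thread" is taken as 25 doubles = 200 bytes per grid site.
constexpr std::size_t kBytesPerSite = 25 * sizeof(double);
// Shared memory per SM on GP100, as stated in the post.
constexpr std::size_t kSharedPerSM = 64 * 1024;

// Number of sites in a cubic block of edge n.
constexpr std::size_t cubeSites(std::size_t n) { return n * n * n; }

// Sites in the one-site-thick shell around an n^3 block: (n+2)^3 - n^3.
constexpr std::size_t shellSites(std::size_t n) {
    return cubeSites(n + 2) - cubeSites(n);
}

// Option 1: interior plus shell both live in shared memory.
constexpr std::size_t option1Bytes(std::size_t n) {
    return (cubeSites(n) + shellSites(n)) * kBytesPerSite;
}

// Option 2: only the interior lives in shared memory.
constexpr std::size_t option2Bytes(std::size_t n) {
    return cubeSites(n) * kBytesPerSite;
}

// Upper bound on resident blocks per SM from shared memory alone.
constexpr std::size_t blocksBySharedMem(std::size_t bytesPerBlock) {
    return kSharedPerSM / bytesPerBlock;
}
```

Under these assumptions a 4^3 block needs 43,200 bytes for option 1 (one block per SM) and 12,800 bytes for option 2. For option 2, shared memory alone would allow up to five blocks per SM, not just two, although register usage and other limits may lower that.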

It is unfortunate that I max out at 64 threads per block, which is a small occupancy ratio (2:1 threads:FP64 cores), though each thread does have a relatively large number of computations to perform, which helps. But this is why having two blocks per SM sounds attractive. My real question is: would this (option 2) be optimal with respect to balancing occupancy and memory latency, or would I be better off with option 1, or is there an option I’m not thinking of, OR am I wrong about anything?
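
One way to check the "two blocks per SM" expectation rather than guess is to ask the runtime. The sketch below uses `cudaOccupancyMaxActiveBlocksPerMultiprocessor` on a placeholder kernel; `stencilKernel` and `residentBlocksPerSM` are hypothetical names, and the real kernel's register usage would have to be substituted for the answer to mean anything.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; only its resource usage matters for the query.
__global__ void stencilKernel(double* data) {
    extern __shared__ double tile[];
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = data[i];
    __syncthreads();
    data[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// Returns how many blocks of `threadsPerBlock` threads the runtime expects to be
// resident on one SM when each block requests `dynamicSmemBytes` of dynamic
// shared memory, or -1 if the query fails (for example, no usable device).
int residentBlocksPerSM(int threadsPerBlock, std::size_t dynamicSmemBytes) {
    int blocks = 0;
    const cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks, stencilKernel, threadsPerBlock, dynamicSmemBytes);
    return err == cudaSuccess ? blocks : -1;
}
```

Calling `residentBlocksPerSM(64, 43200)` and `residentBlocksPerSM(64, 12800)` would compare options 1 and 2 for a 64-thread block, assuming 200 bytes per site.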

Thank you for the input!

L1 cache is 24 KB with 128-byte cache lines
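
The other per-SM limits can be read from the runtime. A minimal sketch follows; the function name is made up, and as far as I know `cudaDeviceProp` has no field for the L1/texture cache size, which is why that number has to come from documentation or reviews.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the per-SM resources the runtime reports for `device`.
// Returns false if the properties cannot be queried.
bool printSmResources(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return false;
    std::printf("%s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    std::printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    std::printf("shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    std::printf("32-bit registers per SM: %d\n", prop.regsPerMultiprocessor);
    std::printf("L2 cache:                %d bytes\n", prop.l2CacheSize);
    return true;
}
```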

Do you need indexed access to this data? If not, you can put everything in registers: there is 256 KB of register memory per SM. You can even do a limited form of indexing via shuffle instructions and conditionals, e.g. (index ? a1 : a0).
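
A minimal sketch of that idea, assuming a 1D arrangement where each lane's neighbor is the lane to its left in the same warp (the real 3D layout would need its own lane mapping). It uses `__shfl_sync`, which needs CUDA 9 or newer; older toolkits spell it `__shfl`. The names `pick`, `leftNeighbor` and `runLeftNeighbor` are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Picks one of two register-resident values without indexing an array.
__device__ double pick(double a0, double a1, int index) {
    return index ? a1 : a0;
}

// Each lane keeps two doubles in registers and reads the pair held by the lane
// to its left (wrapping inside the warp) through warp shuffles.
__global__ void leftNeighbor(const double* in, double* out) {
    const int lane = threadIdx.x & 31;
    const double a0 = in[2 * threadIdx.x];
    const double a1 = in[2 * threadIdx.x + 1];
    const int src = (lane + 31) & 31;  // lane - 1, modulo 32
    const double left0 = __shfl_sync(0xffffffffu, a0, src);
    const double left1 = __shfl_sync(0xffffffffu, a1, src);
    // Even lanes keep the neighbor's first value, odd lanes its second.
    out[threadIdx.x] = pick(left0, left1, lane & 1);
}

// Runs one warp on in[i] = i; returns an empty vector if any CUDA call fails.
std::vector<double> runLeftNeighbor() {
    std::vector<double> in(64), out(32);
    for (int i = 0; i < 64; ++i) in[i] = i;
    double *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, 64 * sizeof(double)) != cudaSuccess) return {};
    if (cudaMalloc(&dOut, 32 * sizeof(double)) != cudaSuccess) {
        cudaFree(dIn);
        return {};
    }
    cudaMemcpy(dIn, in.data(), 64 * sizeof(double), cudaMemcpyHostToDevice);
    leftNeighbor<<<1, 32>>>(dIn, dOut);
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(out.data(), dOut, 32 * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? out : std::vector<double>{};
}
```

Shuffles only reach lanes of the same warp, so neighbors that belong to another warp or another block still need shared or global memory.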

Thank you for the response! 24kb is not much…

I’m a bit confused by what I’ve googled about registers - it seems the compiler handles allocation to registers itself? I need a bunch of doubles at each thread’s index, as well as the doubles from that thread’s neighbors in the block/grid (meaning if the neighbor isn’t on the block, I need to load that gridsite’s data to shared memory - or something). I don’t know much (anything) about registers - I haven’t seen their usage come up much in the CUDA toolkit documentation - so I’d greatly appreciate any further explanation or a reference to understand what you’re suggesting!

Regarding register arrays, see e.g. https://devblogs.nvidia.com/parallelforall/fast-dynamic-indexing-private-arrays-cuda/ . The important point (as mentioned in the blog post) is that the array indices used to access the ‘register array’ should be compile-time constants. So in a statement like ‘double v = myArr[i]’, the index ‘i’ should be resolvable to a constant at compile time.
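
As a rough illustration of what "resolvable to a constant" means in practice, here is a minimal sketch with a small private array and fully unrolled loops. The array length, data layout and names are assumptions for the example; whether the array really ends up in registers is the compiler's decision, which can be checked by compiling with `nvcc -Xptxas -v` and looking for local-memory usage.

```cuda
#include <vector>
#include <cuda_runtime.h>

constexpr int kValuesPerSite = 8;  // hypothetical count, kept small for illustration

// `in` is laid out as in[k * n + site] so that consecutive threads read
// consecutive addresses for each k.
__global__ void sumPerSite(const double* in, double* out, int n) {
    const int site = blockIdx.x * blockDim.x + threadIdx.x;
    if (site >= n) return;

    double v[kValuesPerSite];  // private array; a candidate for registers

    // Fully unrolled, so every v[k] below has a compile-time-constant index.
    #pragma unroll
    for (int k = 0; k < kValuesPerSite; ++k) v[k] = in[k * n + site];

    double sum = 0.0;
    #pragma unroll
    for (int k = 0; k < kValuesPerSite; ++k) sum += v[k];
    out[site] = sum;
}

// Fills the input with ones for n sites and returns out[0],
// or -1.0 if any CUDA call fails.
double firstSiteSum(int n) {
    std::vector<double> host(kValuesPerSite * n, 1.0);
    double *dIn = nullptr, *dOut = nullptr;
    double result = -1.0;
    if (cudaMalloc(&dIn, host.size() * sizeof(double)) == cudaSuccess &&
        cudaMalloc(&dOut, n * sizeof(double)) == cudaSuccess) {
        cudaMemcpy(dIn, host.data(), host.size() * sizeof(double), cudaMemcpyHostToDevice);
        sumPerSite<<<(n + 63) / 64, 64>>>(dIn, dOut, n);
        if (cudaGetLastError() != cudaSuccess ||
            cudaMemcpy(&result, dOut, sizeof(double), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0;
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

An index computed at run time (for example `v[threadIdx.x % kValuesPerSite]`) would typically force the array into local memory instead.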

I would load the read-only data via the ‘__ldg’ instruction (which makes use of the texture cache) into register arrays. There is a lot of information on memory access best practices in the various ‘memory bootcamp’ presentations from 2015/2016 in the GTC archive (e.g. http://on-demand.gputechconf.com/gtc/2015/presentation/S5376-Tony-Scudiero.pdf ). See also http://www.acceleware.com/blog/constant-cache-vs-read-only-cache . For GP100, the size of the combined L1/texture cache seems to be 48 KB per SM according to table 2.2 at https://pure.tue.nl/ws/files/39759895/20161018_Li.pdf
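
A minimal sketch of a `__ldg` load, assuming the halo values sit in a plain global-memory array that the kernel never writes. The kernel and wrapper names are made up for illustration; `__ldg` itself requires compute capability 3.5 or newer.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Adds a read-only halo value to each interior value.
// `halo` is never written by the kernel, so it is loaded through the
// read-only data cache with __ldg.
__global__ void addHalo(const double* __restrict__ halo, double* interior, int n) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) interior[i] += __ldg(&halo[i]);
}

// Runs addHalo on n elements with interior = 1 and halo = 2;
// returns interior[0] afterwards, or -1.0 if any CUDA call fails.
double runAddHalo(int n) {
    std::vector<double> ones(n, 1.0), twos(n, 2.0);
    const std::size_t bytes = n * sizeof(double);
    double *dHalo = nullptr, *dInterior = nullptr;
    double result = -1.0;
    if (cudaMalloc(&dHalo, bytes) == cudaSuccess &&
        cudaMalloc(&dInterior, bytes) == cudaSuccess) {
        cudaMemcpy(dHalo, twos.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dInterior, ones.data(), bytes, cudaMemcpyHostToDevice);
        addHalo<<<(n + 63) / 64, 64>>>(dHalo, dInterior, n);
        if (cudaGetLastError() != cudaSuccess ||
            cudaMemcpy(&result, dInterior, sizeof(double), cudaMemcpyDeviceToHost) != cudaSuccess)
            result = -1.0;
    }
    cudaFree(dHalo);
    cudaFree(dInterior);
    return result;
}
```

Marking the pointer `const ... __restrict__` may let the compiler choose the read-only path on its own, but the explicit `__ldg` call makes the intent unambiguous.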

L1$ = 48 KB on SM 6.1, 24 KB on SM 6.0 (from http://www.hardware.fr/articles/948-2/gp104-7-2-milliards-transistors-16-nm.html )

Volta will be nice for our kernels, which rely mostly on the texture path and don’t use much shared memory …

Quote from https://devblogs.nvidia.com/parallelforall/inside-volta/ :
"Combining data cache and shared memory functionality into a single memory block provides the best overall performance for both types of memory accesses. The combined capacity is 128 KB/SM, more than 7 times larger than the GP100 data cache, and all of it is usable as a cache by programs that do not use shared memory. Texture units also use the cache. For example, if shared memory is configured to 64 KB, texture and load/store operations can use the remaining 64 KB of L1."

Too bad I just bought a P100 >.< Actually, not really: the V100 doesn’t look like it’d be worth waiting for or paying more for.

Texture cache = L1 in both cases, yes? I think this is the only read-only memory on the SM? Either way, it isn’t a big help; once we have 256-512 KB of shared or L1 memory per SM, I’ll be interested.