I’ve decided that L1 management is about three quarters of my job. I have some kernels that are pushing the limits of 128 kB of L1 supply, and I have a choice between stuffing the data into __shared__ arrays or letting it live in L1. In most cases, if I’m only reading the data, I try to condense it into a handful of arrays that will get pre-fetched into L1. The data may begin with a degree of scattering in global memory, so it gets compacted and ordered into a block-specific set of arrays. These are also in global memory but exclusive to one thread block, so __syncthreads() is an effective guard against race conditions.
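To make the pattern concrete, here is a minimal sketch of the compact-then-read approach. The kernel name, the slice size kItemsPerBlock, the index layout, and the toy "neighbor product" workload are all made up for illustration; only the structure (gather into a block-exclusive slice of global memory, __syncthreads(), then read-only reuse) reflects what I described above. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kItemsPerBlock = 1024;  // hypothetical slice size per thread block

// scattered: source data spread around global memory (never written here)
// gatherIdx: kItemsPerBlock indices per block saying which elements to pull
// scratch:   global staging buffer, one private slice per block
__global__ void compactThenRead(const float* __restrict__ scattered,
                                const int* __restrict__ gatherIdx,
                                float* scratch, float* out)
{
    float* mine = scratch + (size_t)blockIdx.x * kItemsPerBlock;
    const int* idx = gatherIdx + (size_t)blockIdx.x * kItemsPerBlock;

    // Phase 1: gather the scattered values into this block's contiguous slice.
    for (int i = threadIdx.x; i < kItemsPerBlock; i += blockDim.x)
        mine[i] = scattered[idx[i]];

    // Only this block touches `mine`, so a block-level barrier is enough to
    // make the writes above visible to every thread in the block.
    __syncthreads();

    // Phase 2: read-only reuse. Plain loads are used (no __ldg / read-only
    // path), since the slice was written earlier in this same kernel.
    float acc = 0.0f;
    for (int i = threadIdx.x; i < kItemsPerBlock; i += blockDim.x)
        acc += mine[i] * mine[(i + 1) % kItemsPerBlock];
    atomicAdd(&out[blockIdx.x], acc);
}

// Host reference for one block's result, used to check the kernel.
float referenceBlockSum(const std::vector<float>& scattered,
                        const std::vector<int>& gatherIdx, int block)
{
    const int* idx = gatherIdx.data() + (size_t)block * kItemsPerBlock;
    float acc = 0.0f;
    for (int i = 0; i < kItemsPerBlock; ++i)
        acc += scattered[idx[i]] * scattered[idx[(i + 1) % kItemsPerBlock]];
    return acc;
}

int main()
{
    const int nBlocks = 4, nSrc = 1 << 16;
    std::vector<float> hSrc(nSrc);
    std::vector<int> hIdx((size_t)nBlocks * kItemsPerBlock);
    // Small integer values keep the float sums exact regardless of add order.
    for (int k = 0; k < nSrc; ++k) hSrc[k] = (float)(k % 7);
    for (int b = 0; b < nBlocks; ++b)
        for (int i = 0; i < kItemsPerBlock; ++i)
            hIdx[(size_t)b * kItemsPerBlock + i] = (i * 37 + b * 101) % nSrc;

    float *dSrc, *dScratch, *dOut;
    int* dIdx;
    cudaMalloc(&dSrc, nSrc * sizeof(float));
    cudaMalloc(&dIdx, hIdx.size() * sizeof(int));
    cudaMalloc(&dScratch, hIdx.size() * sizeof(float));
    cudaMalloc(&dOut, nBlocks * sizeof(float));
    cudaMemcpy(dSrc, hSrc.data(), nSrc * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, hIdx.data(), hIdx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, nBlocks * sizeof(float));

    compactThenRead<<<nBlocks, 256>>>(dSrc, dIdx, dScratch, dOut);

    std::vector<float> hOut(nBlocks);
    cudaMemcpy(hOut.data(), dOut, nBlocks * sizeof(float), cudaMemcpyDeviceToHost);
    for (int b = 0; b < nBlocks; ++b)
        printf("block %d: gpu %.0f  cpu %.0f\n", b, hOut[b],
               referenceBlockSum(hSrc, hIdx, b));

    cudaFree(dSrc); cudaFree(dIdx); cudaFree(dScratch); cudaFree(dOut);
    return 0;
}
```

Whether the second phase actually hits in L1 depends on the slice fitting in whatever L1 capacity the SM ends up with, which is exactly what the rest of this post is asking about.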
But I recall that in the early days there was cudaFuncSetCacheConfig(), which altered the partition of the physical transistors between mostly shared or mostly L1. My question is: given the existence of that API, is there always a region of the L1 that is roped off to be __shared__ memory, even if it’s only 1/4 of the usual 128 kB? Put another way, if my kernel only uses 13 kB of __shared__ memory, will there be 115 kB of L1 available, or at most 96 kB, as in an automatically inferred “prefer L1” kernel configuration? When my kernels are brushing up against the 128 kB cache limit, I can probably modulate what goes into shared and what goes into global pre-fetched arrays, but I’m curious whether I need to worry about this at all.
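For reference, a rough sketch of how the relevant numbers could be inspected and how a carveout preference could be requested on a given device. probeKernel and carveoutPercent are hypothetical names of mine; the runtime calls (cudaGetDeviceProperties, cudaFuncGetAttributes, cudaFuncSetAttribute with cudaFuncAttributePreferredSharedMemoryCarveout) are from the CUDA runtime API. As I understand the documentation, the carveout value is a percentage of the maximum shared memory per SM and is only a hint to the driver. I am not aware of a runtime call that reports the L1 capacity that results, so this does not answer the question by itself; it only shows the knobs.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel with 13 kB of static __shared__ memory, as in the question.
__global__ void probeKernel(float* out)
{
    __shared__ float tile[13 * 1024 / sizeof(float)];
    tile[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Hypothetical helper: turn a shared-memory need in bytes into a carveout
// percentage of the per-SM maximum, clamped to 100.
int carveoutPercent(size_t wantedBytes, size_t maxSharedPerSM)
{
    if (maxSharedPerSM == 0) return 0;
    size_t pct = (wantedBytes * 100 + maxSharedPerSM - 1) / maxSharedPerSM;
    return pct > 100 ? 100 : (int)pct;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("compute capability:      %d.%d\n", prop.major, prop.minor);
    printf("max shared per SM:       %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("default shared per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("opt-in shared per block:  %zu bytes\n", prop.sharedMemPerBlockOptin);

    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, probeKernel);
    printf("kernel static shared:    %zu bytes\n", attr.sharedSizeBytes);

    // Ask for just enough shared memory for one block per SM. This is a
    // hint only; the driver may pick a different configuration.
    int pct = carveoutPercent(attr.sharedSizeBytes, prop.sharedMemPerMultiprocessor);
    cudaError_t err = cudaFuncSetAttribute(
        probeKernel, cudaFuncAttributePreferredSharedMemoryCarveout, pct);
    printf("requested carveout:      %d%% (%s)\n", pct, cudaGetErrorString(err));

    float* dOut;
    cudaMalloc(&dOut, 256 * sizeof(float));
    probeKernel<<<1, 256>>>(dOut);
    printf("launch status:           %s\n", cudaGetErrorString(cudaDeviceSynchronize()));
    cudaFree(dOut);
    return 0;
}
```

Asking for one block's worth of shared memory is a simplification; with several resident blocks per SM the request would need to scale accordingly. Confirming how much L1 is actually left would presumably take a profiler or a timing experiment over a working set of known size.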