I have a CUDA kernel with multiple threadblocks that need to access the same small piece of data (4 to 8 bytes) from global memory. It is used to decide whether a block should return early. I’d like to avoid fetching this small amount of data from global memory in every threadblock. Is there any way to force it to stay in the L2 cache? My GPU is a T4.
This will happen automatically for you.
The L2 is a device-wide resource. The first time any thread, anywhere, reads the “several bytes” from global memory, it will become resident in the L2 cache. Barring an eviction, all subsequent reads to that item will come from the L2 cache, not global device memory, regardless of thread/threadblock.
Caching in the L1 works similarly; however, the L1 is a per-SM resource, so a line cached in one SM’s L1 does not help threadblocks running on other SMs.
If you are concerned about evictions from L2, you can investigate L2 set-asides, but these are only available on cc8.0 and newer. The T4 is cc7.5, so they are not an option there.
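For reference, here is a rough sketch of what an L2 set-aside request could look like on a cc8.0+ device. The helper names `l2_persistence_supported` and `try_persist_in_l2` are made up for illustration, and the 1 MiB carve-out is an arbitrary choice; the runtime calls themselves (`cudaDeviceSetLimit` with `cudaLimitPersistingL2CacheSize`, and `cudaStreamSetAttribute` with an access policy window) are the documented ones. On a T4 the helper simply returns false.

```cuda
#include <algorithm>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: true if the current device reports L2 persistence support.
bool l2_persistence_supported() {
    int dev = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&dev) != cudaSuccess) return false;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    return prop.major >= 8 && prop.persistingL2CacheMaxSize > 0;
}

// Hypothetical helper: ask that accesses to [ptr, ptr + bytes) made by kernels
// launched into `stream` be treated as persisting in L2.
// Returns false if the device lacks support or any call fails.
bool try_persist_in_l2(cudaStream_t stream, void *ptr, size_t bytes) {
    if (!l2_persistence_supported()) return false;

    int dev = 0;
    cudaDeviceProp prop;
    cudaGetDevice(&dev);
    cudaGetDeviceProperties(&prop, dev);

    // Reserve part of L2 for persisting accesses (1 MiB cap is an assumption).
    size_t carve = std::min<size_t>(1u << 20, (size_t)prop.persistingL2CacheMaxSize);
    if (cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, carve) != cudaSuccess)
        return false;

    // Describe the window of global memory that should persist.
    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;  // treat every access in the window as persisting
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;

    return cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow,
                                  &attr) == cudaSuccess;
}
```

You would call `try_persist_in_l2(stream, d_flag, sizeof(int))` before launching the kernel into that same stream. For 4 to 8 bytes that are read by every block, this is unlikely to be worth the trouble compared with the ordinary L2 behavior described above.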
If the “several bytes” are read-only, you could consider putting them in __constant__ memory, or call them out via a const __restrict__ global pointer passed to the kernel (or even pass them by value as a kernel argument).
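A minimal sketch of those three read-only options, assuming the “several bytes” are a single int flag that tells every block to return early. All names here (`c_skip`, `k_constant`, `k_restrict`, `k_byvalue`, `count_written`) are made up for illustration, and error checking is mostly omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Option 1: hypothetical early-exit flag in constant memory.
__constant__ int c_skip;

// Stand-in for the real per-thread work: mark each element as written.
__device__ void do_work(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 1;
}

// Option 1: every thread reads the same constant address (a broadcast read).
__global__ void k_constant(int *out, int n) {
    if (c_skip) return;
    do_work(out, n);
}

// Option 2: flag stays in global memory, marked read-only for the compiler.
__global__ void k_restrict(const int *__restrict__ skip, int *out, int n) {
    if (*skip) return;
    do_work(out, n);
}

// Option 3: flag passed by value; only possible if the host knows it at launch.
__global__ void k_byvalue(int skip, int *out, int n) {
    if (skip) return;
    do_work(out, n);
}

// Host helper: runs one variant (0, 1, or 2) and returns how many elements
// were written, or -1 if the copy back to the host failed.
int count_written(int variant, int skip) {
    const int n = 1000, threads = 128, blocks = (n + threads - 1) / threads;
    int *d_out = nullptr, *d_skip = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_skip, sizeof(int));
    cudaMemset(d_out, 0, n * sizeof(int));
    cudaMemcpy(d_skip, &skip, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_skip, &skip, sizeof(int));

    if (variant == 0)      k_constant<<<blocks, threads>>>(d_out, n);
    else if (variant == 1) k_restrict<<<blocks, threads>>>(d_skip, d_out, n);
    else                   k_byvalue<<<blocks, threads>>>(skip, d_out, n);

    std::vector<int> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    cudaFree(d_skip);
    if (err != cudaSuccess) return -1;

    int count = 0;
    for (int v : h) count += v;
    return count;
}
```

Calling `count_written(v, 0)` for any variant should report all 1000 elements written, and `count_written(v, 1)` should report none. If the flag is produced on the device by an earlier kernel, options 1 and 3 need an extra copy, so the const __restrict__ pointer is probably the most natural fit in that case.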
Thank you for the quick reply, Robert!