Any way to force several bytes to be in L1/L2 cache so that I can use them across multiple threadblocks within one kernel?

I have a CUDA kernel with multiple threadblocks that all need to access the same small piece of data (4 - 8 bytes) from global memory. It is used as a condition for an early return in a block. I’d like to avoid fetching this small amount of data from global memory in every threadblock. Is there any way to force it to be in L2 cache? My GPU is a T4.

This will happen automatically for you.

The L2 is a device-wide resource. The first time any thread, anywhere, reads the “several bytes” from global memory, that data will become resident in the L2 cache. Barring an eviction, all subsequent reads of that item will come from the L2 cache, not global device memory, regardless of thread/threadblock.

The situation for the L1 is similar; however, the L1 is a per-SM resource, so each SM populates its own L1 the first time one of its threads reads the data.

If you are concerned about evictions from L2, you can investigate L2 set-asides, but these are only available on compute capability 8.0 and newer (the T4 is cc 7.5)…
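
For reference, here is a minimal sketch of what an L2 set-aside could look like on a device that supports it, assuming CUDA 11.0 or newer. The kernel name early_exit_kernel, the helper try_pin_in_l2, and the 1 MiB set-aside size are hypothetical choices for illustration; error checking is mostly omitted. On a device without the feature the helper is expected to return false and the kernel simply runs with ordinary caching.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every block reads the same 4-byte flag and returns early if it is set.
__global__ void early_exit_kernel(const int *flag, int *out)
{
    if (*flag != 0) return;
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

// Hypothetical helper: asks for accesses to [ptr, ptr + bytes) issued from
// `stream` to be treated as persisting in L2. Returns false if unsupported.
bool try_pin_in_l2(cudaStream_t stream, void *ptr, size_t bytes)
{
    int dev = 0;
    cudaGetDevice(&dev);
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) return false;
    if (prop.persistingL2CacheMaxSize <= 0) return false;   // no set-aside support

    // Reserve part of the L2 for persisting accesses (1 MiB cap is an arbitrary assumption).
    size_t set_aside = std::min<size_t>(prop.persistingL2CacheMaxSize, 1 << 20);
    if (cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, set_aside) != cudaSuccess)
        return false;

    cudaStreamAttrValue attr = {};
    attr.accessPolicyWindow.base_ptr  = ptr;
    attr.accessPolicyWindow.num_bytes = bytes;
    attr.accessPolicyWindow.hitRatio  = 1.0f;                          // whole window persists
    attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
    attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
    return cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr)
           == cudaSuccess;
}

// Runs the kernel with flag == 0 and returns how many outputs were not written.
int run_l2_demo()
{
    const int blocks = 8, threads = 32, n = blocks * threads;
    int h_flag = 0, *d_flag = nullptr, *d_out = nullptr;
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMalloc(&d_flag, sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, n * sizeof(int));

    bool pinned = try_pin_in_l2(stream, d_flag, sizeof(int));
    printf("persisting L2 window %s\n", pinned ? "set" : "not available");

    early_exit_kernel<<<blocks, threads, 0, stream>>>(d_flag, d_out);
    cudaStreamSynchronize(stream);

    std::vector<int> h_out(n, 0);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int v : h_out) if (v != 1) ++bad;

    cudaFree(d_flag);
    cudaFree(d_out);
    cudaStreamDestroy(stream);
    return bad;
}

int main()
{
    printf("mismatches: %d\n", run_l2_demo());
    return 0;
}
```

The access policy window is attached to a stream, so it only affects kernels launched into that stream.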

If the “several bytes” are read-only, you could consider putting them in __constant__ memory, or call them out via a const __restrict__ global pointer passed to the kernel (or even as a kernel argument).
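
The three read-only options above could look roughly like the sketch below. All names (c_flag, k_constant, k_restrict, k_argument, run_demo) are hypothetical, and error checking is omitted for brevity. Note that the kernel-argument variant only applies when the host already knows the value at launch time.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Option 1: the flag lives in __constant__ memory.
__constant__ int c_flag;

__global__ void k_constant(int *out)
{
    if (c_flag != 0) return;                       // early return for the whole block
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1;
}

// Option 2: the flag stays in global memory, marked read-only via const __restrict__.
__global__ void k_restrict(const int * __restrict__ flag, int * __restrict__ out)
{
    if (*flag != 0) return;
    out[blockIdx.x * blockDim.x + threadIdx.x] = 2;
}

// Option 3: the flag is passed by value as a kernel argument.
__global__ void k_argument(int flag, int *out)
{
    if (flag != 0) return;
    out[blockIdx.x * blockDim.x + threadIdx.x] = 3;
}

// Copies the output back and counts elements that differ from `expected`.
static int count_mismatches(const int *d_out, int n, int expected)
{
    std::vector<int> h(n, 0);
    cudaMemcpy(h.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    int bad = 0;
    for (int v : h) if (v != expected) ++bad;
    return bad;
}

// Runs all three variants with the given flag value and returns the total
// number of output elements that do not match what the flag implies.
int run_demo(int h_flag)
{
    const int blocks = 8, threads = 32, n = blocks * threads;
    int *d_out = nullptr, *d_flag = nullptr;
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemcpy(d_flag, &h_flag, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_flag, &h_flag, sizeof(int));

    int bad = 0;
    cudaMemset(d_out, 0, n * sizeof(int));
    k_constant<<<blocks, threads>>>(d_out);
    bad += count_mismatches(d_out, n, h_flag ? 0 : 1);

    cudaMemset(d_out, 0, n * sizeof(int));
    k_restrict<<<blocks, threads>>>(d_flag, d_out);
    bad += count_mismatches(d_out, n, h_flag ? 0 : 2);

    cudaMemset(d_out, 0, n * sizeof(int));
    k_argument<<<blocks, threads>>>(h_flag, d_out);
    bad += count_mismatches(d_out, n, h_flag ? 0 : 3);

    cudaFree(d_out);
    cudaFree(d_flag);
    return bad;
}

int main()
{
    printf("mismatches with flag = 0: %d\n", run_demo(0));
    printf("mismatches with flag = 1: %d\n", run_demo(1));
    return 0;
}
```

If the flag is produced on the device by an earlier kernel, only the global-pointer variant fits directly; the other two need the value on the host first.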

Thank you for the quick reply, Robert!