L1 Cache, L2 Cache and Shared memory in Fermi

Couple of question on these issues:-

(1) Are L1 caches flushed out after a kernel finishes its execution or is it copied back to L2 without flushing the L1 cache ? What kind of consistency is available for the data written into L1 cache ?

(2) From a CUDA program, can we load/store some data structure from/to global memory by by-passing L1/L2 caching during runtime ? This is significantly important for many applications that contains few large data structure (of size 100MB or more), and these data structure does not have any locality, and can hurt the existing L2 cached contents. PTX 2.3 manual page 110 has some cache streaming instruction, but it might be beneficial to expose cache operator using C intrinsic functions.

int global_mem[100];

__global__ void kernel()


   __shared__ int mem[10];

for (int i=threadIdx.x; i < 100; i += blockDim.x) 

       mem[i] = load_streaming ( global_mem[i]);


If a CUDA platform does not have this streaming function implemented than the load become like a regular load. This can be easily ignored by the compiler when streaming is not supported.

(3) Is there any possibility to have persistent shared memory or L1 cache content ? For example, assume that a given application can launch and execute a kernels with 32 blocks concurrently. Is it possible to ensure that during the next launch of the same kernel (again with 32 blocks of threads), they are executed on the same multiprocessors (block mapped to the same processor), and the shared memory/L1 cache contents are preserved ? This can potentially reduce significant global memory accesses and also reduce kernel launch overhead ?


Programming guide 3.2, mentions page-locked memory can be allocated with the flag cudaHostAllocWriteCombined, therefore avoid using of L1 and L2 cache.

G4.1: The cache behavior (e.g. whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.

Looks like it’s possible… i’m still digging…

apparently I didn’t get your question… why don’t you just edit the ptx file generated to change the cache operators?

for question 3, unless you do a lot of launches and do very little thing in each launch, filling up the cache would not take up a significant part of the kernel execution time, would it?

For question one, you can create a test kernel which makes use of global memory at fixed address only once and then returns. launch the kernel many times and time the total execution time. Though you may want to launch a relatively time-consuming kernel first before you start launching the test kernel, because the first test kernel execution may have finished before the next launch command arrives at the GPU - so you use a long kernel before the test kernels to queue up the test kernel launches at the GPU. In one version of the kernel, you can do all the loads with .cs, and in another do all the loads with .ca, and compare the total execution time.

The only guarantee that CUDA makes by default is that writes to global memory are flushed by the end of kernel execution. However, you can modify that in several ways (some already mentioned):

  • Use inline PTX to directly apply modifiers that specify the load or store instruction should bypass L1.

  • Use nvcc command line option “-Xptxas -dlcm=cg” to specify that all global reads and writes in the kernel should bypass L1. This is not as helpful if your goal is to stream some reads and not others.

  • You can force the L1 cache to flush back up the memory hierarchy using the appropriate _threadfence*() function. __threadfence_block() requires that all previous writes have been flushed to shared memory and/or the L1. __threadfence() additionally forces global memory writes to be visible to all blocks, and so must flush writes up to the L2. Finally, __threadfence_system() flushes up to the host level for mapped memory.

Agreed. In CUDA 4.0, inline PTX will be officially supported (it was available before, but undocumented), and so it will be possible to embed the modified read instruction in your C code. A C-level modifier would be nice, but it looks like the CUDA C language designers are reluctant to introduce additional deviations from the C language. Maybe someday…

Short answer: no.

Long answer: The cache content at the start of a kernel and mapping of blocks to multiprocessors is not defined. The CUDA driver is free to update the display and run kernels from other contexts between kernel calls from your program. Any of those things could modify the cache contents. Similarly, on some devices CUDA allows multiple kernels to be active at the same time, which would prevent a fixed block to multiprocessor mapping.

As mentioned below, if the lack of persistent caching across kernel calls is a bottleneck for your code, you should try to do more calculations per memory read in each kernel call.

Seibert, hyqneuron,

Thanks for the replies.

I especially appreciated the importance of threadfence() function call that ensure certain coherency between different blocks.

Exposing some of the architecture specific instruction using PTX is a good first attempt. I have’nt used inline PTX before. Are there any examples in the CUDA release related to the inline PTX features ?

I strongly believe persistent caching/shared memory will give higher benefits for program that needs exclusive access to the GPU card. This exclusive configuration allows only one context to run on the GPU (no display driver or other contexts). It is highly relevant in applications running exclusively on the GPU card and expects that the L1 and L2 cache contents are preserved between successive kernel launch. Also the kernel thread scheduler ensures that the same set of block are mapped back to the same multiprocessor (thread locality is preserved across launches). A broad category of simulation mechanism fall under this category and benefits from persistent L1 and shared memory allocation.

The CUDA 4.0 release candidate comes with a PDF in the toolkit doc/ directory called Using_Inline_PTX_Assembly_In_CUDA.pdf.

Given the general shift in CUDA development has been better support for non-exclusive access to the device, I don’t think this will be a priority for NVIDIA. (Employees do read the forums though.)

Keep in mind, though, that you can repopulate the shared memory from global memory in ~500 shader clock cycles. (Depends on what the latency is these days with GDDR5 and all the cache levels.) If the rest of your kernel runtime is short enough for that to be a significant contribution, then I think the kernel launch overhead (which is more like 10000 shader clock cycles) will be a bigger bottleneck.