L1 Cache, L2 Cache and Shared memory in Fermi

A couple of questions on these issues:

(1) Are the L1 caches flushed after a kernel finishes execution, or is their content written back to L2 without invalidating L1? What kind of consistency is provided for data written into the L1 cache?

(2) From a CUDA program, can we load/store a data structure from/to global memory while bypassing the L1/L2 caches at runtime? This is significant for applications that contain a few large data structures (100 MB or more) that have no locality and can evict useful L2-cached content. The PTX 2.3 manual (page 110) lists some cache-streaming instructions, but it would be beneficial to expose the cache operators as C intrinsic functions, for example:

__device__ int global_mem[100];

__global__ void kernel()
{
    __shared__ int mem[100];   // sized to match global_mem

    // Proposed intrinsic: stream each element past the caches on the way in.
    for (int i = threadIdx.x; i < 100; i += blockDim.x)
        mem[i] = load_streaming(&global_mem[i]);
}

If a CUDA platform does not implement this streaming hint, the load simply becomes a regular load; the compiler can easily ignore the hint when streaming is not supported.

(3) Is there any possibility of having persistent shared memory or L1 cache contents? For example, assume a given application launches a kernel with 32 blocks that execute concurrently. Is it possible to ensure that during the next launch of the same kernel (again with 32 blocks of threads), the blocks are mapped to the same multiprocessors and the shared memory/L1 cache contents are preserved? This could significantly reduce global memory accesses and also reduce the effective kernel launch overhead.

Thanks.

Programming Guide 3.2, section 3.2.5.2, mentions that page-locked memory can be allocated with the flag cudaHostAllocWriteCombined, thereby avoiding use of the host's L1 and L2 caches.
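For example, a minimal sketch of such an allocation (the buffer size and usage here are placeholders):

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    float *h_buf = NULL;
    size_t bytes = 1 << 20;   /* 1 MB placeholder buffer */

    /* Page-locked, write-combined allocation: fast for the host to write
       and for the device to read, but slow for the host to read back. */
    cudaError_t err = cudaHostAlloc((void **)&h_buf, bytes,
                                    cudaHostAllocWriteCombined);
    if (err != cudaSuccess) {
        printf("cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    /* ... fill h_buf on the host, then cudaMemcpy it to the device ... */

    cudaFreeHost(h_buf);
    return 0;
}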

G4.1: The cache behavior (e.g. whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction.

Looks like it’s possible… I’m still digging…

Apparently I didn’t get your question… why don’t you just edit the generated PTX file to change the cache operators?

For question 3: unless you do a lot of launches and very little work in each launch, filling up the cache would not take a significant part of the kernel execution time, would it?

For question 1, you can create a test kernel that reads global memory at a fixed address only once and then returns. Launch that kernel many times and time the total execution. You may want to launch a relatively time-consuming kernel first, because the first test kernel may finish before the next launch command arrives at the GPU; the long kernel lets the test-kernel launches queue up on the GPU. In one version of the kernel do all the loads with .cs, in another do them with .ca, and compare the total execution times.
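For what it's worth, a rough sketch of that timing setup using CUDA events (the kernel bodies, grid sizes, and iteration counts below are placeholders):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void long_kernel(float *out)
{
    // Enough arithmetic to keep the GPU busy while the host
    // queues up the test-kernel launches behind it.
    float x = 0.0f;
    for (int i = 0; i < 1000000; ++i)
        x += 1e-7f * i;
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

__global__ void test_kernel(const int *data, int *sink)
{
    // A single read from a fixed global address, then return.
    // The .ca vs .cs variants would change this load via inline PTX.
    if (threadIdx.x == 0)
        *sink = *data;
}

int main(void)
{
    int *d_data, *d_sink;
    float *d_work;
    cudaMalloc(&d_data, sizeof(int));
    cudaMalloc(&d_sink, sizeof(int));
    cudaMalloc(&d_work, 64 * 256 * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    long_kernel<<<64, 256>>>(d_work);       // runs while test launches queue up

    cudaEventRecord(start, 0);              // fires when long_kernel finishes
    for (int i = 0; i < 1000; ++i)
        test_kernel<<<1, 32>>>(d_data, d_sink);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("1000 test-kernel launches: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    cudaFree(d_sink);
    cudaFree(d_work);
    return 0;
}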

The only guarantee that CUDA makes by default is that writes to global memory are flushed by the end of kernel execution. However, you can modify that in several ways (some already mentioned):

  • Use inline PTX to directly apply modifiers that specify the load or store instruction should bypass L1.

  • Use nvcc command line option “-Xptxas -dlcm=cg” to specify that all global reads and writes in the kernel should bypass L1. This is not as helpful if your goal is to stream some reads and not others.

  • You can force the L1 cache to flush back up the memory hierarchy using the appropriate __threadfence*() function. __threadfence_block() requires that all previous writes have been flushed to shared memory and/or the L1. __threadfence() additionally forces global memory writes to be visible to all blocks, and so must flush writes up to the L2. Finally, __threadfence_system() flushes up to the host level for mapped memory. (A sketch of the typical usage follows this list.)
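To illustrate the last point, here is a minimal sketch (with assumed kernel and variable names, and blockDim.x assumed to be a power of two) of the classic pattern in which the last block to finish reads partial results written by all the other blocks; the __threadfence() guarantees each block's partial result is visible before the counter announcing it is incremented:

__device__ unsigned int blocks_done = 0;

// Launch as: block_sum<<<grid, block, block * sizeof(float)>>>(in, partial, total, n);
__global__ void block_sum(const float *in, float *partial, float *total, int n)
{
    extern __shared__ float scratch[];   // blockDim.x floats
    __shared__ bool is_last_block;

    // Each thread accumulates a strided slice of the input.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    // Block-level tree reduction in shared memory.
    scratch[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            scratch[threadIdx.x] += scratch[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = scratch[0];

        // Make the write to partial[] visible to all other blocks
        // *before* incrementing the counter that announces it.
        __threadfence();

        unsigned int done = atomicAdd(&blocks_done, 1);
        is_last_block = (done == gridDim.x - 1);
    }
    __syncthreads();

    // The last block to finish adds up all partial results.
    if (is_last_block && threadIdx.x == 0) {
        float t = 0.0f;
        for (int b = 0; b < gridDim.x; ++b)
            t += partial[b];
        *total = t;
        blocks_done = 0;   // reset for subsequent launches
    }
}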

Agreed. In CUDA 4.0, inline PTX will be officially supported (it was available before, but undocumented), and so it will be possible to embed the modified read instruction in your C code. A C-level modifier would be nice, but it looks like the CUDA C language designers are reluctant to introduce additional deviations from the C language. Maybe someday…
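As an example, a hedged sketch of wrapping a streaming (evict-first) read in a device function: the name load_streaming_u32 and the fallback path are my own, only the ld.global.cs instruction itself comes from the PTX ISA, and the "l" constraint assumes 64-bit pointers.

__device__ __forceinline__ unsigned int load_streaming_u32(const unsigned int *ptr)
{
    unsigned int value;
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 200
    // .cs ("cache streaming") marks the line evict-first, so data that is
    // expected to be touched only once does not displace lines with reuse.
    asm volatile ("ld.global.cs.u32 %0, [%1];"
                  : "=r"(value)
                  : "l"(ptr));
#else
    value = *ptr;   // plain load where the cache operator is unavailable
#endif
    return value;
}

A wrapper like this approximates the load_streaming intrinsic proposed in the original question: where the cache operator is not available, it degrades to a regular load.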

Short answer: no.

Long answer: the cache contents at the start of a kernel and the mapping of blocks to multiprocessors are not defined. The CUDA driver is free to update the display and run kernels from other contexts between kernel calls from your program, and any of those things could modify the cache contents. Similarly, on some devices CUDA allows multiple kernels to be active at the same time, which would prevent a fixed block-to-multiprocessor mapping.

As mentioned above, if the lack of persistent caching across kernel calls is a bottleneck for your code, you should try to do more calculations per memory read in each kernel call.

Seibert, hyqneuron,

Thanks for the replies.

I especially appreciated the point about the __threadfence() functions, which ensure a certain coherency between different blocks.

Exposing some of the architecture-specific instructions through PTX is a good first step. I haven't used inline PTX before. Are there any examples in the CUDA release related to the inline PTX features?

I strongly believe persistent caching/shared memory would give the greatest benefit to programs that need exclusive access to the GPU card. An exclusive configuration allows only one context to run on the GPU (no display driver or other contexts). It is highly relevant for applications that run exclusively on the GPU and expect the L1 and L2 cache contents to be preserved between successive kernel launches, with the thread block scheduler ensuring that the same set of blocks is mapped back to the same multiprocessors (thread locality preserved across launches). A broad category of simulation codes falls under this description and would benefit from persistent L1 and shared memory allocation.

The CUDA 4.0 release candidate comes with a PDF in the toolkit doc/ directory called Using_Inline_PTX_Assembly_In_CUDA.pdf.

Given that the general shift in CUDA development has been toward better support for non-exclusive access to the device, I don’t think this will be a priority for NVIDIA. (Employees do read the forums, though.)

Keep in mind, though, that you can repopulate the shared memory from global memory in roughly 500 shader clock cycles (depending on the latency these days with GDDR5 and all the cache levels). If the rest of your kernel runtime is short enough for that to be a significant contribution, then the kernel launch overhead (which is more like 10,000 shader clock cycles) will be a bigger bottleneck.