disable L1 cache on Fermi GPU running OpenCL

Hi everyone,

According to the CUDA programming guide, the L1 cache on a Fermi GPU (I'm using a GTX 580) can be disabled by passing the nvcc compiler flag -Xptxas -dlcm=cg. Compiling CUDA code with this flag works fine, but can I do the same thing in OpenCL? If so, how? I'm not sure whether the flag is only available for CUDA compilation at present.
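For reference, this is what the CUDA-side invocation looks like (the source filename is just a placeholder):

```
nvcc -Xptxas -dlcm=cg kernel.cu
```

The -dlcm=cg option tells ptxas to compile global loads as "cache global", i.e. cached in L2 only, bypassing L1.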

Since I am a newbie to both OpenCL and Fermi, this question might be silly to you. Anyway, any suggestion or clarification is welcome!


This would be really useful to check that performance doesn't drop through the floor on non-Fermi devices, which have no cache :-)


If there is such a compiler extension, it is not documented in the “OpenCL Compiler Extensions” text file, much like the -cl-nv-arch compiler flag.
Let me know if you find it :).

David, there are more reasons to disable the L1 cache on Fermi. The L1 cache supports reads in 128-bit alignment only. If your kernels are 32- or 64-bit aligned, you might suffer from coalescing issues.

I’m afraid this is a misinterpretation. The L1 always performs naturally aligned 128-byte (not bit!) read accesses to L2. But it supports fairly arbitrary accesses from the compute cores (the same bank-conflict rules as for shared memory apply, because it’s the same piece of hardware). In contrast to L2, which has a cache line size of 32 B, L1 has a cache line of 128 B, and every read miss will fetch a whole 128 B line, if necessary from global memory. Therefore bypassing L1 is beneficial for memory accesses that are scattered or have a long stride, because each access will then fetch only 32 B from device memory (allocated in L2) instead of 128 B (allocated in L1 and most probably in L2 as well).

See also CUDA C programming Guide 3.2, G.4.2.

Got it, thanks for the clarification.

Could you clarify a little bit more for me?

I understand that when memory accesses are scattered and L1 is on, a 128 B block of memory will be fetched into L1. But where this data is fetched from depends on whether it is already cached in L2 or not, right? So in both cases, what are the penalties for having L1 on? Is it just the overhead of fetching data that is unlikely to be reused in the future? Anything else?

By the way, sorry for a question that is probably irrelevant to the topic. Do you know what cache replacement algorithm they use for L1 and L2 in Fermi? Are they K-way set-associative caches? Is it possible to find out K?


The L1 cache line size is 128 bytes, so an L1 cache miss that can’t get the full 128 bytes from L2 will fetch 128 bytes from global memory. An L2 cache miss has the same access patterns as GT200 coalescing (32, 64, or 128 bytes). So if you get a lot of L1 misses that also miss at the L2 level, you could be better off turning off L1.

I don’t know of a flag that does that, but rumor has it that marking the variable as volatile will cause it to skip L1 and stay at the L2 level (at least under CUDA).

I couldn’t find it. Anyone? I would like to try this too.


There is no such compiler extension at the moment. The only thing you can do is mark the pointer as volatile. I did some tests and it does seem to work.
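For anyone who wants to try the volatile workaround, a minimal sketch of what it looks like in OpenCL C (the kernel and argument names are made up, and whether the compiler actually bypasses L1 for volatile pointers is, as noted above, undocumented behavior):

```c
// OpenCL C kernel: marking the __global pointer volatile is the
// (undocumented) hint that its reads should bypass L1 on Fermi.
__kernel void copy_bypass_l1(__global volatile const float *src,
                             __global float *dst)
{
    size_t gid = get_global_id(0);
    dst[gid] = src[gid]; // each read should go through L2 only, if the rumor holds
}
```

Comparing the runtime of a strided-read kernel with and without the volatile qualifier should show whether it has any effect on your driver version.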