Disabling L1 and L2 caches on Fermi architectures

In relation to some research, I am trying to disable caches on my Fermi card (GTX470).

So far, I have succeeded in disabling the L1 cache entirely by passing the following compiler flag to nvcc:
-Xptxas -dlcm=cg
This decreases performance, so I assume the L1 cache really is disabled. However, there is no manual or help entry for the ‘dlcm’ option beyond the fact that it accepts the values ca (enable L1) and cg (disable L1).
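To make the setup concrete, here is a minimal sketch of how the flag is applied; the kernel and file names are hypothetical, and the flag only changes which flavor of global load instructions ptxas emits, not any hardware state:

```cuda
// Hedged sketch - compile the same source two ways and compare, e.g.:
//   nvcc -arch=sm_20 -Xptxas -dlcm=cg vecadd.cu -o vecadd_nol1   // bypass L1, cache in L2 only
//   nvcc -arch=sm_20 -Xptxas -dlcm=ca vecadd.cu -o vecadd_l1     // default: cache in L1 and L2
__global__ void vecadd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];   // these global loads/stores are what -dlcm affects
}
```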

Secondly, I tried to disable the L2 cache, but so far without success. Is there any information available on this topic? A solution in any of the following forms would work for me:

  • As a compiler flag (similar to disabling L1 cache)
  • As a function in the CUDA (host/kernel) code
  • As a workaround (tricking the compiler not to cache)

Input is welcome!

As far as I know, you cannot disable L2. In various Fermi architecture presentations it has been stated that every memory transaction goes through L2 (e.g., this one: https://hub.vscse.org/resources/287/downloa…overview_DK.pdf, slide 12 - there is a video too: http://groups.google.com/group/vscse-many-…e-presentations ).

You might check the PTX manual and see if there is even an instruction to read from DRAM and not from L2.

Thanks, that pointed me in the correct direction. From the PTX manual we have:

(there is a more detailed description in the manual on page 109 (ptx_isa_20)).

Luckily, these options map directly to the -dlcm option of ptxas. When I try the ‘cs’ option, performance decreases again! From the description, I take this to mean that both caches are bypassed:
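For completeness, the load cache operators the excerpt refers to can be sketched as follows; this summary is reconstructed from memory of the PTX ISA 2.0 manual, so verify it against the table on page 109:

```ptx
// PTX ISA 2.0 cache operators on global loads (summarized from memory):
ld.global.ca.f32 %f1, [%r1];  // .ca: cache at all levels (L1 and L2) - the default
ld.global.cg.f32 %f1, [%r1];  // .cg: cache globally - L2 only, bypass L1
ld.global.cs.f32 %f1, [%r1];  // .cs: cache streaming - evict-first policy for data used once
ld.global.cv.f32 %f1, [%r1];  // .cv: consider cached lines stale - fetch from memory again
```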

I’m using your topic as I’m currently doing the same analysis.

I have a small kernel that reads numbers from 7 different arrays, does simple arithmetic with them, and fills 12 other arrays in global memory.
When I disable the L1 cache, I get 20% better performance. I’m really struggling to interpret this correctly. Could someone offer some advice?

However, if I disable both the L1 and L2 caches, performance drops by ~10%.

I’m using a GTX 465, and all arrays are declared with __restrict__ to help the compiler do its job.

Any help would be really appreciated.

Maybe the L1 cache miss rate is too high, so a lot of time is wasted on cache checks.

I’m not quite sure about the latency of cache operations. But another point: when L1 is disabled, the hardware issues 32-byte memory transactions instead of the default 128-byte cache lines. That saves some global memory bandwidth if you do not read/write contiguous regions.
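That bandwidth argument can be made concrete with a quick back-of-the-envelope calculation; the numbers below are illustrative, assuming each thread reads a single 4-byte float from a scattered (non-contiguous) location:

```python
# Illustrative: fraction of fetched bytes actually used when each thread
# reads one 4-byte float from an otherwise untouched memory segment.
def useful_fraction(segment_bytes, bytes_used=4):
    return bytes_used / segment_bytes

# 128-byte L1 cache lines vs. 32-byte transactions with L1 disabled
print(useful_fraction(128))  # 0.03125 -> ~3% of each 128-byte line is useful
print(useful_fraction(32))   # 0.125   -> ~12.5% of each 32-byte segment is useful
```

So for sufficiently scattered accesses, bypassing L1 can waste roughly 4x less fetched bandwidth per load, which is consistent with the speedup reported above.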

The nvcc -Xptxas -dlcm flag appears to be applied by the CUDA compiler to the kernel being compiled. But the caches are global.
When I start another kernel (compiled without -Xptxas), will the caches all revert to normal (i.e., their defaults)?
Thanks
Bill

-Xptxas -dlcm does not cause machine state to be changed. It changes the code generation, so a different flavor of load instructions for accessing global memory is generated. Only the global load instructions in a given compilation unit are affected. One can change the load behavior for individual global memory accesses by generating the desired load instruction flavor via inline PTX.
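A minimal sketch of the inline-PTX approach njuffa describes, assuming float loads and a 64-bit address ("l" constraint); the helper names are invented for illustration:

```cuda
// Hedged sketch: force individual global loads to use a specific cache
// operator, independent of the -dlcm flag used for the compilation unit.
__device__ float load_cg(const float *p)   // bypass L1, cache in L2
{
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__device__ float load_cs(const float *p)   // streaming, evict-first
{
    float v;
    asm volatile("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}
```

All other loads in the kernel keep whatever behavior the compilation-wide -dlcm setting selected.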

Dear njuffa,
Thank you very much for rapid and helpful reply.
Bill