Disabling L1 cache in Visual Studio

Hi,
I’m using Visual Studio 2019 and CUDA to compile and build my app. In my code, I need to read from memory in a misaligned and uncoalesced way (so there is no specific access pattern). To improve performance, I need to disable the L1 cache. I know that with the nvcc compiler this can be done with “-Xptxas -dlcm=cg”. However, I want to do it from Visual Studio (I have many linked .cu files and Visual Studio compiles them more easily than the command prompt does). I have already put this option in “Properties” > “Linker” > “Command Line”, but I get the following error:
[screenshot of the error message]

If I put it in “Properties” > “CUDA Linker” > “Command Line”, nothing changes when I run nvprof on the app. So please let me know how I can disable the L1 cache.

Thanks,
Moein.

You certainly don’t want to do that (misaligned reads) in CUDA device code.

Regarding your question:

This is a compile-time option, not a link-time option. When I add:

-Xptxas -dlcm=cg

in the Additional Options box in Project…Properties…Configuration Properties…CUDA C/C++…Command Line, things compile properly.

It’s not clear what changes you expect to see when you profile the app. To confirm whether or not the compile option had any effect, it would be necessary to inspect the generated machine code. I would use

cuobjdump -sass myexe.exe

from the command line, comparing the output with and without the compile option.

When I do that with the default app created by a new CUDA Runtime project in Visual Studio 2019, the cuobjdump tool reports:

Fatbin ptx code:
================
arch = sm_61
code version = [7,1]
producer = <unknown>
host = windows
compile_size = 64bit
compressed
ptxasOptions = -dlcm=cg

and I observe the presence of instructions like this:

    /*0068*/                   LDG.E.CG R4, [R4] ;  

which confirm that the option was respected.
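
For illustration, here is a minimal sketch of the kind of source and commands involved (the kernel, file, and variable names are made up); with the option set, the global load should show up in the SASS with a .CG suffix, as above:

    // scale.cu -- illustrative example
    __global__ void scaleKernel(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 2.0f * in[i];   // ordinary global load
    }

    // Command-line equivalent of the Visual Studio setting (assuming sm_61, as in the output above):
    //   nvcc -arch=sm_61 -Xptxas -dlcm=cg -c scale.cu
    // Inspect the generated machine code:
    //   cuobjdump -sass scale.obj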

Thank you for your reply. Well, I have to do this, as it comes from the nature of the algorithm I’m implementing. I’m working on using shared memory to mitigate the problem.

Now that I have run “cuobjdump -sass myexe.exe”, I can see that line in the output when I use “-Xptxas -dlcm=cg”. So we are good on that point.

I expected at least a small speedup for my app, but I saw no change in the processing time (I used clock() before and after the kernel). I also ran “nvprof --metrics gld_efficiency,gld_transactions TUI_CUDA”, but again no change was observed in gld_efficiency or gld_transactions.
I thought the point of adding “-Xptxas -dlcm=cg” was to decrease the number of global load transactions (gld_transactions), and thereby speed up processing, when loads from memory are misaligned. Isn’t that the case? Am I missing something here?
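
As an aside, host-side clock() is a fairly coarse way to time a kernel, and kernel launches are asynchronous, so without a synchronization the measured interval may not cover the kernel’s execution at all. A minimal sketch of timing one launch with CUDA events instead (myKernel and its launch configuration are placeholders):

    #include <cuda_runtime.h>

    __global__ void myKernel(float *data, int n)   // placeholder kernel
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    // Time one kernel launch in milliseconds using CUDA events.
    float timeKernelMs(float *d_data, int n)
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);              // wait for the kernel to finish

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }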

The option does not automatically speed up any code. In the general case, I would expect use of that option to slow things down. However, when the activity of your code is dominated by scattered global load access (not the same thing as misaligned access, which is illegal), then it’s possible that the option will speed things up. YMMV; it will be very code-dependent. Also, some architectures (e.g. Kepler) have L1 caching of global loads disabled by default.
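
As a narrower alternative to the program-wide compiler flag, newer CUDA toolkits also offer per-load cache-hint builtins such as __ldcg(), which requests an L2-only (“.cg”) load for a single access. A minimal sketch, assuming that builtin is available in your toolkit version (kernel and variable names are made up):

    // Gather kernel with scattered, data-dependent global loads.
    // __ldcg() asks for a cache-global (L2-only) load for this access alone,
    // leaving the rest of the program's loads with default caching.
    __global__ void gatherKernel(float *out, const float *table,
                                 const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __ldcg(&table[idx[i]]);
    }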


Yes, I’m probably mistaken in calling it misaligned. My program is dominated by scattered global load access (multiple threads in a warp request access to different memory locations), and I think there should be some change when I use “-Xptxas -dlcm=cg”, but I see none. My GPU is a GTX 950M, which is based on the NVIDIA Maxwell architecture. Is the L1 cache disabled by default on Maxwell?

https://docs.nvidia.com/cuda/kepler-tuning-guide/index.html#kepler-tuning

https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#maxwell-tuning

Another general possibility of course is that global loads are not a performance limiter for your application.

Different threads in a warp accessing different memory locations isn’t yet scattered access. The common CUDA-typical “base+tid” addressing pattern already has that property. I assume you meant the threads in a warp are accessing non-contiguous data objects widely spread out throughout global memory, i.e. poor locality of intra-warp memory access.
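
To make the distinction concrete, a small sketch contrasting the two patterns (names are illustrative):

    // "base + tid": adjacent threads in a warp read adjacent elements,
    // so a warp's loads coalesce into a small number of transactions.
    __global__ void coalescedCopy(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Scattered gather: adjacent threads read data-dependent locations
    // spread across global memory, so a warp's loads touch many different
    // cache lines/sectors (poor intra-warp locality).
    __global__ void scatteredGather(float *out, const float *table,
                                    const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = table[idx[i]];
    }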

Yes, that is what I mean.
In the Maxwell tuning guide, I found this: “As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.” So does this mean that the L1 cache is disabled by default on the Maxwell architecture?

I think speaking of L1 cache being “disabled” does not do justice to the nuances of the specification. What the docs state is that loads from global memory fall into (at least) two different classes, one of which can be cached in L1 cache while the other(s) cannot.

Specifically, it says that loads through the LDG mechanism (generally applicable to read-only data) are cacheable in L1. If you look at the disassembled binary (e.g. with cuobjdump -sass), you should be able to discern whether that applies to your code, as such loads use the LDG instruction.
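
For completeness, a minimal sketch of routing read-only data through that path (names are illustrative): either explicitly with the __ldg() intrinsic, or implicitly by declaring the pointers const __restrict__ so the compiler can prove the data is read-only:

    // Explicit: __ldg() loads through the read-only data cache
    // (these are the LDG-mechanism loads mentioned above).
    __global__ void gatherLDG(float *out, const float *table,
                              const int *idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = __ldg(&table[idx[i]]);
    }

    // Implicit: const __restrict__ qualifiers let the compiler choose the
    // same path on its own when it can prove read-only access.
    __global__ void gatherRO(float * __restrict__ out,
                             const float * __restrict__ table,
                             const int * __restrict__ idx, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = table[idx[i]];
    }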