Flushing dirty L2 cache lines


Hi,

I have seen NV_UFLUSH_L2_FLUSH_DIRTY and MEM_OP_D_OPERATION_L2_FLUSH_DIRTY in the NVIDIA driver.
I found them in driver version 390.48, and I checked that they are also present in later versions (470, and the open-source drivers as well).
For example, one of the files that contains these definitions is clc06f.h, inside the kernel/nvidia-uvm folder.

I wonder if these definitions can be utilized to flush dirty lines in GPU L2 cache.
Also if possible, how can I use them in my program?

Thanks in advance,
Ravan.

Currently, there isn’t any method in CUDA to flush L2 cache lines. If you have questions about the GPU Linux driver, there is a separate forum for that.

Since the L2 is a device-wide proxy for GPU DRAM memory (logical local and global spaces), it’s not immediately obvious to me why flushing the L2 would be needed.

I need a way to bypass the L2 cache so that each access is served directly from GPU DRAM.
Yes, my question is about the driver, not the CUDA API.

Thanks.

For what purpose? Maybe there are other ways of (approximately) accomplishing whatever it is.

In that case, the logic of asking in a subforum entitled “CUDA - CUDA Programming and Performance” escapes me.


I just need each access to be served from DRAM.

I was actually directed to this forum from another forum :)
I have re-asked this question in the above-mentioned forum.

Why is that? For what purpose? How does normal L2 cache operation interfere with that purpose?

For my research.

Your best bet is to use the PTX cache operators. However, they are only hints.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#cache-operators

.cv Don’t cache and fetch again (consider cached system memory lines stale, fetch again).
The ld.cv load operation applied to a global System Memory address invalidates (discards) a matching L2 line and re-fetches the line on each new load.
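
As a minimal sketch of how that operator can be reached from CUDA C++: the `__ldcv()` intrinsic asks the compiler to emit a load with the `.cv` cache operator. The kernel and helper names below (`copy_with_ldcv`, `run_ldcv_copy`) are hypothetical and only for illustration. Since the operator is a hint, and the description above is about system memory addresses, this sketch does not show that loads from device memory actually reach DRAM.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread reads one element with a cache-volatile load (.cv operator)
// and writes it back out unchanged.
__global__ void copy_with_ldcv(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = __ldcv(&in[i]);  // hint only; hardware may still serve it from L2
    }
}

// Hypothetical host helper: copies n floats through the kernel and
// returns true if the output matches the input.
bool run_ldcv_copy(int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copy_with_ldcv<<<blocks, threads>>>(d_in, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == h_in[i]);
    return ok;
}
```

Inspecting the generated code (for example with `cuobjdump -sass`) is one way to check which load variant the compiler actually emitted.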

For sm_70 and newer, there is also

no_allocate Do not allocate data to cache. This priority is suitable for streaming data.
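
A hedged sketch of using that priority through inline PTX, assuming the qualifier is spelled `.L1::no_allocate` as in the eviction-priority list of the PTX `ld` instruction, and assuming a toolkit whose PTX version accepts it. As far as I can tell it applies to L1 only, so it does not bypass L2. The names `load_l1_no_allocate`, `copy_no_allocate` and `run_no_allocate_copy` are hypothetical, and the code falls back to an ordinary load when compiled for an architecture older than sm_70.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Load one float with the L1 "no_allocate" eviction priority.
// Assumes ptr points to global memory (e.g. from cudaMalloc).
__device__ __forceinline__ float load_l1_no_allocate(const float* ptr)
{
    float v;
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 700)
    asm volatile("ld.global.L1::no_allocate.f32 %0, [%1];"
                 : "=f"(v)
                 : "l"(ptr));
#else
    v = *ptr;  // fallback: plain load on older architectures
#endif
    return v;
}

__global__ void copy_no_allocate(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = load_l1_no_allocate(&in[i]);
    }
}

// Hypothetical host helper: returns true if the copy is bit-exact.
bool run_no_allocate_copy(int n)
{
    std::vector<float> h_in(n), h_out(n, 0.0f);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = n * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copy_no_allocate<<<blocks, threads>>>(d_in, d_out, n);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    for (int i = 0; ok && i < n; ++i) ok = (h_out[i] == h_in[i]);
    return ok;
}
```

To exercise the inline PTX path rather than the fallback, compile for sm_70 or newer, e.g. `nvcc -arch=sm_70`.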

Actually, I have already tried these approaches.
no_allocate does not seem to be supported for the L2 cache: I tried L2::no_allocate, but compilation failed.
Also, I don’t want to access system memory, but the device DRAM.

Thanks.