Cache line flush

I have a code that uses the x86 CLFLUSH instruction to flush the line containing a specific address, from all caches. Is it possible to do this in the GPU? I did not find any equivalent PTX instruction to flush a cache line. I would like to be able to do the following

  • read from or write to an address
  • flush the line containing the address from all caches


Trying to build a GPU rowhammer?

Something like that.

Looks like the .wt cache operator for store instructions writes through the L2 cache, to the memory. However, I am bit confused with the terminology used in the PTX ISA document :

.wt   Cache write-through (to system memory).
The st.wt store write-through operation applied to a global System Memory 
address writes through the L2 cache. 

Source : PTX ISA :: CUDA Toolkit Documentation

What do the terms “system memory” and “global System Memory” refer to? Do they mean the memory on the GPU or the memory on the host? My guess is that it is the GPU memory but the terms used are not very clear to me.


Slide 7 of the following document

says that System Memory is host memory. If I allocate memory using cudaMalloc() or cudaMallocManaged(), any writeback or write-through (using st.wt) or eviction from L2 must go to the GPU memory (and not host memory). Is this correct?

The slide also says that the “L2 does not cache system memory”.


I don’t know of anything you can do to selectively flush or invalidate cache lines. You now seem to be asking different questions.

system memory is usually referring to host memory from the perspective of the GPU. global refers to a logical space. System memory is mapped within the logical global space. system memory is also referred to as zero copy. System memory is what you get when you do cudaHostAlloc.

neither cudaMalloc nor cudaMallocManaged allocate system memory. Yes, afaik, system memory accesses are not cached in L2. You can write a fairly simple demonstrator to prove this. (They may be cached in L1, however.)

I don’t think I understand the remainder of your questions.

The questions are related. Perhaps I did not phrase them clearly. Sorry about that.

I first asked about a cache line flush instruction. (My requirement was to be able to write a data pattern to a memory location immediately, instead of waiting for L2 writeback). When I did not find any I looked into the PTX ISA doc and found the st.wt instruction which is said to write through the L2 cache. I thought that this could be used to get data out to the memory immediately after it is written using the store instruction. Yes, the line will remain in the L2 but it will also get written into memory immediately, which could work for me.

My second question was about the “system memory” that st.wt is supposed to write the data to. I was not sure if the doc meant GPU memory or host memory.

System memory is what you get when you do cudaHostAlloc.

I don’t plan to use cudaHostAlloc(). I will be using cudaMalloc*(). The doc says that st.wt does a cache write-through to system memory. It was not clear to me how, when writing to a memory location allocated using cudaMalloc*(), the data will go to the system (or host) memory. I expected it to go to the memory in the GPU because that is where it is allocated.


Although only a hint and requiring sm >= 7.0, is the “no_allocate” parameter of any use here?

I think it could. I am looking into it also.