The questions are related. Perhaps I did not phrase them clearly. Sorry about that.
I first asked about a cache-line flush instruction. (My requirement is to write a data pattern to a memory location immediately, instead of waiting for the L2 writeback.) When I did not find one, I looked in the PTX ISA doc and found the st.wt instruction, which is described as writing through the L2 cache. I thought it could be used to push data out to memory as soon as the store executes. Yes, the line will remain in L2, but it will also be written to memory immediately, which could work for me.
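For concreteness, here is a minimal sketch of the kind of store in question, assuming inline PTX is acceptable. The names store_wt, write_pattern and run_pattern_write are made up for illustration. Reading the value back on the host only shows that the store happened; it does not show when the data actually reached memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: store one 32-bit word using the PTX .wt
// (write-through) cache operator, with generic addressing.
__device__ __forceinline__ void store_wt(unsigned int *addr, unsigned int value)
{
    asm volatile("st.wt.u32 [%0], %1;" :: "l"(addr), "r"(value) : "memory");
}

__global__ void write_pattern(unsigned int *dst, unsigned int pattern)
{
    store_wt(dst, pattern);
}

// Hypothetical helper: write the pattern to a cudaMalloc'ed word and
// copy it back to the host. Returns 0 if the allocation fails.
unsigned int run_pattern_write(unsigned int pattern)
{
    unsigned int *d_word = nullptr;
    unsigned int h_word = 0;

    if (cudaMalloc((void **)&d_word, sizeof(unsigned int)) != cudaSuccess)
        return 0;
    cudaMemset(d_word, 0, sizeof(unsigned int));

    write_pattern<<<1, 1>>>(d_word, pattern);
    cudaDeviceSynchronize();

    cudaMemcpy(&h_word, d_word, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_word);
    return h_word;
}

int main()
{
    unsigned int result = run_pattern_write(0xDEADBEEFu);
    printf("read back: 0x%08X\n", result);
    return 0;
}
```

The buffer here comes from cudaMalloc(), which is the case I care about.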
My second question was about the “system memory” that st.wt is supposed to write the data to. I was not sure whether the doc meant GPU memory or host memory.
System memory is what you get when you call cudaHostAlloc().
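A rough sketch of the distinction, assuming a device with unified virtual addressing and a CUDA version where cudaPointerAttributes has a type field. The function name allocations_land_where_expected is made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: allocate one word each way and ask the runtime
// where each allocation lives.
bool allocations_land_where_expected()
{
    unsigned int *h_pinned = nullptr;  // system (host) memory, page-locked
    unsigned int *d_local  = nullptr;  // memory on the GPU

    if (cudaHostAlloc((void **)&h_pinned, sizeof(unsigned int),
                      cudaHostAllocMapped) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&d_local, sizeof(unsigned int)) != cudaSuccess) {
        cudaFreeHost(h_pinned);
        return false;
    }

    cudaPointerAttributes host_attr, dev_attr;
    bool ok = cudaPointerGetAttributes(&host_attr, h_pinned) == cudaSuccess &&
              cudaPointerGetAttributes(&dev_attr, d_local) == cudaSuccess &&
              host_attr.type == cudaMemoryTypeHost &&   // cudaHostAlloc -> host
              dev_attr.type == cudaMemoryTypeDevice;    // cudaMalloc    -> device

    cudaFree(d_local);
    cudaFreeHost(h_pinned);
    return ok;
}

int main()
{
    printf("allocations as expected: %s\n",
           allocations_land_where_expected() ? "yes" : "no");
    return 0;
}
```

A kernel can reach the cudaHostAlloc() buffer through the pointer returned by cudaHostGetDevicePointer(), and those accesses go to host memory.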
I don’t plan to use cudaHostAlloc(); I will be using cudaMalloc*(). The doc says that st.wt does a cache write-through to system memory. It is not clear to me how a write to a location allocated with cudaMalloc*() would end up in system (host) memory. I expected it to go to GPU memory, because that is where the location is allocated.