Measuring PTX `st.wt` throughput

Based on my understanding, `st.wt` updates every level of the memory hierarchy, including L2 and global memory. However, when I measure the memory workload with ncu, the number of bytes written to DRAM does not match what I would expect from the program's semantics.

I would like to check whether my understanding of `st.wt` is right, and how ncu profiles it.

The `st.wt` cache operator applies specifically to system memory, not device memory. See Table 28, "Cache Operators for Memory Store Instructions", in the PTX ISA section of the CUDA Toolkit Documentation. Is that in line with your understanding, or does it clarify the issue you are seeing? If not, it would help if you could share an Nsight Compute report along with your observations and questions.
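For anyone who wants to reproduce this under ncu, here is a minimal sketch of a kernel that issues `.wt` stores to a `cudaMalloc` buffer, i.e. device memory. The names `store_wt`, `wt_store_kernel`, and `count_mismatches_device` are made up for illustration and are not from the original post. The inline PTX assumes the pointer refers to global memory.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: one 32-bit store carrying the .wt cache operator.
// Assumption: addr points into global memory (e.g. a cudaMalloc allocation).
__device__ __forceinline__ void store_wt(unsigned int* addr, unsigned int value) {
    asm volatile("st.global.wt.u32 [%0], %1;" :: "l"(addr), "r"(value) : "memory");
}

// Each thread writes its own global index with a .wt store.
__global__ void wt_store_kernel(unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) store_wt(out + i, static_cast<unsigned int>(i));
}

// Runs the kernel on a device-memory buffer and copies the result back.
// Returns the number of wrong elements, or -1 on a CUDA error.
int count_mismatches_device(int n) {
    unsigned int* d_out = nullptr;
    if (cudaMalloc(reinterpret_cast<void**>(&d_out), n * sizeof(unsigned int)) != cudaSuccess) {
        return -1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    wt_store_kernel<<<blocks, threads>>>(d_out, n);

    std::vector<unsigned int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(unsigned int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (h_out[i] != static_cast<unsigned int>(i)) ++bad;
    }
    return bad;
}
```

Calling `count_mismatches_device(1 << 20)` from a small `main` and profiling the binary with ncu should show where the written bytes end up in the Memory Workload Analysis section. The function only checks that the stores are functionally correct; it says nothing by itself about whether the data reached DRAM.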

Thanks for the clarification. That makes sense.

But what if the address is a device memory address rather than a system memory address? Will Nsight Compute still count it as a device memory write?

See this chart: the store operations have the `.wt` modifier, but the data is still in L2 after the kernel ends. I thought `.wt` would trigger L2 → device memory writes, based on its name, "write-through".

`.wt` will trigger L2 → device memory writes based on its name “write-through”.

This is not true. The `.wt` operator only applies to system memory, as described in that link, so it has no effect on device memory.
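To see the case where `.wt` is documented to matter, the store target has to be system memory. Below is a rough sketch that writes to pinned, mapped host memory instead of a `cudaMalloc` buffer. It uses the `__stwt` store intrinsic from the CUDA C++ cache-hint store functions, which I expect to lower to a `.wt` store; that is worth confirming in the generated PTX or SASS. The names `wt_store_sysmem_kernel` and `count_mismatches_sysmem` are hypothetical, and the sketch assumes the device supports mapped host memory and has compute capability 5.0 or higher.

```cuda
#include <cuda_runtime.h>

// Each thread writes its own global index using the write-through cache hint.
__global__ void wt_store_sysmem_kernel(unsigned int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) __stwt(out + i, static_cast<unsigned int>(i));
}

// Runs the kernel against pinned, mapped host (system) memory.
// Returns the number of wrong elements, or -1 on a CUDA error.
int count_mismatches_sysmem(int n) {
    unsigned int* h_buf = nullptr;
    if (cudaHostAlloc(reinterpret_cast<void**>(&h_buf), n * sizeof(unsigned int),
                      cudaHostAllocMapped) != cudaSuccess) {
        return -1;
    }
    // Fill with a sentinel so stale data cannot pass the check below.
    for (int i = 0; i < n; ++i) h_buf[i] = 0xFFFFFFFFu;

    // Device-side alias of the host allocation.
    unsigned int* d_alias = nullptr;
    if (cudaHostGetDevicePointer(reinterpret_cast<void**>(&d_alias), h_buf, 0) != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    wt_store_sysmem_kernel<<<blocks, threads>>>(d_alias, n);

    // Wait for the kernel so the host can safely read the buffer.
    if (cudaDeviceSynchronize() != cudaSuccess) {
        cudaFreeHost(h_buf);
        return -1;
    }

    int bad = 0;
    for (int i = 0; i < n; ++i) {
        if (h_buf[i] != static_cast<unsigned int>(i)) ++bad;
    }
    cudaFreeHost(h_buf);
    return bad;
}
```

Profiling this variant next to a device-memory version gives a side-by-side view of how ncu attributes the writes. With a device-memory target, seeing the data remain in L2 after the kernel ends is consistent with the explanation above.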
