Measuring PTX `st.wt` throughput

Keren-Zhou · February 22, 2023, 8:29am

Based on my understanding, st.wt updates all levels of the memory hierarchy, including L2 and global memory. However, when I use ncu to measure the memory workload, the number of bytes written to DRAM does not seem to match the program's semantics.

I would like to check whether my understanding of st.wt is right and how ncu profiles it.

jmarusarz · February 23, 2023, 10:29pm

The st.wt instruction is specifically for system memory, not device memory. See Table 28, "Cache Operators for Memory Store Instructions", in the PTX ISA section of the CUDA Toolkit Documentation. Is that in line with your understanding, or does that clarify the issue you are seeing? If not, sharing an Nsight Compute report along with your observations and questions would be useful.

Keren-Zhou · February 23, 2023, 10:59pm

Thanks for the clarification. That makes sense.

But what if the address is a device memory address rather than a system memory address? Will Nsight Compute still count it as a device memory write?

See this chart: the store operations have the .wt modifier, but the data is still in L2 after the kernel ends. I thought .wt would trigger L2 → device memory writes, based on its name, “write-through”.

jmarusarz · March 9, 2023, 9:57pm

Keren-Zhou wrote: “.wt will trigger L2 → device memory writes based on its name ‘write-through’.”

This is not true. The .wt operator only applies to system memory, as described in that link, so it will not have an effect on device memory.
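To check this yourself under ncu, a small reproducer along the following lines may help. It is only a sketch, not the original poster's kernel: the names `store_wt`, `fill_wt`, `all_equal`, and the buffer size are made up for illustration. It issues the same `st.wt` store once to a `cudaMalloc` buffer (device memory) and once to pinned, mapped host memory (system memory), so the two kernel launches can be compared side by side in the profiler.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                    cudaGetErrorString(err_));                         \
            exit(1);                                                   \
        }                                                              \
    } while (0)

// Hypothetical helper: store through a generic address with the .wt operator.
__device__ __forceinline__ void store_wt(float* ptr, float value) {
    asm volatile("st.wt.f32 [%0], %1;" :: "l"(ptr), "f"(value) : "memory");
}

// Each thread writes one element using st.wt.
__global__ void fill_wt(float* out, int n, float value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) store_wt(out + i, value);
}

// Host-side check that every element holds the expected value.
static bool all_equal(const float* data, int n, float expected) {
    for (int i = 0; i < n; ++i)
        if (data[i] != expected) return false;
    return true;
}

int main() {
    const int n = 1 << 20;  // arbitrary size chosen for this sketch
    const size_t bytes = n * sizeof(float);
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Launch 1: the target is device memory.
    float* d_buf = nullptr;
    CHECK(cudaMalloc((void**)&d_buf, bytes));
    fill_wt<<<blocks, threads>>>(d_buf, n, 1.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());
    float* h_copy = (float*)malloc(bytes);
    CHECK(cudaMemcpy(h_copy, d_buf, bytes, cudaMemcpyDeviceToHost));

    // Launch 2: the target is pinned, mapped host memory (system memory).
    float* h_mapped = nullptr;
    float* d_alias = nullptr;
    CHECK(cudaHostAlloc((void**)&h_mapped, bytes, cudaHostAllocMapped));
    CHECK(cudaHostGetDevicePointer((void**)&d_alias, h_mapped, 0));
    fill_wt<<<blocks, threads>>>(d_alias, n, 2.0f);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    bool ok = all_equal(h_copy, n, 1.0f) && all_equal(h_mapped, n, 2.0f);
    printf("%s\n", ok ? "ok" : "mismatch");

    free(h_copy);
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_mapped));
    return ok ? 0 : 1;
}
```

Profiling the binary with ncu shows the two `fill_wt` launches as separate results. If the explanation above holds, the device memory launch can finish with its data still resident in L2, while the system memory launch should show the writes leaving the GPU. The exact metric names to compare (for example `dram__bytes_write.sum`) are an assumption about your Nsight Compute version; `ncu --query-metrics` lists what is available on your GPU.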
