I’m not sure where to point this out, so I’m doing so here.
The documentation for the CUDA profiler is out of date and, as a result, wrong in a number of places.
Here it states:
gld_throughput = ((128 * global_load_hit) + (l2_subp0_read_requests + l2_subp1_read_requests) * 32 - (l1_local_ld_miss * 128)) / gputime
But l2_subp0_read_requests, l2_subp1_read_requests, and l1_local_ld_miss are incorrect counter names. Only after much frustration and poking around nvprof did I work out that these should in fact be l2_subp0_read_sector_queries, l2_subp1_read_sector_queries, and l1_local_load_miss.
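For anyone else trying to reproduce the metric by hand, the documented formula with the corrected counter names can be sketched as below. This is just my reading of the formula; the function name, units (bytes per unit of gputime), and the sample counter values are my own assumptions, not real profiler output.

```python
def gld_throughput(global_load_hit,
                   l2_subp0_read_sector_queries,
                   l2_subp1_read_sector_queries,
                   l1_local_load_miss,
                   gputime):
    """Global load throughput per the documented formula, using the
    corrected counter names: 128 bytes per L1 global-load hit,
    32 bytes per L2 read sector query, minus 128 bytes per
    local-load miss that also passed through L1, over gputime."""
    bytes_loaded = (128 * global_load_hit
                    + 32 * (l2_subp0_read_sector_queries
                            + l2_subp1_read_sector_queries)
                    - 128 * l1_local_load_miss)
    return bytes_loaded / gputime

# Illustrative (made-up) counter values:
print(gld_throughput(10, 4, 4, 0, 2.0))
```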
It would also be useful if standard units were supplied alongside many of the metrics (e.g. GB/s, GiB/s, or "instructions per second"). This would make it much easier for newcomers to understand what the various metrics mean.
I’m sure I’m not the only one who has run into this, so I wanted to ask: is there a “proper” channel for reporting documentation issues like this to NVIDIA?