Hey,
I’m executing a single kernel 10 times under nvprof to collect memory usage statistics. However, sometimes I get the following output:
==761== Profiling result:
==761== Event result:
Invocations Event Name Min Max Avg
Device "GK20A (0)"
Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
60 gld_inst_8bit 0 0 0
60 gld_inst_16bit 0 0 0
60 gld_inst_32bit 0 0 0
60 gld_inst_64bit 0 0 0
60 gld_inst_128bit 0 0 0
60 gst_inst_8bit 0 0 0
60 gst_inst_16bit 0 0 0
60 gst_inst_32bit 0 0 0
60 gst_inst_64bit 0 0 0
60 gst_inst_128bit 0 0 0
==761== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GK20A (0)"
Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
60 ipc Executed IPC 2.059172 2.072671 2.067245
60 gld_requested_throughput Requested Global Load Throughput 0.00000B/s 0.00000B/s 0.00000B/s
60 gst_requested_throughput Requested Global Store Throughput 0.00000B/s 0.00000B/s 0.00000B/s
60 gst_throughput Global Store Throughput 15.928MB/s 15.971MB/s 15.956MB/s
60 gld_throughput Global Load Throughput 159.28MB/s 159.71MB/s 159.56MB/s
60 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
60 issued_ipc Issued IPC 2.283682 2.304195 2.297223
60 gld_transactions Global Load Transactions 48000 48000 48000
60 gst_transactions Global Store Transactions 2400 2400 2400
60 stall_exec_dependency Issue Stall Reasons (Execution Dependenc 62.64% 63.17% 62.92%
60 stall_memory_dependency Issue Stall Reasons (Data Request) 14.24% 14.94% 14.55%
60 stall_other Issue Stall Reasons (Other) 4.48% 4.54% 4.51%
60 ldst_fu_utilization Load/Store Function Unit Utilization Low (2) Low (2) Low (2)
60 alu_fu_utilization Arithmetic Function Unit Utilization Mid (5) Mid (5) Mid (5)
60 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
60 inst_executed Instructions Executed 1430400 1430400 1430400
60 inst_issued Instructions Issued 1589156 1590099 1589556
60 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.61% 0.62% 0.62%
As you can see from the output above, nvprof has failed to count any global load and store instructions, so the [gst/gld]_requested_throughput metrics are incorrect. The [gst/gld]_throughput metrics, however, always show more realistic bandwidth figures. Normal output should look more like this:
==3004== Profiling result:
==3004== Event result:
Invocations Event Name Min Max Avg
Device "GK20A (0)"
Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
60 gld_inst_8bit 76800 76800 76800
60 gld_inst_16bit 0 0 0
60 gld_inst_32bit 0 614400 71680
60 gld_inst_64bit 0 0 0
60 gld_inst_128bit 0 0 0
60 gst_inst_8bit 0 0 0
60 gst_inst_16bit 0 76800 6400
60 gst_inst_32bit 0 0 0
60 gst_inst_64bit 0 0 0
60 gst_inst_128bit 0 0 0
==3004== Metric result:
Invocations Metric Name Metric Description Min Max Avg
Device "GK20A (0)"
Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
60 ipc Executed IPC 0.761586 0.823856 0.796783
60 gld_requested_throughput Requested Global Load Throughput 13.655MB/s 457.50MB/s 65.423MB/s
60 gst_requested_throughput Requested Global Store Throughput 0.00000B/s 27.831MB/s 2.3047MB/s
60 gst_throughput Global Store Throughput 27.310MB/s 27.843MB/s 27.618MB/s
60 gld_throughput Global Load Throughput 273.10MB/s 278.43MB/s 276.18MB/s
60 warp_execution_efficiency Warp Execution Efficiency 100.00% 100.00% 100.00%
60 issued_ipc Issued IPC 0.848035 0.913218 0.885046
60 gld_transactions Global Load Transactions 48000 48000 48000
60 gst_transactions Global Store Transactions 2400 2400 2400
60 stall_exec_dependency Issue Stall Reasons (Execution Dependenc 21.32% 24.03% 23.07%
60 stall_memory_dependency Issue Stall Reasons (Data Request) 65.55% 69.56% 66.85%
60 stall_other Issue Stall Reasons (Other) 1.37% 1.54% 1.48%
60 ldst_fu_utilization Load/Store Function Unit Utilization Low (1) Low (1) Low (1)
60 alu_fu_utilization Arithmetic Function Unit Utilization Low (2) Low (2) Low (2)
60 cf_fu_utilization Control-Flow Function Unit Utilization Low (1) Low (1) Low (1)
60 inst_executed Instructions Executed 1430400 1430400 1430400
60 inst_issued Instructions Issued 1585427 1586274 1585823
60 stall_memory_throttle Issue Stall Reasons (Memory Throttle) 0.22% 0.38% 0.24%
It looks like nvprof is not able to count the number of load/store instructions reliably.
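For reference, runs like the ones above can be collected with an invocation along these lines (the binary name is a placeholder; the event and metric names are taken from the listings). Note that nvprof replays the kernel when more counters are requested than the hardware can collect in one pass, which is worth keeping in mind when results differ between runs:

```shell
# Hypothetical invocation; ./dct_app is a placeholder for the actual binary.
nvprof --events gld_inst_8bit,gld_inst_16bit,gld_inst_32bit,gld_inst_64bit,gld_inst_128bit,gst_inst_8bit,gst_inst_16bit,gst_inst_32bit,gst_inst_64bit,gst_inst_128bit \
       --metrics gld_requested_throughput,gst_requested_throughput,gld_throughput,gst_throughput \
       ./dct_app
```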
Finally, I am trying to measure the actual DRAM bandwidth, i.e. the bandwidth actually used when reading and writing physical global memory. My platform is the Jetson TK1, so this memory is shared with the host in my case. Does anyone have any input on how this can be done?