[Jetson-TK1] nvprof, hardware performance counters and actual DRAM bandwidth usage

Hey,

I’m executing a single kernel 10 times under nvprof to gather memory usage statistics. However, sometimes I get the following output:

==761== Profiling result:
==761== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "GK20A (0)"
	Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
         60                             gld_inst_8bit           0           0           0
         60                            gld_inst_16bit           0           0           0
         60                            gld_inst_32bit           0           0           0
         60                            gld_inst_64bit           0           0           0
         60                           gld_inst_128bit           0           0           0
         60                             gst_inst_8bit           0           0           0
         60                            gst_inst_16bit           0           0           0
         60                            gst_inst_32bit           0           0           0
         60                            gst_inst_64bit           0           0           0
         60                           gst_inst_128bit           0           0           0

==761== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GK20A (0)"
	Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
         60                                       ipc                              Executed IPC    2.059172    2.072671    2.067245
         60                  gld_requested_throughput          Requested Global Load Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         60                  gst_requested_throughput         Requested Global Store Throughput  0.00000B/s  0.00000B/s  0.00000B/s
         60                            gst_throughput                   Global Store Throughput  15.928MB/s  15.971MB/s  15.956MB/s
         60                            gld_throughput                    Global Load Throughput  159.28MB/s  159.71MB/s  159.56MB/s
         60                 warp_execution_efficiency                 Warp Execution Efficiency     100.00%     100.00%     100.00%
         60                                issued_ipc                                Issued IPC    2.283682    2.304195    2.297223
         60                          gld_transactions                  Global Load Transactions       48000       48000       48000
         60                          gst_transactions                 Global Store Transactions        2400        2400        2400
         60                     stall_exec_dependency  Issue Stall Reasons (Execution Dependenc      62.64%      63.17%      62.92%
         60                   stall_memory_dependency        Issue Stall Reasons (Data Request)      14.24%      14.94%      14.55%
         60                               stall_other               Issue Stall Reasons (Other)       4.48%       4.54%       4.51%
         60                       ldst_fu_utilization      Load/Store Function Unit Utilization     Low (2)     Low (2)     Low (2)
         60                        alu_fu_utilization      Arithmetic Function Unit Utilization     Mid (5)     Mid (5)     Mid (5)
         60                         cf_fu_utilization    Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         60                             inst_executed                     Instructions Executed     1430400     1430400     1430400
         60                               inst_issued                       Instructions Issued     1589156     1590099     1589556
         60                     stall_memory_throttle     Issue Stall Reasons (Memory Throttle)       0.61%       0.62%       0.62%

As you can see from the output above, nvprof has not counted any global load or store instructions at all, so the [gst/gld]_requested_throughput metrics are incorrect. The [gst/gld]_throughput metrics, however, always show more realistic bandwidth usage. Normal output looks more like this:

==3004== Profiling result:
==3004== Event result:
Invocations                                Event Name         Min         Max         Avg
Device "GK20A (0)"
        Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
         60                             gld_inst_8bit       76800       76800       76800
         60                            gld_inst_16bit           0           0           0
         60                            gld_inst_32bit           0      614400       71680
         60                            gld_inst_64bit           0           0           0
         60                           gld_inst_128bit           0           0           0
         60                             gst_inst_8bit           0           0           0
         60                            gst_inst_16bit           0       76800        6400
         60                            gst_inst_32bit           0           0           0
         60                            gst_inst_64bit           0           0           0
         60                           gst_inst_128bit           0           0           0

==3004== Metric result:
Invocations                               Metric Name                        Metric Description         Min         Max         Avg
Device "GK20A (0)"
        Kernel: gpu_dct_kernel(unsigned char*, short*, int, int, int)
         60                                       ipc                              Executed IPC    0.761586    0.823856    0.796783
         60                  gld_requested_throughput          Requested Global Load Throughput  13.655MB/s  457.50MB/s  65.423MB/s
         60                  gst_requested_throughput         Requested Global Store Throughput  0.00000B/s  27.831MB/s  2.3047MB/s
         60                            gst_throughput                   Global Store Throughput  27.310MB/s  27.843MB/s  27.618MB/s
         60                            gld_throughput                    Global Load Throughput  273.10MB/s  278.43MB/s  276.18MB/s
         60                 warp_execution_efficiency                 Warp Execution Efficiency     100.00%     100.00%     100.00%
         60                                issued_ipc                                Issued IPC    0.848035    0.913218    0.885046
         60                          gld_transactions                  Global Load Transactions       48000       48000       48000
         60                          gst_transactions                 Global Store Transactions        2400        2400        2400
         60                     stall_exec_dependency  Issue Stall Reasons (Execution Dependenc      21.32%      24.03%      23.07%
         60                   stall_memory_dependency        Issue Stall Reasons (Data Request)      65.55%      69.56%      66.85%
         60                               stall_other               Issue Stall Reasons (Other)       1.37%       1.54%       1.48%
         60                       ldst_fu_utilization      Load/Store Function Unit Utilization     Low (1)     Low (1)     Low (1)
         60                        alu_fu_utilization      Arithmetic Function Unit Utilization     Low (2)     Low (2)     Low (2)
         60                         cf_fu_utilization    Control-Flow Function Unit Utilization     Low (1)     Low (1)     Low (1)
         60                             inst_executed                     Instructions Executed     1430400     1430400     1430400
         60                               inst_issued                       Instructions Issued     1585427     1586274     1585823
         60                     stall_memory_throttle     Issue Stall Reasons (Memory Throttle)       0.22%       0.38%       0.24%

It looks like nvprof is not able to count the number of load/store instructions reliably.
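
For reference, I am collecting the counters with an invocation along these lines (just a sketch; ./dct is a placeholder for my actual binary and the event/metric lists are trimmed here):

nvprof --events gld_inst_8bit,gld_inst_32bit,gst_inst_16bit,gst_inst_32bit \
       --metrics gld_throughput,gst_throughput,gld_requested_throughput,gst_requested_throughput \
       ./dct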

Finally, I am trying to measure the actual DRAM bandwidth, i.e. the bandwidth actually used for reading and writing physical global memory. My platform is the Jetson-TK1, so in my case this is of course memory shared with the host. Does anyone have any input on how this can be done?

I am not familiar with nvprof, but when a debugger or symbol browser (which includes profilers) cannot access something, I tend to suspect compiler optimizations that inline the code or otherwise prevent it from being reached as a separate call. A CUDA kernel is an interesting twist on this, but I wonder whether your original compile applied some kind of implicit inlining. Although not an answer to your real question: have you compiled your kernel program with all debugging/profiling enabled, and with all implicit inlining forcibly disabled?

I can try disabling optimisations, thanks for the tip. I am currently compiling with debugging enabled, but I have not seen any profiling options for nvcc.
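
Concretely, something like this is what I have in mind for the next build (just a sketch; sm_32 matches the GK20A, and the file names are placeholders):

nvcc -arch=sm_32 -g -G -lineinfo -o dct dct.cu

As far as I understand, -G disables most device-side optimisations (including most implicit inlining) and -lineinfo keeps the source correlation the profiler needs; individual device functions can also be marked __noinline__.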

Running nvprof with “-e all” to collect all available hardware performance counters, I see that the following counters increase more logically (compared with what I know must be transferred):

         60        l2_subp0_total_read_sector_queries       48028       48110       48069
         60       l2_subp0_total_write_sector_queries        4806        5042        4809
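
On subsequent runs I can restrict collection to just these two counters, so nvprof does not have to replay the kernel for every event group (again a sketch; ./dct is a placeholder):

nvprof -e l2_subp0_total_read_sector_queries,l2_subp0_total_write_sector_queries ./dct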

The documentation describes these counters as follows:

l2_subp0_total_read_sector_queries: Total read requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

l2_subp0_total_write_sector_queries: Total write requests to slice 0 of L2 cache. This includes requests from L1, Texture cache, system memory. This increments by 1 for each 32-byte access.

Does this mean that these counters actually reflect the DRAM <-> L2 cache bandwidth?
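
If they do, a back-of-the-envelope conversion based on the 32-byte granularity would look like this (assuming slice 0 sees all of the traffic; if there are more slices their counters would have to be summed, and kernel_time is whatever nvprof reports for one launch):

read bytes     ~ 48069 queries * 32 B ~ 1.54 MB per launch
write bytes    ~  4809 queries * 32 B ~ 0.15 MB per launch
read bandwidth ~ 1.54 MB / kernel_time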