CUPTI mapping of SM to instance

RKCisco · July 29, 2015, 10:29pm

Sorry for what is most likely a newbie question, but I’m trying to understand how to associate per-instance event/metric counters with a particular SM. I have multiple long-running kernels executing simultaneously. I know which SM(s) each kernel is executing on by using “mov.u32 %0, %smid;” from within the kernel. I’ve been through the CUPTI API docs and haven’t seen any way to correlate instance to SMID. Any pointers greatly appreciated.

little_jimmy · July 30, 2015, 5:03am

"understand how to associate per instance event/metric counters with a particular SM"

could you explain exactly what per-instance event/metric counters mean, and why they are important to your case?

RKCisco · July 30, 2015, 2:24pm

For example, here is the IPC metric (output of nvidia-smi --query-metrics for Tesla)

ipc_instance: Instructions executed per cycle for a single multiprocessor

I am using CUPTI to retrieve the IPC metric per instance (i.e. per SM). From the data I’ve gathered there is not a straight mapping of “instance” to “SM ID”. Here is what I am observing, for my two kernels, each of which launches 4 thread blocks of 512 threads:

Metric ipc_instance:
Instance 0: 0.000000
Instance 1: 0.000000
Instance 2: 0.154155
Instance 3: 0.000000
Instance 4: 0.000000
Instance 5: 0.336802
Instance 6: 0.000000
Instance 7: 0.153201
Instance 8: 0.340822
Instance 9: 0.000000
Instance 10: 0.153438
Instance 11: 0.343361
Instance 12: 0.000000
Instance 13: 0.156066
Instance 14: 0.341185

Kernel mappings reported by my kernels:
“response” thread block ran on SM 14
“response” thread block ran on SM 13
“response” thread block ran on SM 12
“response” thread block ran on SM 11
“request” thread block ran on SM 10
“request” thread block ran on SM 9
“request” thread block ran on SM 8
“request” thread block ran on SM 7

Based on these results, four thread blocks of one kernel (with IPC roughly 0.15) ran on instances [2, 7, 10, 13] and four thread blocks of the other kernel (with IPC roughly 0.34) ran on instances [5, 8, 11, 14].
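The grouping above can be reproduced mechanically from the posted numbers. The following is a minimal sketch, not anything CUPTI provides: `split_by_ipc` and the 0.25 threshold are hypothetical, picked only because the two kernels in this run sit near 0.15 and 0.34.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Per-instance ipc_instance values copied from the output posted above.
const std::vector<double> kIpcPerInstance = {
    0.000000, 0.000000, 0.154155, 0.000000, 0.000000,
    0.336802, 0.000000, 0.153201, 0.340822, 0.000000,
    0.153438, 0.343361, 0.000000, 0.156066, 0.341185};

// Splits the active (non-zero) instances into a low-IPC and a high-IPC group.
// The threshold is an assumption that only makes sense for this data set.
std::pair<std::vector<int>, std::vector<int>> split_by_ipc(
    const std::vector<double>& ipc, double threshold) {
  assert(threshold > 0.0);
  std::vector<int> low, high;
  for (std::size_t i = 0; i < ipc.size(); ++i) {
    if (ipc[i] <= 0.0) continue;  // idle instance: no thread block ran here
    (ipc[i] < threshold ? low : high).push_back(static_cast<int>(i));
  }
  return {low, high};
}
```

Calling `split_by_ipc(kIpcPerInstance, 0.25)` separates the instances of the slower kernel from those of the faster one. It says nothing about which SMID each instance corresponds to, which is the open question.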

I could determine the correlation with enough experimentation, however I would think this would be queryable via CUPTI or the driver APIs.

little_jimmy · July 30, 2015, 3:01pm

how do you intend to use the data? how can it be useful?

you also do not know in what order instances were seated
perhaps 2, 7, 10, 13 were seated before 5, 8, 11, 14, yielding them some advantage
your thoughts?

Robert_Crovella · July 30, 2015, 4:33pm

I’m guessing you meant:

nvprof --query-metrics

AFAIK some metrics are per-SM, whereas others are gathered from multiple SMs and are a statistical estimate of the behavior of your code across the device. AFAIK there is no way to identify the specific SM from which these measurements were derived, but I may be wrong. That is probably a question for Greg @ NV. He comes by these forums occasionally, perhaps weekly.

Using “instance” when you mean “SM” or “threadblock” certainly confused me, at first.

RKCisco · July 30, 2015, 6:29pm

Hi, yes, you are correct: that output came from nvprof, sorry about that. My understanding is the same as yours: some metrics can be calculated per SM, some per DRAM channel, etc.

The term “instance” is used by the CUPTI library specifically. When profiling a metric that can be captured on a per-SM basis, there are the same number of instances as there are SMs (but the numbering scheme is different, which is the gist of my question). When profiling a metric that can be captured on a per-DRAM-channel basis, there is an equivalent number of CUPTI “instances” (i.e. 6 channels/instances on my Tesla).

A little more context: my “long-running” kernels run forever, rendering the “normal” profiling tools mostly useless (the kernel and application replay scenarios used by the profiler aren’t possible in my use case).

RKCisco · August 5, 2015, 8:14pm

Needed the info so wrote a test case to determine the mapping. For reference, here it is for a Tesla K40m:

SMID     CUPTI Instance
----     --------------
  0            0
  1            3
  2            6
  3            9
  4           12
  5            1
  6            4
  7            7
  8           10
  9           13
 10            2
 11            5
 12            8
 13           11
 14           14
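For anyone who wants to use the table in host code, here is a small sketch of it as a lookup. The function names are hypothetical. The values are only the ones measured above on a Tesla K40m, so treat them as an assumption for any other GPU or driver. The closed-form `instance_from_pattern` is just a pattern that happens to fit these 15 rows, not documented behavior.

```cpp
#include <array>
#include <cassert>

// SMID -> CUPTI instance, copied from the measured table above (Tesla K40m).
constexpr std::array<int, 15> kSmidToInstance = {
    0, 3, 6, 9, 12, 1, 4, 7, 10, 13, 2, 5, 8, 11, 14};

// Forward lookup: the CUPTI instance that reported counters for this SMID.
int instance_for_smid(int smid) {
  assert(smid >= 0 && smid < static_cast<int>(kSmidToInstance.size()));
  return kSmidToInstance[static_cast<std::size_t>(smid)];
}

// Reverse lookup: the SMID behind a CUPTI instance, or -1 if not in the table.
int smid_for_instance(int instance) {
  for (std::size_t smid = 0; smid < kSmidToInstance.size(); ++smid) {
    if (kSmidToInstance[smid] == instance) return static_cast<int>(smid);
  }
  return -1;
}

// Observed pattern only: reproduces the 15 measured rows, nothing more.
int instance_from_pattern(int smid) { return (smid % 5) * 3 + smid / 5; }
```

This agrees with the earlier run: SMIDs 7 to 10 map to instances 7, 10, 13 and 2 (the kernel at about 0.15 IPC), and SMIDs 11 to 14 map to instances 5, 8, 11 and 14 (the kernel at about 0.34 IPC).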
