Interpretation of CUPTI results

Hi. I’m trying to interpret the result of CUPTI sample, pc_sampling.

I got this result.

1  Device Name: NVIDIA GeForce RTX 3090
2  Device compute capability: 8.6
3  id 1, ctx 1, moduleId 18, functionIndex 8, name _Z6VecAddPKiS0_Pii
4  Source Locator Id 1, File /home/user01/ksy/test/pc_sampling.cu Line 46
5  source 1, functionId 1, pc 0x100, corr 1, samples 1, latency samples 1, stallreason Memory throttle
6  Source Locator Id 2, File /home/user01/ksy/test/pc_sampling.cu Line 45
7  source 2, functionId 1, pc 0xd0, corr 1, samples 12, latency samples 12, stallreason Memory dependency
8  Source Locator Id 3, File /home/user01/ksy/test/pc_sampling.cu Line 43
9  source 3, functionId 1, pc 0x10, corr 1, samples 1, latency samples 1, stallreason Execution dependency
10 Source Locator Id 4, File /home/user01/ksy/test/pc_sampling.cu Line 42
11 source 4, functionId 1, pc 0x0, corr 1, samples 3, latency samples 3, stallreason Instruction fetch
12 source 4, functionId 1, pc 0x0, corr 1, samples 7, latency samples 7, stallreason Constant memory dependency
13 source 2, functionId 1, pc 0x70, corr 1, samples 2, latency samples 2, stallreason Constant memory dependency
14 Source Locator Id 5, File /home/user01/ksy/test/pc_sampling.cu Line 44
15 source 5, functionId 1, pc 0x40, corr 1, samples 1, latency samples 1, stallreason Execution dependency
16 source 5, functionId 1, pc 0x40, corr 1, samples 13, latency samples 13, stallreason Constant memory dependency
17 source 5, functionId 1, pc 0x50, corr 1, samples 1, latency samples 1, stallreason Execution dependency
18 source 2, functionId 1, pc 0xa0, corr 1, samples 1, latency samples 0, stallreason Selected
19 corr 1, totalSamples 42, droppedSamples 0, samplingPeriodInCycles 32

and kernel code

42 __global__ void VecAdd(const int *A, const int *B, int *C, int N) {
43   int i = blockDim.x * blockIdx.x + threadIdx.x;
44   if (i < N)
45     C[i] = A[i] + B[i];
46 }

So it is right to interpret this result like this?:

  1. for the first sampling period, Memory throttle happened in line 46
  2. for the 2nd~13th period(12 samples), there was memory dependency in line 45

If it is right to interpret like this, then how come source locator is located at line 10?
I mean line 42 is starting line of kernel code, so doesn’t it have to be in the top of the sampling if sampled?

Thank you in advance!

Hi Salmon,

Here you are assuming that instruction flow happens from top to bottom as per kernel code, whereas it happens based on SASS instructions generated for this kernel. Once you convert the kernel to sass instruction you will get the idea. You can use cuobjdump for generating sass code (CUDA Binary Utilities :: CUDA Toolkit Documentation)

Lets consider line 45: C[i] = A[i] + B[i];

there are multiple instructions set to execute, first, we Load instructions (LDG) for A and B and then we have integer addition (IADD) and then finally we store the computed value in C (STG). I’m just highlighting the important ones there are other instructions which will be executed. For the load operation if we correlate to kernel code it will point to Line 42 (kernel code) and for this reason, we are seeing Line 42 (kernel code) come in a later part of the pc sampling records as there is some stall happening because of memory dependencies.

Okay. I should look at the instructions.
But I’m not sure about one thing. Why does load operation point to line 42? Is it because A and B is declared in that line?
I’m thinking it should be line 45.

The only lines associated with Line 42 are

11 source 4, functionId 1, pc 0x0, corr 1, samples 3, latency samples 3, stallreason Instruction fetch
12 source 4, functionId 1, pc 0x0, corr 1, samples 7, latency samples 7, stallreason Constant memory dependency

You can view the SASS disassembly either by (a) using Nsight Compute, (b) running nvdisasm/cuobjdump, or (c) using godbolt.

The instruction fetches are due to start up cost for new warps to fetch the first instruction.
The constant memory dependency is for constant cache misses. These are from instructions with c[bank][offset] in the opcode or LDC instructions. These exist in all global functions.

If you are trying to use CUPTI to get PC sampling data I highly recommend you run the same application in Nsight Compute to make sure you are correctly interpreting the data.