Interpretation of CUPTI results

salmon.sh · April 21, 2022, 8:35am

Hi. I’m trying to interpret the result of CUPTI sample, pc_sampling.

I got this result.

1  Device Name: NVIDIA GeForce RTX 3090
2  Device compute capability: 8.6
3  id 1, ctx 1, moduleId 18, functionIndex 8, name _Z6VecAddPKiS0_Pii
4  Source Locator Id 1, File /home/user01/ksy/test/pc_sampling.cu Line 46
5  source 1, functionId 1, pc 0x100, corr 1, samples 1, latency samples 1, stallreason Memory throttle
6  Source Locator Id 2, File /home/user01/ksy/test/pc_sampling.cu Line 45
7  source 2, functionId 1, pc 0xd0, corr 1, samples 12, latency samples 12, stallreason Memory dependency
8  Source Locator Id 3, File /home/user01/ksy/test/pc_sampling.cu Line 43
9  source 3, functionId 1, pc 0x10, corr 1, samples 1, latency samples 1, stallreason Execution dependency
10 Source Locator Id 4, File /home/user01/ksy/test/pc_sampling.cu Line 42
11 source 4, functionId 1, pc 0x0, corr 1, samples 3, latency samples 3, stallreason Instruction fetch
12 source 4, functionId 1, pc 0x0, corr 1, samples 7, latency samples 7, stallreason Constant memory dependency
13 source 2, functionId 1, pc 0x70, corr 1, samples 2, latency samples 2, stallreason Constant memory dependency
14 Source Locator Id 5, File /home/user01/ksy/test/pc_sampling.cu Line 44
15 source 5, functionId 1, pc 0x40, corr 1, samples 1, latency samples 1, stallreason Execution dependency
16 source 5, functionId 1, pc 0x40, corr 1, samples 13, latency samples 13, stallreason Constant memory dependency
17 source 5, functionId 1, pc 0x50, corr 1, samples 1, latency samples 1, stallreason Execution dependency
18 source 2, functionId 1, pc 0xa0, corr 1, samples 1, latency samples 0, stallreason Selected
19 corr 1, totalSamples 42, droppedSamples 0, samplingPeriodInCycles 32

and kernel code

42 __global__ void VecAdd(const int *A, const int *B, int *C, int N) {
43   int i = blockDim.x * blockIdx.x + threadIdx.x;
44   if (i < N)
45     C[i] = A[i] + B[i];
46 }

So it is right to interpret this result like this?:

for the first sampling period, Memory throttle happened in line 46
for the 2nd~13th period(12 samples), there was memory dependency in line 45

If it is right to interpret like this, then how come source locator is located at line 10?
I mean line 42 is starting line of kernel code, so doesn’t it have to be in the top of the sampling if sampled?

Thank you in advance!

ssubudhi · May 2, 2022, 7:45am

Hi Salmon,

Here you are assuming that instruction flow happens from top to bottom as per kernel code, whereas it happens based on SASS instructions generated for this kernel. Once you convert the kernel to sass instruction you will get the idea. You can use cuobjdump for generating sass code (CUDA Binary Utilities :: CUDA Toolkit Documentation)

Lets consider line 45: C[i] = A[i] + B[i];

there are multiple instructions set to execute, first, we Load instructions (LDG) for A and B and then we have integer addition (IADD) and then finally we store the computed value in C (STG). I’m just highlighting the important ones there are other instructions which will be executed. For the load operation if we correlate to kernel code it will point to Line 42 (kernel code) and for this reason, we are seeing Line 42 (kernel code) come in a later part of the pc sampling records as there is some stall happening because of memory dependencies.

salmon.sh · May 2, 2022, 4:12pm

Okay. I should look at the instructions.
But I’m not sure about one thing. Why does load operation point to line 42? Is it because A and B is declared in that line?
I’m thinking it should be line 45.

Greg · May 3, 2022, 10:31pm

The only lines associated with Line 42 are

11 source 4, functionId 1, pc 0x0, corr 1, samples 3, latency samples 3, stallreason Instruction fetch
12 source 4, functionId 1, pc 0x0, corr 1, samples 7, latency samples 7, stallreason Constant memory dependency

You can view the SASS disassembly either by (a) using Nsight Compute, (b) running nvdisasm/cuobjdump, or (c) using godbolt.

The instruction fetches are due to start up cost for new warps to fetch the first instruction.
The constant memory dependency is for constant cache misses. These are from instructions with c[bank][offset] in the opcode or LDC instructions. These exist in all global functions.

If you are trying to use CUPTI to get PC sampling data I highly recommend you run the same application in Nsight Compute to make sure you are correctly interpreting the data.

Topic		Replies	Views
question about PC_SAMPLING in CUPTI library CUDA Programming and Performance	10	1109	March 13, 2018
Read contents of PC Sampling Continuous output (dat file) CUPTI – CUDA Profiler Tools Interface	5	975	April 20, 2023
Line-by-line profiling for a single-kernel CUDA program CUDA Programming and Performance	6	3122	August 19, 2015
PC Sampling leads to large slow-downs in execution time? CUPTI – CUDA Profiler Tools Interface	1	888	August 16, 2019
How to correlate PC samples with CUDA calls in new version PC Sampling APIs CUPTI – CUDA Profiler Tools Interface	0	735	May 5, 2022
CUPTI Sample Code problem CUPTI – CUDA Profiler Tools Interface cuda , kernel	12	156	July 28, 2025
How to use CUPTI to get average instruction execution time? CUDA Programming and Performance	7	1051	March 20, 2018
NVIDIA® CUDA Profiler Tools Interface (CUPTI) for CUDA Toolkit 11.3 is now available CUPTI – CUDA Profiler Tools Interface	4	947	June 29, 2021
How to use CUPTI to get metrics for the Device Attributes CUPTI – CUDA Profiler Tools Interface cuda , samples , profiling	2	1000	December 13, 2022
CUDA Profiler Tools Interface (CUPTI) for CUDA Toolkit 12.0 is now available CUPTI – CUDA Profiler Tools Interface	5	1606	October 23, 2024

Interpretation of CUPTI results

Related topics