Question about PC_SAMPLING in the CUPTI library

Hello,

I have a question about the PC_SAMPLING feature supported in CUPTI.
Since the sampling period is defined in GPU clock cycles, which thread/warp does it target at each sample point? We can have thousands of threads (organized in warps) running and stalling at different instructions at the same moment, and different threads have different stacks. When each sample is correlated to source information, which thread's location does it refer to?

Thanks

PC sampling is available in GM20x and later GPUs. Each SM has a PC sampling unit that can be configured to sample every N cycles; N is configurable through CUPTI. On each interrupt the sampling unit selects an active warp using a round-robin selection algorithm and emits a record that includes the selected warp’s program counter address and the warp’s scheduler state. The program address is the address of the next instruction to be scheduled.
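For reference, something along these lines is how the period is typically selected through the Activity API. This is only a minimal sketch assuming the CUDA 8/9-era cuptiActivityConfigurePCSampling interface (the helper name enablePcSampling is mine); check cupti_activity.h for the exact struct layout in your toolkit version.

#include <cuda.h>
#include <cupti.h>
#include <stdio.h>

/* Configure the per-SM sampling period for a context before any kernels
   are launched on it, then enable PC sampling activity records. */
void enablePcSampling(CUcontext ctx)
{
  CUpti_ActivityPCSamplingConfig config = {0};
  config.size = sizeof(config);
  /* A longer period (PERIOD_MAX) means fewer interrupts per SM and lower
     overhead, but fewer samples; PERIOD_MIN is the opposite trade-off. */
  config.samplingPeriod = CUPTI_ACTIVITY_PC_SAMPLING_PERIOD_MAX;

  CUptiResult res = cuptiActivityConfigurePCSampling(ctx, &config);
  if (res != CUPTI_SUCCESS) {
    const char *err;
    cuptiGetResultString(res, &err);
    fprintf(stderr, "cuptiActivityConfigurePCSampling failed: %s\n", err);
    return;
  }
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_PC_SAMPLING);
}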

Hello, Greg

Thanks a lot for your explanation!
So if only one active warp is selected and its state recorded at each sample, is that representative enough to reveal performance issues?

Also, when I used PC_SAMPLING on LULESH (a CUDA benchmark downloaded from the LLNL website), the execution time grew to 20x the original even when I used the maximum sampling period. Does that sound normal? (I couldn’t find my last post where you commented on this; it seems to have been deleted by the forum administrator somehow :( )

Thanks

To my knowledge CUPTI (CUDA 8.0-9.0) serializes the kernel launch and enters a replay loop when collecting PC sampling. This is likely what is adding the overhead. During execution of the kernel there should only be interference in performance if the kernel is saturating GPU-to-system-memory bandwidth.

On each sample period only 1 warp per SM is selected. This is a method of statistical sampling and is accurate if sufficient samples are captured. It may be necessary to replay the kernel multiple times to collect sufficient samples.
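As a rough back-of-the-envelope check (every number below is an illustrative assumption, not a measurement), the expected sample count per kernel pass is roughly numSMs * kernelCycles / samplingPeriod:

#include <stdio.h>

int main(void)
{
  const double clock_hz   = 1.3e9;   /* assumed SM clock: 1.3 GHz       */
  const double kernel_sec = 1.0e-3;  /* assumed kernel duration: 1 ms   */
  const int    num_sms    = 56;      /* assumed SM count (e.g. GP100)   */
  const long   period     = 1 << 17; /* assumed sampling period, cycles */

  double kernel_cycles = clock_hz * kernel_sec;
  double samples = num_sms * kernel_cycles / (double)period;

  /* With these numbers: 56 * 1.3e6 / 131072 is roughly 555 samples per
     pass, so a short kernel may need several replay passes before the
     per-PC counts are statistically meaningful. */
  printf("estimated samples per pass: %.0f\n", samples);
  return 0;
}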

I do not know if CUPTI has implemented a convergence algorithm to determine how many times to replay the kernel. In the initial release of PC sampling, CUPTI executed the kernel only 1 time, which was not sufficient for short kernels that had only 1 wave of thread blocks.

Hello, Greg

Is it possible to get the GPU callstack at each sample point when using PC_SAMPLING? Or is it too expensive to include in each record?

Thanks

Hello, Greg

Currently, is there any way to associate some data (or add a callback) whenever a new activity record is produced? I need some extra data beyond what CUpti_ActivityPCSampling3 provides.

Thanks

From 12/06/2017 - No, it is not currently possible to collect a GPU callstack. The current implementation uses a hardware sampler (not an interrupt) that is not capable of reading the callstack. Reading the callstack is very expensive and would have a very large impact on the performance of the kernel.

From 12/20/2017 - What additional information are you trying to collect? No additional information can be collected at the time of the sample on the GPU. I’m trying to determine whether you need metadata from the CPU or are hoping to collect more information from the kernel itself. PC sampling is not implemented as an interrupt handler, as is common on a microprocessor, because the weight of a trap is too expensive on the GPU and would have too big an impact on the execution of the kernel.
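If it is CPU-side metadata you are after, the usual approach is to join your own data against the records when CUPTI delivers them to the host. Below is only a sketch of the standard activity-buffer callback pattern (the buffer size, the field names such as pcOffset/stallReason, and the helper registerActivityCallbacks are assumptions from memory; verify them against your CUPTI headers):

#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (8 * 1024 * 1024)

/* CUPTI asks for a buffer to fill with activity records. */
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
  *buffer = (uint8_t *)malloc(BUF_SIZE);
  *size = BUF_SIZE;
  *maxNumRecords = 0;  /* let CUPTI pack as many records as fit */
}

/* CUPTI hands back a filled buffer; this is where PC sampling records
   become visible on the host. Extra data cannot be attached on the GPU
   at sample time, but here each record can be joined against CPU-side
   metadata keyed by correlationId (e.g. a map filled from the
   kernel-launch callback). */
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
  CUpti_Activity *record = NULL;
  while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_PC_SAMPLING) {
      CUpti_ActivityPCSampling3 *pc = (CUpti_ActivityPCSampling3 *)record;
      printf("corr=%u pcOffset=0x%llx samples=%u stall=%u srcLoc=%u\n",
             pc->correlationId, (unsigned long long)pc->pcOffset,
             pc->samples, (unsigned)pc->stallReason, pc->sourceLocatorId);
    }
  }
  free(buffer);
}

void registerActivityCallbacks(void)
{
  cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
}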

Thanks a lot, Greg!

I’m still trying to get the GPU callstack while sampling. Since you said it’s not possible for CUPTI to get it, I was thinking of building a shadow stack as the kernel executes and associating it with the samples. However, based on your answer, that doesn’t seem like an option either.

Thanks

hzhang86

Since you’ve obviously played with the PC_SAMPLING example, did you happen to look at the sass_source_map example? If so, did you ever try to do this

#define DUMP_CUBIN 1

and then manipulate the cubin that it wrote?

–Bob

Hello, Bob

No, I don’t need the cubin, so I disabled it.

Thanks

hzhang86

The comment in the code tells you to run nvdisasm on the cubin. When I do that with their example I get an error.

–Bob