Question about PC_SAMPLING in the CUPTI library

Hello,

I have a question about the PC_SAMPLING feature supported in CUPTI.
Since the sampling period is defined in GPU clock cycles, which thread/warp does it target at each sample point? We can have thousands of threads (organized in warps) running and stalling at different instructions at the same moment, and different threads have different stacks. When each sample is correlated to source information, which thread's location does it refer to?

Thanks

PC sampling is available in GM20x and later GPUs. Each SM has a PC sampling unit that can be configured to sample every N cycles; N is configurable through CUPTI. On each interrupt the sampling unit selects an active warp using a round-robin selection algorithm and emits a record that includes the selected warp’s program counter address and the warp’s scheduler state. The program address is the address of the next instruction to be scheduled.
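For reference, something along these lines is how the period is typically selected through the Activity API. This is only a minimal sketch assuming the CUDA 8/9-era cuptiActivityConfigurePCSampling interface (the helper name enablePcSampling is mine); check cupti_activity.h for the exact struct layout in your toolkit version.

#include <cuda.h>
#include <cupti.h>
#include <stdio.h>

/* Configure the per-SM sampling period for a context before any kernels
   are launched on it, then enable PC sampling activity records. */
void enablePcSampling(CUcontext ctx)
{
  CUpti_ActivityPCSamplingConfig config = {0};
  config.size = sizeof(config);
  /* A longer period (PERIOD_MAX) means fewer interrupts per SM and lower
     overhead, but fewer samples; PERIOD_MIN is the opposite trade-off. */
  config.samplingPeriod = CUPTI_ACTIVITY_PC_SAMPLING_PERIOD_MAX;

  CUptiResult res = cuptiActivityConfigurePCSampling(ctx, &config);
  if (res != CUPTI_SUCCESS) {
    const char *err;
    cuptiGetResultString(res, &err);
    fprintf(stderr, "cuptiActivityConfigurePCSampling failed: %s\n", err);
    return;
  }
  cuptiActivityEnable(CUPTI_ACTIVITY_KIND_PC_SAMPLING);
}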

Hello, Greg

Thanks a lot for your explanation!
So if only one active warp is selected and its state recorded at each sample, is that representative enough to reveal performance issues?

Also, when I used PC_SAMPLING on LULESH (a CUDA benchmark downloaded from the LLNL website), the execution time grew to 20x the original even when I used the maximum sampling period. Does that sound normal? (I couldn’t find my last post where you commented on this; it seems to have been deleted by the forum administrator somehow :( )

Thanks

To my knowledge CUPTI (CUDA 8.0-9.0) serializes the kernel launch and enters a replay loop when collecting PC sampling. This is likely what is adding the overhead. During execution of the kernel there should only be interference in performance if the kernel is saturating GPU-to-system-memory bandwidth.

On each sample period only 1 warp per SM is selected. This is a method of statistical sampling and is accurate if sufficient samples are captured. It may be necessary to replay the kernel multiple times to collect sufficient samples.
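As a rough back-of-the-envelope check (every number below is an illustrative assumption, not a measurement), the expected sample count per kernel pass is roughly numSMs * kernelCycles / samplingPeriod:

#include <stdio.h>

int main(void)
{
  const double clock_hz   = 1.3e9;   /* assumed SM clock: 1.3 GHz       */
  const double kernel_sec = 1.0e-3;  /* assumed kernel duration: 1 ms   */
  const int    num_sms    = 56;      /* assumed SM count (e.g. GP100)   */
  const long   period     = 1 << 17; /* assumed sampling period, cycles */

  double kernel_cycles = clock_hz * kernel_sec;
  double samples = num_sms * kernel_cycles / (double)period;

  /* With these numbers: 56 * 1.3e6 / 131072 is roughly 555 samples per
     pass, so a short kernel may need several replay passes before the
     per-PC counts are statistically meaningful. */
  printf("estimated samples per pass: %.0f\n", samples);
  return 0;
}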

I do not know if CUPTI has implemented a convergence algorithm to determine how many times to replay the kernel. In the initial release of PC sampling, CUPTI executed the kernel only 1 time, which was not sufficient for short kernels that had only 1 wave of thread blocks.

Hello, Greg

Is it possible to get the GPU callstack at each sample point when using PC_SAMPLING? Or is it too expensive to include in each record?

Thanks

Hello, Greg

Currently, is there any way to associate some data (or add a callback) whenever a new activity record is produced? I need some extra data beyond what CUpti_ActivityPCSampling3 provides.

Thanks

From 12/06/2017 - No, it is not currently possible to collect a GPU callstack. The current implementation uses a hardware sampler (not an interrupt) that is not capable of reading the callstack. Reading the callstack is very expensive and would have a very large impact on the performance of the kernel.

From 12/20/2017 - What additional information are you trying to collect? No additional information can be collected at the time of the sample on the GPU. I’m trying to determine whether you need metadata from the CPU or are hoping to collect more information from the kernel itself. PC sampling is not implemented as an interrupt handler, as is common on a microprocessor, because the weight of a trap is too expensive on the GPU and would have too big an impact on the execution of the kernel.
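If it is CPU-side metadata you are after, the usual approach is to join your own data against the records when CUPTI delivers them to the host. Below is only a sketch of the standard activity-buffer callback pattern (the buffer size, the field names such as pcOffset/stallReason, and the helper registerActivityCallbacks are assumptions from memory; verify them against your CUPTI headers):

#include <cupti.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (8 * 1024 * 1024)

/* CUPTI asks for a buffer to fill with activity records. */
static void CUPTIAPI bufferRequested(uint8_t **buffer, size_t *size,
                                     size_t *maxNumRecords)
{
  *buffer = (uint8_t *)malloc(BUF_SIZE);
  *size = BUF_SIZE;
  *maxNumRecords = 0;  /* let CUPTI pack as many records as fit */
}

/* CUPTI hands back a filled buffer; this is where PC sampling records
   become visible on the host. Extra data cannot be attached on the GPU
   at sample time, but here each record can be joined against CPU-side
   metadata keyed by correlationId (e.g. a map filled from the
   kernel-launch callback). */
static void CUPTIAPI bufferCompleted(CUcontext ctx, uint32_t streamId,
                                     uint8_t *buffer, size_t size,
                                     size_t validSize)
{
  CUpti_Activity *record = NULL;
  while (cuptiActivityGetNextRecord(buffer, validSize, &record) == CUPTI_SUCCESS) {
    if (record->kind == CUPTI_ACTIVITY_KIND_PC_SAMPLING) {
      CUpti_ActivityPCSampling3 *pc = (CUpti_ActivityPCSampling3 *)record;
      printf("corr=%u pcOffset=0x%llx samples=%u stall=%u srcLoc=%u\n",
             pc->correlationId, (unsigned long long)pc->pcOffset,
             pc->samples, (unsigned)pc->stallReason, pc->sourceLocatorId);
    }
  }
  free(buffer);
}

void registerActivityCallbacks(void)
{
  cuptiActivityRegisterCallbacks(bufferRequested, bufferCompleted);
}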

Thanks a lot, Greg!

I’m still trying to get the GPU callstack while sampling. Since you said it’s not possible for CUPTI to get it, I was thinking of building a shadow stack as the kernel executes and associating it with the samples. However, based on your answer, that doesn’t seem like an option either.

Thanks

hzhang86

Since you’ve obviously played with the PC_SAMPLING example, did you happen to look at the sass_source_map example? If so, did you ever try to do this

#define DUMP_CUBIN 1

and then manipulate the cubin that it wrote?

–Bob

Hello, Bob

No, I don’t need the cubin, so I disabled it.

Thanks

hzhang86

The comment in the code tells you to run nvdisasm on the cubin. When I do that with their example I get an error.

–Bob