Different durations reported by nvprof for the same kernel.

oukore · December 5, 2019, 10:12pm

Hi guys,

I have a CUDA kernel that is called in two programs. In one program (A), it is compiled to PTX, loaded as part of a CUmodule, and executed as a CUfunction. In the other program (B), it is compiled to an object file and statically linked. With the same compiler flags, I can confirm that the PTX code generated for this kernel is identical in both programs. What confuses me is that I consistently get 100 us (A) vs. 150 us (B) when I profile the two programs with nvprof, using the same input and running on the same GPU. I expected the two durations to be very close.

I have also checked other possible factors, including statically/dynamically allocated shared memory (not used), grid size and block size (same), and cache config (all prefer L1), and I am still clueless. Are there any suggestions for how I could investigate this further?

Thanks.

cbuchner1 · December 6, 2019, 8:47am

Could it be the GPU’s power management? Maybe it takes a bit of time for the GPU to ramp up from its idle P-state to the full-performance P-state. Whether the best-performing P-state has been reached by the time your kernel runs might then depend on the prior workload.

I don’t know whether you are calling that kernel repeatedly, causing constant load on the GPU, or whether it’s a one-time call.

For some of the professional-grade CUDA cards, the application clocks of the GPU and memory can be set to specific values. The consumer-level cards use more dynamic clock management (within configurable power limits).

cbuchner1 · December 6, 2019, 8:54am

Try this command to query the P-state and clocks once per second:

nvidia-smi --query-gpu=timestamp,pstate,clocks.current.graphics,clocks.current.memory --format=csv -l 1
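
If the command's output is redirected to a file, the points where the P-state changes can be picked out with a short script. The following is only a minimal sketch: the sample log is made up for illustration (it is not data from this thread), the row layout is an assumption about what the CSV output looks like, and the function name pstate_transitions is hypothetical.

```python
import csv
import io

# Illustrative sample only; these are NOT measurements from this thread.
# The layout (header row, then "timestamp, pstate, graphics, memory") is an
# assumption about the CSV output of the nvidia-smi command above.
SAMPLE_LOG = """\
timestamp, pstate, clocks.current.graphics [MHz], clocks.current.memory [MHz]
2019/12/06 20:58:01.000, P8, 300 MHz, 405 MHz
2019/12/06 20:58:02.000, P8, 300 MHz, 405 MHz
2019/12/06 20:58:03.000, P0, 1500 MHz, 5000 MHz
2019/12/06 20:58:04.000, P0, 1500 MHz, 5000 MHz
"""


def pstate_transitions(log_text):
    """Return (timestamp, pstate, graphics_clock) for each row whose P-state
    differs from the previous row (the first data row always counts)."""
    reader = csv.reader(io.StringIO(log_text), skipinitialspace=True)
    next(reader, None)  # skip the header row
    transitions = []
    previous = None
    for row in reader:
        if len(row) < 4:
            continue  # ignore blank or truncated lines
        timestamp, pstate, graphics = (field.strip() for field in row[:3])
        if pstate != previous:
            transitions.append((timestamp, pstate, graphics))
            previous = pstate
    return transitions


if __name__ == "__main__":
    for timestamp, pstate, graphics in pstate_transitions(SAMPLE_LOG):
        print(f"{timestamp}  {pstate}  {graphics}")
```

Comparing the timestamp of the switch to P0 against the time the kernel was launched shows whether the GPU had already ramped up by then.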

oukore · December 6, 2019, 8:58pm

Interesting… The kernel is called repeatedly in the real application, but when I profiled it I let it run at most twice. I had never considered that this might be an issue (and had never heard of it), especially since I didn’t notice big fluctuations in the reported durations.
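
With more than two runs per program, the run-to-run fluctuation can be quantified instead of eyeballed. The sketch below is a hedged illustration, not a result: the duration lists are invented placeholders, and the function name summarize and its warmup parameter are hypothetical. It drops the first run as warm-up and reports the mean and sample standard deviation of the remaining runs.

```python
import statistics


def summarize(durations_us, warmup=1):
    """Drop the first `warmup` runs, then return (mean, sample stdev) in us."""
    steady = durations_us[warmup:]
    if not steady:
        raise ValueError("need at least one run after the warm-up runs")
    mean_us = statistics.mean(steady)
    # stdev needs at least two samples; report 0.0 for a single run.
    stdev_us = statistics.stdev(steady) if len(steady) > 1 else 0.0
    return mean_us, stdev_us


if __name__ == "__main__":
    # Placeholder numbers for illustration only; NOT measurements from
    # programs A and B in this thread.
    runs = {
        "A": [131.0, 101.0, 99.0, 100.0, 100.0],
        "B": [180.0, 151.0, 149.0, 150.0, 150.0],
    }
    for name, durations in runs.items():
        mean_us, stdev_us = summarize(durations)
        print(f"{name}: mean {mean_us:.1f} us, stdev {stdev_us:.2f} us")
```

If the gap between the two means is many times larger than either standard deviation, warm-up effects and noise are unlikely to account for it.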

The reported P-states all started at P8 and switched to P0, apparently before this kernel was launched. So power management doesn’t seem to explain the difference here, but if the cause really is something this obscure, I’ll set it aside for now.

Thank you.