At first it takes 38 ms, then the time slowly increases to about 65-70 ms and plateaus there.
Why might this happen? I monitor the temperature of the GPU and it isn’t heating up. I have Windows 11 performance settings on best performance, to my knowledge…
How long does it take to slow down from 38 ms to 65-70 ms, compared to the overall running time of the test?
Are the pipeline steps between different iterations overlapping?
How do you measure the time?
As you posted in the CUDA forum, can you run Nsight Compute to test what the normal running time of your kernel(s) is? Then you can scale the expected time between the base and boost clock frequencies.
The pipeline is part of a unit test….
the time is measured in C++
Pseudo code:
allocate_device_memory();                       // done once, before the loop
for i = 0 to 305
    cv::imread("image_i_.png")                  // load input image i from disk
    // ... copy_to_device
    start = std::chrono::steady_clock::now()
    Pipeline_uses_12_streams();
    end = std::chrono::steady_clock::now()
    printf(end - start)
    ASSERT_EQ(something);                       // GTest check
    cv::imwrite("output.png")
end for
We only care about the pipeline time at this point. The output images on disk indicate that the kernels are giving the correct output. The test takes about 5 seconds or so.
I'll run the profiler and see if anything obvious stands out…
We need the fastest time possible since the goal is to process video frames, so the slowdown is a problem. Memory is pre-allocated since the image size is constant.
Normally one sees the opposite effect: the first run(s) are slower and then it picks up a bit.
I would research how to lock the GPU to the base clock for benchmarking purposes.
I would nevertheless introduce warmup calls before the benchmark (see the sketch below).
I would use Nsight Systems to see whether there is a change between the first and the last iterations regarding the kernel, the GPU, or the overall system.
I would use Nsight Compute with and without caching between iterations, and at both base and boost clocks, to see how long each kernel should nominally take (the difficulty being that this per-kernel result then has to be mapped to the whole pipeline with 12 streams).
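A minimal sketch of such a warmup phase, assuming the Pipeline_uses_12_streams() entry point from the pseudo code above (the iteration count is arbitrary); the point is to let clocks, caches, and the CUDA context settle before the timed loop starts:

#include <cuda_runtime.h>

void Pipeline_uses_12_streams();                // provided by the code under test

void warmup(int iterations = 10)
{
    for (int i = 0; i < iterations; ++i) {
        Pipeline_uses_12_streams();             // same workload, results discarded
    }
    cudaDeviceSynchronize();                    // make sure all warmup work has finished
    // For stable numbers, also consider locking the GPU clocks for the
    // measurement phase (e.g. via nvidia-smi, where the GPU/driver supports it).
}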
In that case, I would claim the test scaffolding should look like this:
for i = 0 to 305
    cv::imread("image_i_.png")
    // ... copy_to_device
    cudaDeviceSynchronize();                    // ensure GPU is done with all previously issued work
    start = std::chrono::steady_clock::now()
    Pipeline_uses_12_streams();
    cudaDeviceSynchronize();                    // ensure GPU is done with the workload under test
    end = std::chrono::steady_clock::now()
    printf(end - start)
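If stalling the whole device with cudaDeviceSynchronize() around every iteration is undesirable, GPU-side timestamps via CUDA events are an alternative; a minimal sketch, assuming (hypothetically) that the test has access to the pipeline's 12 stream handles in an array named streams, and that the GPU is idle when the start event is recorded (e.g. right after the preceding cudaDeviceSynchronize()):

#include <cuda_runtime.h>
#include <algorithm>

extern cudaStream_t streams[12];                // hypothetical: the pipeline's 12 streams
void Pipeline_uses_12_streams();                // provided by the code under test

float time_pipeline_ms()
{
    cudaEvent_t start, stop[12];
    cudaEventCreate(&start);
    for (int s = 0; s < 12; ++s) cudaEventCreate(&stop[s]);

    cudaEventRecord(start, 0);                  // GPU assumed idle at this point
    Pipeline_uses_12_streams();
    for (int s = 0; s < 12; ++s)
        cudaEventRecord(stop[s], streams[s]);   // completes when stream s has drained

    float worst_ms = 0.0f;
    for (int s = 0; s < 12; ++s) {
        cudaEventSynchronize(stop[s]);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop[s]);
        worst_ms = std::max(worst_ms, ms);      // the pipeline ends when the last stream ends
    }

    cudaEventDestroy(start);
    for (int s = 0; s < 12; ++s) cudaEventDestroy(stop[s]);
    return worst_ms;
}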
Profiling will provide truly meaningful data. From the scant information provided here one could speculate endlessly. By the way, it is unclear why the same image is being processed 305 times here. If these are in fact 305 different images, i.e. the pseudo code is processing image[i], my next question would be: does the control flow of the GPU code include any dependencies on the data being processed? If so, what do kernel execution times look like when the images are processed in reverse order? What work other than executing GPU kernels happens in Pipeline_uses_12_streams()?
Do you allocate or free memory inside the loop? It could be that the memory gets more fragmented. If you are optimizing for maximum performance, you should avoid allocating and freeing memory in each iteration (see the sketch below).
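A minimal sketch of that pattern, assuming a constant image size (the dimensions below are placeholders): pinned host and device buffers are allocated once and only copied into inside the loop:

#include <cuda_runtime.h>
#include <cstddef>
#include <cstring>

// Placeholder dimensions; the real values come from the constant image size.
constexpr size_t kWidth = 1920, kHeight = 1080, kChannels = 3;
constexpr size_t kBytes = kWidth * kHeight * kChannels;

unsigned char* h_pinned = nullptr;              // page-locked host staging buffer
unsigned char* d_image  = nullptr;              // device input buffer, reused every frame

void allocate_once()
{
    cudaMallocHost((void**)&h_pinned, kBytes);  // pinned memory speeds up H2D copies
    cudaMalloc((void**)&d_image, kBytes);
}

void upload_frame(const unsigned char* src, cudaStream_t stream)
{
    // No allocation or free per frame: just a copy into the preallocated buffers.
    std::memcpy(h_pinned, src, kBytes);
    cudaMemcpyAsync(d_image, h_pinned, kBytes, cudaMemcpyHostToDevice, stream);
}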
I have some data: the performance tracks inversely with the power. That is, the GPU power starts high and then throttles back. If I plot my kernel time in ms together with the power, you can see the trend clearly. So the system is throttling the power. I'm on a laptop, which may be doing this automatically, even though I think I have set everything I can to "best performance".
I currently do not have a mental model that would explain these observations.
In order to extract more performance from silicon despite the near death of Moore's Law, modern CPUs and GPUs try to reduce safety margins to within a few percent of failure. They do this by dynamically "boosting" clock frequencies and operating voltages, and they adjust these downwards when thermal and power limits would otherwise be exceeded, which would render processor operation unreliable. This throttling is enforced by sensors and power management mechanisms provided by the GPU hardware / firmware / driver itself and is generally agnostic of the workload.
nvidia-smi can show you the current power limit of your GPU as well as the maximum configurable power limit. The default power limit is often lower than the maximum that can be specified; you may want to raise the enforced power limit to the maximum allowed using nvidia-smi (you may need root / administrator privileges to do so). A way to query the same limits programmatically is sketched below.
I do not use laptops, so it is possible that additional power constraints are enforced on those by system components outside the GPU’s responsibility. The GPU throttling mechanisms should be independent of the (non-)use of gdm. In my experience they are stringent and fast acting (on the order of milliseconds), kicking in when power consumption exceeds about 85% of the enforced maximum. However, when monitoring a GPU for a long time one can see that actual power consumption can exceed the power limit by some percent, and by using high-resolution sensors people have shown that short-term spikes (in the microsecond range) up to 20% above the limit are quite common.
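If nvidia-smi is awkward to script on the target system, the same power information is available programmatically through NVML (link against nvidia-ml); a minimal sketch that reads the enforced power limit, the configurable limit range, and the current throttle reasons, on GPUs/drivers where these queries are supported:

#include <nvml.h>
#include <cstdio>

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);                    // GPU 0

    unsigned int enforced_mw = 0, min_mw = 0, max_mw = 0;
    nvmlDeviceGetEnforcedPowerLimit(dev, &enforced_mw);
    nvmlDeviceGetPowerManagementLimitConstraints(dev, &min_mw, &max_mw);
    printf("enforced power limit: %u mW (configurable %u..%u mW)\n",
           enforced_mw, min_mw, max_mw);

    unsigned long long reasons = 0;
    nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);
    if (reasons & nvmlClocksThrottleReasonSwPowerCap)
        printf("currently throttled by the software power cap\n");
    if (reasons & nvmlClocksThrottleReasonHwSlowdown)
        printf("currently throttled by a hardware slowdown (power brake / thermal)\n");

    nvmlShutdown();
    return 0;
}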
The GPU I'm using on the laptop is not the target GPU; the target GPU is a P3000.
This is a medical device with a custom motherboard. The device is already shipping and in use in the field, so an upgrade to the GPU is not in the cards short term. The Pascal series is not supported by Nsight Compute, so I can't run the profiling tools on the actual hardware.
I'm using NVML (via the CUDA toolkit) to monitor the power (a sampling sketch follows below). The P3000 is also classified as a mobile device, so it too may have the OS (Windows IoT Embedded) adjust power to preserve battery (or to manage heat or something).
I'll be testing on the target hardware after the holidays.
Since the company designed the motherboard with an OEM, perhaps we can control the power a bit. I'm also monitoring the temperature on my laptop and the fan doesn't even come on, so it must be trying to preserve power to be green or something (?) even though I'm plugged into the wall.
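For correlating the per-iteration times with power, a small NVML sampling thread is enough to log power draw and SM clock alongside the printed pipeline times; a minimal sketch (the function name and the 100 ms period are arbitrary choices):

#include <nvml.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<bool> g_stop{false};

// Samples power (mW) and SM clock (MHz) of GPU 0 roughly every 100 ms until stopped.
void sample_power_and_clocks()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);
    while (!g_stop.load()) {
        unsigned int power_mw = 0, sm_mhz = 0;
        nvmlDeviceGetPowerUsage(dev, &power_mw);
        nvmlDeviceGetClockInfo(dev, NVML_CLOCK_SM, &sm_mhz);
        printf("power %u mW, SM clock %u MHz\n", power_mw, sm_mhz);
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }
    nvmlShutdown();
}

// Usage: std::thread sampler(sample_power_and_clocks);
//        ... run the timed test loop ...
//        g_stop = true; sampler.join();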
I would encourage you to perform all performance measurements exclusively on the target device. From experience, performance observations are rarely portable between two different platforms, no matter how clever the reasoning one uses to translate the results. With a bit of work you should be able to deploy one of the target systems in your regression farm, so both correctness and performance “smoke” tests can be run on the actual target hardware with every code check-in.
It's entirely possible this is due to laptop/mobile power management. In fact, I would say you have probably answered the original question ("Why might this happen?") with the observation "The performance tracks inversely with the power". Neither NVIDIA nor the laptop manufacturer publishes detailed specifications about how power is managed.
NVIDIA doesn't provide end-user controls for mobile power management. However, you may wish to explore what is in the Windows control panel, though it seems you have already done that.
To some degree, even datacenter devices do something that could be viewed as "similar", although the timeframes in my experience are probably much shorter. "Instantaneous" or very short excursions above the device maximum power may be observed from time to time depending on the workload. However, the presence of these will generally cause device power management to kick in, lowering clocks, which will also have a noticeable, sometimes dramatic, effect on the performance profile. There isn't anything that can be done about that either. Locking the GPU clocks or similar does not override what are considered to be "necessary" power management operations; you cannot go beyond device design limits, and to some degree design limits may have a time component to them.
One memorable GPU for me was the T4. With a bit of searching you can find people who report that small, "short" bursts of tensor core ops on the T4 can result in relatively higher observed FLOPs/s, but continuous or heavier use of the tensor cores generally causes the device to drop down to a lower steady-state performance level. This is expected behavior and is due to power management.