CUDA sporadically slows down

RichardKarls · May 28, 2018, 11:59am

Hi,

I have an application that runs 2 CUDA kernels repeatedly on different data sets on an Nvidia Quadro M4000. They normally take 35 ms and 180-200 ms per call, respectively. I have run the application on multiple PCs with the same hardware setup and it has run fine on each of them, except for the last one I installed it on.

At the start of the application, the kernels run at their usual 35 ms and 200 ms, then slow down to ~160 ms and ~1000 ms respectively (roughly a factor of 5). This happens at a seemingly random point in the application, even when using the same input data. Once the application has slowed down, it remains slow for the rest of its lifetime. Restarting the application brings it back to its glorious faster self (though sometimes the slowdown then sets in immediately at startup).

Like most of the other PCs that this application was run on, the PC was freshly set up with Windows 10. There should not be any other programs that would be using the GPU, let alone extensively enough to slow down my application significantly.

I have tried:

restarting the PC
updating the graphics driver
installing a second hard drive and switching the Quadro M4000 to TCC
running the application on other PCs using the same data
checking the task manager for any applications that shouldn’t be there

All to no avail.

Do you have any idea what might be causing this curious behavior, or what could help me find out? If it were a hardware failure, I would expect it to be more fatal (could it be a heat issue?). If it were another program competing for the GPU, I would expect the behavior to be less erratic. If it were the application or the data, I would expect it to be more consistent across multiple runs.

Caveats:

I’m using the Driver API rather than the Runtime API
the kernel runtimes were measured using synchronization and timing events rather than the profiler, as there is no profiler installed on that PC
I don’t know of a way to make sure nothing else is using the GPU, even though I’m confident that there shouldn’t be anything
I am unsure to what extent I can give you application code, if it happens to be relevant
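
On the question of whether anything else is using the GPU: one possible check, assuming `nvidia-smi` is installed with the driver and on the PATH, is to list the compute processes the driver reports. The sketch below is not part of the original application. The helper names (`parse_compute_apps`, `other_processes`, `list_compute_apps`) are made up for illustration, and under WDDM on Windows the list may be incomplete and will not include graphics-only processes.

```python
import subprocess

def parse_compute_apps(csv_text):
    """Parse `nvidia-smi --query-compute-apps=pid,process_name,used_memory` CSV output.

    Splits the pid from the left and the memory column from the right so that
    a process path containing a comma is still kept in one piece.
    """
    rows = []
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        pid_text, rest = line.split(",", 1)
        name, used_memory = rest.rsplit(",", 1)
        rows.append({
            "pid": int(pid_text),
            "name": name.strip(),
            "used_memory": used_memory.strip(),
        })
    return rows

def other_processes(rows, own_pid):
    """Return the rows that do not belong to our own process."""
    return [row for row in rows if row["pid"] != own_pid]

def list_compute_apps():
    """Run nvidia-smi and return the parsed process list (needs an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_compute_apps(result.stdout)
```

Calling `other_processes(list_compute_apps(), own_pid=os.getpid())` from a small script (with `import os`) while the application runs would show any other compute process the driver knows about. An empty result is weak evidence only, for the reasons given above.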

RichardKarls · May 28, 2018, 12:24pm

Never mind. Shortly after posting this, I was informed that we have another Quadro M4000 available for testing. Installing it fixed the problem (and revealed that the original Quadro M4000 was exceedingly hot). Turns out it was merely a hardware problem after all.

Robert_Crovella · May 28, 2018, 2:19pm

Yes, overheating would fit the pattern you describe.
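
For anyone hitting a similar pattern without a spare card to swap in, a rough way to check for thermal throttling is to poll the temperature and SM clock through `nvidia-smi`. The following is a minimal sketch under some assumptions: the query field names are those listed by `nvidia-smi --help-query-gpu` on typical drivers and may differ on yours, the 85 °C limit and the 0.6 clock fraction are arbitrary illustrative thresholds, and the function names are hypothetical.

```python
import subprocess

# Query fields; verify them with `nvidia-smi --help-query-gpu` on your driver.
FIELDS = [
    "temperature.gpu",
    "clocks.sm",
    "clocks.max.sm",
    "clocks_throttle_reasons.hw_slowdown",
]

def _to_int(text):
    """Return an int, or None for values such as '[N/A]' or '[Not Supported]'."""
    try:
        return int(text)
    except ValueError:
        return None

def parse_status_line(line):
    """Turn one CSV line (noheader, nounits) into a dict keyed by field name."""
    values = [value.strip() for value in line.split(",")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))

def looks_throttled(status, temp_limit_c=85, clock_fraction=0.6):
    """Heuristic only: hot, flagged by the driver, or clocked far below maximum.

    The clock test is meaningful only while a kernel is running, because an
    idle GPU lowers its clocks on its own.
    """
    temp = _to_int(status["temperature.gpu"])
    sm = _to_int(status["clocks.sm"])
    sm_max = _to_int(status["clocks.max.sm"])
    too_hot = temp is not None and temp >= temp_limit_c
    slow_clock = sm is not None and sm_max is not None and sm < clock_fraction * sm_max
    flagged = status["clocks_throttle_reasons.hw_slowdown"] == "Active"
    return too_hot or slow_clock or flagged

def read_gpu_status(gpu_index=0):
    """Query one GPU via nvidia-smi (needs an NVIDIA driver installed)."""
    result = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_status_line(result.stdout.strip().splitlines()[0])
```

Logging `read_gpu_status()` next to each kernel timing would show whether the jump from ~200 ms to ~1000 ms coincides with a temperature rise or a drop in SM clock.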