I have an application that runs two CUDA kernels repeatedly on different data sets on an NVIDIA Quadro M4000. The kernels normally take 35 ms and 180-200 ms per call, respectively. I have run the application on several PCs with the same hardware setup, and it has run fine on all of them except the most recent one I installed it on.
On that PC, the kernels start out at their usual runtimes of 35 ms and 200 ms and then slow down to ~160 ms and ~1000 ms respectively (roughly a factor of 5). This happens at a seemingly random point in the run, even with the same input data. Once the application has slowed down, it stays slow for the rest of its lifetime. Restarting the application brings it back to its glorious faster self (though sometimes the slowdown occurs immediately at startup).
Like most of the other PCs that this application was run on, the PC was freshly set up with Windows 10. There should not be any other programs that would be using the GPU, let alone extensively enough to slow down my application significantly.
I have tried:
- restarting the PC
- updating the graphics driver
- installing a second hard drive and switching the Quadro M4000 to TCC
- running the application on other PCs using the same data
- checking the task manager for any applications that shouldn’t be there
All to no avail.
Do you have any idea what might be causing this curious behavior, or what could help me find out? If it’s a hardware failure, I would expect it to be more fatal (could it be a heat issue?). If it’s another program competing for the GPU, I would expect the behavior to be less erratic. If it’s the application or the data, I would expect it to be more consistent across multiple runs.
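To check the heat hypothesis, one thing I could do is log the GPU temperature and clocks while the application runs and see whether the slowdown coincides with a clock drop. Below is a minimal sketch in Python, assuming `nvidia-smi` is on the PATH of that PC and that the installed driver supports these query fields. The field selection, the `gpu_log.csv` file name, and the helper names are my own choices for illustration, not part of the application.

```python
import csv
import io
import subprocess
import time

# Query fields, as listed by `nvidia-smi --help-query-gpu`.
# Assumption: the installed driver version supports all of them.
FIELDS = ["timestamp", "temperature.gpu", "clocks.sm", "clocks.mem",
          "pstate", "clocks_throttle_reasons.active"]


def parse_samples(csv_text):
    """Turn nvidia-smi CSV output (no header, no units) into a list of dicts."""
    samples = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        samples.append({name: value.strip() for name, value in zip(FIELDS, row)})
    return samples


def query_gpu(index=0):
    """Take one sample from the GPU at the given index (assumed to be the M4000)."""
    cmd = ["nvidia-smi", "-i", str(index),
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_samples(out)


def log_samples(path="gpu_log.csv", interval_s=1.0, index=0):
    """Write one sample per interval to a CSV file until interrupted with Ctrl+C."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        try:
            while True:
                writer.writerows(query_gpu(index))
                f.flush()  # keep the file current in case the logger is killed
                time.sleep(interval_s)
        except KeyboardInterrupt:
            pass
```

Calling `log_samples()` in a second console while the application runs would produce one row per second. A drop in `clocks.sm` or a change in the throttle-reason bitmask at the moment the kernels slow down would point toward throttling rather than the application itself.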
Additional notes:
- I’m using the Driver API rather than the Runtime API
- the kernel runtimes were measured using synchronization and timing events rather than the profiler, as there is no profiler installed on that PC
- I don’t know of a way to make sure nothing else is using the GPU, even though I’m confident that there shouldn’t be anything
- I am unsure to what extent I can give you application code, if it happens to be relevant
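Regarding other programs using the GPU: one possibility I have not verified on that PC would be to list the compute processes that `nvidia-smi` reports and filter out my own. Below is a rough sketch in Python, assuming `nvidia-smi` is on the PATH and supports the `--query-compute-apps` fields used here. The helper names are hypothetical, and graphics-only processes would presumably not show up in this list.

```python
import subprocess

# Fields as listed by `nvidia-smi --help-query-compute-apps`.
APP_FIELDS = ["pid", "process_name", "used_memory"]


def parse_compute_apps(csv_text):
    """Parse `pid, process_name, used_memory` lines into a list of dicts."""
    apps = []
    for line in csv_text.splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        apps.append({
            "pid": parts[0],
            # Re-join the middle in case the process path itself contains a comma.
            "process_name": ",".join(parts[1:-1]),
            # May be a placeholder such as "[N/A]" depending on driver mode.
            "used_memory": parts[-1],
        })
    return apps


def list_compute_apps():
    """Ask nvidia-smi which processes currently hold a compute context."""
    cmd = ["nvidia-smi",
           "--query-compute-apps=" + ",".join(APP_FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


def other_processes(apps, own_pid):
    """Return every reported process except the one with the given PID."""
    return [a for a in apps if a["pid"] != str(own_pid)]
```

Running `other_processes(list_compute_apps(), own_pid)` with the PID of my application, once before and once after the slowdown, should show whether anything else has appeared on the GPU in the meantime.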