Randomly low GPU utilization

I have a program that computes deterministic results, but when I launch it I am only randomly “lucky” enough to get high GPU utilization.

When I start the program I often see ~5-10% GPU usage (via nvidia-smi), but if I cancel and retry a few times it will eventually “take hold” and the usage goes up to ~95-99%, with the program executing noticeably faster (and producing exactly the same results).

Does anyone have any insight as to what is causing this? Are there other settings that I need to apply to the card? I get the sense that the card is busy or something.

Like I said, everything is deterministic: whether it runs slowly or quickly, the results are the same. All I have to do is cancel and re-run with my fingers crossed, hoping the utilization is high the second time around.

The card itself is a Quadro K2000, and the OS is CentOS 6.6 with CUDA 6.5. There is (sometimes) a display attached to this card, but the OS is running at init level 3, and nvidia-smi won’t let me switch it to a compute-only mode anyway.
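
In case it’s useful, this is the kind of query I could run to check the compute mode from the runtime API (a minimal sketch; device 0 is assumed, and the printed numbers are just the cudaComputeMode enum values):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // minimal sketch: query device 0 and report its compute mode
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device: %s\n", prop.name);
    printf("compute mode: %d (0=default, 1=exclusive thread, 2=prohibited, 3=exclusive process)\n",
           prop.computeMode);
    return 0;
}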

Thanks.

2 causes that i know of that introduce a degree of ‘randomness’ into the execution path; i suppose there might be more

  1. the programming guide repeatedly mentions cases/ conditions that would lead to undefined behaviour
    since you obtain the same results each time, i might rule this out

  2. using a variable before its value is set
    the compiler generally warns against this, when it is able to pick it up
    however, the compiler cannot always pick it up (for instance, when reusing variables); see the sketch just after this list
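
purely as a hypothetical sketch of case 2 (the names and logic below are invented, not taken from your code): a variable that is only conditionally assigned, which the compiler may or may not flag:

__global__ void scale_kernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    float scale;                  // never given an initial value
    if (i == 0)
        scale = 2.0f;             // only thread 0 ever assigns it

    if (i < n)
        data[i] *= scale;         // every other thread reads an unset value
}

// a fixed version simply initializes the variable up front:
//   float scale = 2.0f;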

i would consider what you are facing a bug

you either have to debug more intensely, to spot such a point of ‘departure’

or, you could divide your code into sections and/ or functions and time-stamp these, so that you can detect execution time discrepancies across reruns and pinpoint a point/ region of ‘departure’ with greater accuracy
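
for the device-side sections, cuda events are one way to get those time stamps; a minimal sketch (my_kernel, grid, block, d_data and n are placeholders for your own code):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
my_kernel<<<grid, block>>>(d_data, n);   // one section/ kernel of interest
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("section A: %.3f ms\n", ms);      // compare this across reruns

cudaEventDestroy(start);
cudaEventDestroy(stop);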

or you could post code in the hope that someone would miraculously spot the error

Thanks for your response! I like your suggestion of time stamps, I’ll get started on setting that up.

Unfortunately the code is too large to post at this point… It has been an incremental upgrade of CPU-based software, so I’m just translating (already working) CPU code into GPU kernels.

I’m not receiving any compiler warnings, and I am confident everything is OK in the code. At this point I was thinking something is just set up incorrectly with the device. I inherited the system, so I didn’t know (and to be honest don’t know how to check) if there is a more involved configuration than simply installing CUDA via the RHEL / CentOS run file.

Two other things that I just realized might be relevant: First, the program exclusively uses host memory; is it possible that on some runs the host memory is what slows it down? Second, I am compiling with a Makefile that I adapted from the template example, so it’s compiling with all the architecture flags… this is probably unnecessary since the program will never leave this computer, but I just found something that worked and never bothered optimizing.

“I’m not receiving any compiler warnings, and I am confident everything is OK in the code”

at least you put a smile on my face. what i have discovered, and continue to discover, is that the purpose of debugging is to correct code that is perfectly written, yet hardly does what is intended

“I didn’t know (and to be honest don’t know how to check) if there is a more involved configuration than simply installing CUDA via the RHEL / CentOS run file”

it should be a breeze to switch back and forth between init levels 3 and 5; under level 5, you would be able to examine the ide/ project setup more easily; you could always revert from level 5 to 3 once you know the settings you want actually work
perhaps the command line is more powerful, but the gui makes it easy to get the ball rolling (in the right direction)

“the program exclusively uses host memory, is it possible on some runs, the host memory is what slows it down”

in general, the same code and reruns thereof should have the same, or at least a predictable, execution time; if the host memory is correctly utilized, it should not lead to significant/ inexplicable execution time discrepancies
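
if by ‘host memory’ you mean mapped/ pinned (zero-copy) allocations, the usual pattern looks roughly like this (a minimal sketch; my_kernel, grid, block and n are made up); every kernel access then travels over pcie, so it is slower than device memory, but it should be consistently slow rather than randomly slow:

float *h_data = 0, *d_alias = 0;

cudaSetDeviceFlags(cudaDeviceMapHost);    // must be set before the context is created
cudaHostAlloc((void**)&h_data, n * sizeof(float), cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&d_alias, h_data, 0);

my_kernel<<<grid, block>>>(d_alias, n);   // the kernel works directly out of host memory
cudaDeviceSynchronize();

cudaFreeHost(h_data);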

" I am compiling with a Makefile that…"

perhaps also create a (new) project in the (gui) ide, and use that to cross-compare

lastly, i realized that poor synchronization may also lead to a general scenario of “using a variable before its value is set”
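
a hypothetical sketch of that last point (my_kernel, use_result, stream and the buffers are placeholders): the host reads a result before the device has finished producing it:

my_kernel<<<grid, block, 0, stream>>>(d_out, n);
cudaMemcpyAsync(h_result, d_out, n * sizeof(float),
                cudaMemcpyDeviceToHost, stream);

use_result(h_result);            // bug: nothing waited for the copy, h_result may be stale

// fix: synchronize before touching the result
// cudaStreamSynchronize(stream);
// use_result(h_result);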