Why would code run 1.7x faster when run with nvprof than without?

There is no difference between managed and device managed

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#managed

Are there parts of the kernel that you can “dummy up” or temporarily remove so you can zero in on what portion of the kernel is responsible for the slowdown? In other words, you would try to derive the minimal set of code that still exhibits the same properties as the full code.
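
For instance, a compile-time switch lets you disable one suspect section at a time and rebuild (a minimal sketch; myKernel and the work it does are purely illustrative):

```
// Minimal sketch of "dummying up" part of a kernel: toggle a suspect
// section with a compile-time switch and rebuild. myKernel and the work
// shown here are purely illustrative.
#define ENABLE_ATOMIC_SECTION 1   // set to 0 to compile the section out

__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i] * 2.0f;        // cheap stand-in for the real work

#if ENABLE_ATOMIC_SECTION
    atomicAdd(&out[i % 64], v);    // suspect section under test
#else
    out[i] = v;                    // dummy replacement with similar memory traffic
#endif
}
```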

Here is another sanity check (by inspection): Is all CUDA code in the application built for the exact compute capability of your GPU?
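
For reference, a quick host-side check could look like the sketch below; the assumed build target (sm_60) is just an example and should match whatever -arch/-gencode flags you actually compile with:

```
#include <cstdio>
#include <cuda_runtime.h>

// Sanity check: report the device's compute capability and compare it with
// the (assumed) build target, e.g. -gencode arch=compute_60,code=sm_60.
int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device 0: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    if (prop.major != 6 || prop.minor != 0) {   // assumed target: sm_60 (P100)
        std::printf("Warning: device is sm_%d%d, not the assumed sm_60 build target;\n"
                    "kernels may be JIT-compiled from PTX or fail to launch.\n",
                    prop.major, prop.minor);
    }
    return 0;
}
```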

txbob - yes, I just realized I mixed up a few tests in my head; it did indeed make no difference.

njuffa - re: valgrind: roger. There was no unique output other than what I pasted.

The kernel itself is only a handful of lines and operations, so it should be easy enough to isolate whether, say, it’s the atomic operation, as I suspect.

The code is compiled for sm_60 and run on a P100. I am running the latest driver and toolkit (9.1 as recently released).

I am out of ideas for now.

I assume that the kernel does not include math functions with highly data-dependent execution times, e.g. sin(0) vs. sin(1e30). In any event, that would not explain the significant variability between application runs, because you already checked that input and output are identical in the slow and the fast cases.

Yes. The only thing that is weird about the kernel (i.e. its intentional implementation) - and it is still exactly the same no matter what - is that it proceeds linearly through a 3D array and has switch-cases based on the value of one of the three indices. There is then a non-one-to-one mapping of threads (based on the large input array index) into the smaller output array, which is what is atomically added to. But those details are identical from run to run, i.e., they depend on the indices but not on the actual data.
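
Roughly the shape of it (a stripped-down sketch with made-up names and sizes, not the actual code):

```
// Stripped-down sketch of the kernel shape described above; the names and
// dimensions (NX, NY, NZ, NOUT) are made up for illustration.
#define NX   128
#define NY   128
#define NZ   64
#define NOUT 256

__global__ void reduceByIndex(const double *in, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NX * NY * NZ) return;

    // One of the three logical indices of the 3D array (k varies fastest here).
    int k = idx % NZ;

    double val = in[idx];

    // The switch-cases depend on an index, never on the data itself.
    switch (k) {
    case 0:  val *= 0.5; break;
    case 1:  val *= 2.0; break;
    default: break;
    }

    // Many-to-one mapping into the smaller output array; double-precision
    // atomicAdd requires sm_60 or later.
    atomicAdd(&out[idx % NOUT], val);
}
```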

Even more fun: while trying to isolate the problem part of the kernel (which I was doing after I wrote the above paragraph), I discovered that the original problem (slower than it should be without nvprof) persists even if I disable the part of the program which calls the problem kernel. So… back to square one?

Actually, now it’s an entirely different kernel which is slowing down over the life of the application - it starts at normal speed and gets slower toward the end. The profiler shows an average time per launch of 4.2364ms, a minimum of 913.21us (from past experience, this is what I believe the average should be), and a maximum of 9.1076ms (which seems to occur at the end of the program, e.g. the last time the kernel is called before the program ends).

So maybe a better, more general question would be: what can cause a kernel to get slower on subsequent calls? The kernel is fairly generic; it’s basically just performing a Runge-Kutta step in the integration of my system of PDEs.
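
For a sense of scale, the kernel is of this general form (a hypothetical sketch of a single explicit RK stage, not my actual code):

```
// Hypothetical sketch of one explicit Runge-Kutta stage: y_next = y + dt * k,
// applied pointwise over the solution array. Not the actual kernel.
__global__ void rkStage(const double *y, const double *k,
                        double *y_next, double dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y_next[i] = y[i] + dt * k[i];
    }
}
```

Each launch does the same amount of work on the same-sized arrays, which is why the growing per-launch time is so puzzling.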

Is the slow-down in this other kernel sufficient to explain the overall app-level performance difference? Some wild speculations:

(1) The GPU is hitting the thermal limit and clocking down temporarily because of that. If you monitor the GPU with nvidia-smi, do you see unusually high temperatures or reports of clock throttling? (A programmatic way to poll the same counters is sketched after item (2).)

(2) Another application (which may or may not be GPU accelerated) is sometimes hammering the host system, or the GPU, or both. Check ps output for such processes, particularly any running at higher priority than your own app (the flimsy hypothesis behind this is not that the kernel itself is getting slower, but that high load interferes with the measurements as well as the host portion of the code).
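
Regarding (1): besides watching nvidia-smi, the same information can be polled programmatically through NVML (a sketch, assuming the NVML headers are installed and the program is linked with -lnvidia-ml):

```
#include <cstdio>
#include <nvml.h>

// Poll GPU temperature and clock-throttle reasons via NVML; this is roughly
// what nvidia-smi reports under "Clocks Throttle Reasons".
int main()
{
    nvmlDevice_t dev;
    unsigned int temp = 0;
    unsigned long long reasons = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);
        std::printf("GPU temperature: %u C\n", temp);
        if (reasons & nvmlClocksThrottleReasonSwPowerCap)
            std::printf("Throttling: SW power cap\n");
        if (reasons & nvmlClocksThrottleReasonHwSlowdown)
            std::printf("Throttling: HW slowdown (thermal or power brake)\n");
    }
    nvmlShutdown();
    return 0;
}
```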

You might also want to check system logs (e.g. dmesg) to see whether any issues are reported, in particular related to the GPU.

Since we seem to be grasping at straws at this point (basically we are now at: random slowdowns affecting random portions of the app), you might want to try swapping in a different GPU to see whether that makes any difference.

Sorry, should’ve been clear - yes, these discrepancies are large enough to explain the overall differences.

Will check on (1). Re: (2) - nothing else is running on this machine except my ssh and screen sessions.

I do have access to a K40 I can test on freely, but don’t have the ability to swap the specific GPU in this system I have been using. I’m hoping to have access to some Volta hardware (so glad they brought FP64 back to the Titan line) or other P100s, but that wouldn’t be for several months at least.

Thanks for grasping at straws with me, it’s much appreciated. I will report back when I have more concrete data.

There has to be a rational reason that explains your observations, and by continuing with the elimination process we should eventually get to the bottom of this. The longest bug hunts I have participated in took about two weeks of full-time work on a single issue, but that was with full access to the hardware and software.

Thermals: the problem application, when running slowly, fluctuates rapidly between ~50 and ~90 watts, holds steady at 35 degrees C, and shows 97-100% volatile GPU utilization. My “normal” programs hold steady at ~145 watts and 95-98% utilization, climbing to a maximum of ~51C.

I’m just grateful that the bug doesn’t seem to be affecting the results; otherwise I would be freaking out. The slowdown is annoying, but I’m not currently scaling to anything large enough to care about it (and - I haven’t checked this recently, but I remember observing it in the past - I think the slowdowns went away when I went up to our “production” problem size).

nvidia-smi -q shows that no clock-throttle reason is active while the program runs, and the performance state is P0.

Okay, tentative discovery: by commenting out progressively less and less of my code, I zeroed in on what may be the source of the issue. The problem went away when I limited the number of concurrent kernels in one part of my code - eight total in two groups of four, each group accessing different segments of the same array. I have concurrent kernels in a similar style elsewhere, so it’s quite unclear to me what effect this could be having. Regardless, as has been evident, it’s been very difficult to make definitive statements in this testing since the behavior is so erratic and (seemingly) non-deterministic. I am hopeful that this is indeed “the” solution, but I am still worried it could be a fluke.
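
For context, the overlapping pattern looks roughly like this (a schematic sketch; the kernel, names, and sizes are placeholders, not my actual code):

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-segment work.
__global__ void processSegment(double *seg, size_t segLen)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < segLen) seg[i] += 1.0;
}

int main()
{
    const int    NSTREAMS = 8;        // two groups of four in my case
    const size_t SEGLEN   = 1 << 20;  // invented segment size
    const size_t N        = NSTREAMS * SEGLEN;

    double *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(double));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    dim3 threads(256);
    dim3 blocks((unsigned int)((SEGLEN + threads.x - 1) / threads.x));

    // Overlapping version: each launch gets its own stream and a disjoint
    // segment of the same array, so the eight kernels can run concurrently.
    for (int s = 0; s < NSTREAMS; ++s)
        processSegment<<<blocks, threads, 0, streams[s]>>>(d_data + s * SEGLEN, SEGLEN);

    // Less-concurrent variant being tested: the same launches into the
    // default stream, so they run back to back instead of overlapping.
    // for (int s = 0; s < NSTREAMS; ++s)
    //     processSegment<<<blocks, threads>>>(d_data + s * SEGLEN, SEGLEN);

    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}
```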

Never mind. Slowdown persists when nvprof is not used.

Okay! Got rid of ALL overlapping kernel executions and there are no more slowdowns, with or without nvprof. Crossing my fingers that this persists…

Interesting observation / sleuthing. I have no idea what kind of cause-effect relationship tied to concurrent kernels would lead to these significant performance fluctuations.

Yep, very strange.