Why would code run 1.7x faster when run with nvprof than without?

There is no difference between managed and device managed

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#managed

Are there parts of the kernel that you can “dummy up” or temporarily remove so you can zero in on what portion of the kernel is responsible for the slowdown? In other words, you would try to derive the minimal set of code that still exhibits the same properties as the full code.
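
For instance, a compile-time switch lets you disable one suspect section at a time and rebuild (a minimal sketch; myKernel and the work it does are purely illustrative):

```
// Minimal sketch of "dummying up" part of a kernel: toggle a suspect
// section with a compile-time switch and rebuild. myKernel and the work
// shown here are purely illustrative.
#define ENABLE_ATOMIC_SECTION 1   // set to 0 to compile the section out

__global__ void myKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float v = in[i] * 2.0f;        // cheap stand-in for the real work

#if ENABLE_ATOMIC_SECTION
    atomicAdd(&out[i % 64], v);    // suspect section under test
#else
    out[i] = v;                    // dummy replacement with similar memory traffic
#endif
}
```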

Here is another sanity check (by inspection): Is all CUDA code in the application built for the exact compute capability of your GPU?
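
For reference, a quick host-side check could look like the sketch below; the assumed build target (sm_60) is just an example and should match whatever -arch/-gencode flags you actually compile with:

```
#include <cstdio>
#include <cuda_runtime.h>

// Sanity check: report the device's compute capability and compare it with
// the (assumed) build target, e.g. -gencode arch=compute_60,code=sm_60.
int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Device 0: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);
    if (prop.major != 6 || prop.minor != 0) {   // assumed target: sm_60 (P100)
        std::printf("Warning: device is sm_%d%d, not the assumed sm_60 build target;\n"
                    "kernels may be JIT-compiled from PTX or fail to launch.\n",
                    prop.major, prop.minor);
    }
    return 0;
}
```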

txbob - yes, I just realized I mixed up a few tests in my head; it did indeed make no difference.

njuffa - re: valgrind: roger. There was no unique output other than what I pasted.

The kernel itself is only a handful of lines and operations, so it should be easy enough to isolate whether, say, it’s the atomic operation, as I suspect.

The code is compiled for sm_60 and run on a P100. I am running the latest driver and toolkit (9.1 as recently released).

I am out of ideas for now.

I assume that the kernel does not include math functions with highly data-dependent execution times, e.g. sin(0) vs. sin(1e30). In any event, that would not explain the significant variability between application runs, because you already checked that input and output are identical in the slow and the fast cases.

Yes. The only thing that is weird about the kernel (i.e. its intentional implementation) - and it is still exactly the same no matter what - is that it proceeds linearly through a 3D array and has switch-cases based on the value of one of the three indices. There is then a non-one-to-one mapping of threads (based on the large input array index) into the smaller output array, which is what is atomically added to. But those details are identical from run to run, i.e., they depend on the indices but not on the actual data.
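
Roughly the shape of it (a stripped-down sketch with made-up names and sizes, not the actual code):

```
// Stripped-down sketch of the kernel shape described above; the names and
// dimensions (NX, NY, NZ, NOUT) are made up for illustration.
#define NX   128
#define NY   128
#define NZ   64
#define NOUT 256

__global__ void reduceByIndex(const double *in, double *out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= NX * NY * NZ) return;

    // One of the three logical indices of the 3D array (k varies fastest here).
    int k = idx % NZ;

    double val = in[idx];

    // The switch-cases depend on an index, never on the data itself.
    switch (k) {
    case 0:  val *= 0.5; break;
    case 1:  val *= 2.0; break;
    default: break;
    }

    // Many-to-one mapping into the smaller output array; double-precision
    // atomicAdd requires sm_60 or later.
    atomicAdd(&out[idx % NOUT], val);
}
```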

Even more fun: while trying to isolate the problem part of the kernel (which I was doing after I wrote the above paragraph), I discovered that the original problem (slower than it should be without nvprof) persists even if I disable the part of the program which calls the problem kernel. So… back to square one?

Actually, now it’s an entirely different kernel which is slowing down over the life of the application - it starts at normal speed and gets slower toward the end. The profiler shows an average time per launch of 4.2364ms, a minimum of 913.21us (from past experience, this is what I believe the average should be), and a maximum of 9.1076ms (which seems to occur at the end of the program, e.g. the last time the kernel is called before the program ends).

So maybe a better, more general question would be: what can cause a kernel to get slower on subsequent calls? The kernel is fairly generic; it’s basically just performing a Runge-Kutta step in the integration of my system of PDEs.
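
For a sense of scale, the kernel is of this general form (a hypothetical sketch of a single explicit RK stage, not my actual code):

```
// Hypothetical sketch of one explicit Runge-Kutta stage: y_next = y + dt * k,
// applied pointwise over the solution array. Not the actual kernel.
__global__ void rkStage(const double *y, const double *k,
                        double *y_next, double dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y_next[i] = y[i] + dt * k[i];
    }
}
```

Each launch does the same amount of work on the same-sized arrays, which is why the growing per-launch time is so puzzling.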

Is the slow-down in this other kernel sufficient to explain the overall app-level performance difference? Some wild speculations:

(1) The GPU is hitting the thermal limit and clocking down temporarily because of that. If you monitor the GPU with nvidia-smi, do you see unusually high temperatures or reports of clock throttling? (A programmatic way to poll the same counters is sketched after item (2).)

(2) Another application (which may or may not be GPU accelerated) is sometimes hammering the host system, or the GPU, or both. Check ps output for such processes, particularly any running at higher priority than your own app (the flimsy hypothesis behind this is not that the kernel itself is getting slower, but that high load interferes with the measurements as well as the host portion of the code).
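
Regarding (1): besides watching nvidia-smi, the same information can be polled programmatically through NVML (a sketch, assuming the NVML headers are installed and the program is linked with -lnvidia-ml):

```
#include <cstdio>
#include <nvml.h>

// Poll GPU temperature and clock-throttle reasons via NVML; this is roughly
// what nvidia-smi reports under "Clocks Throttle Reasons".
int main()
{
    nvmlDevice_t dev;
    unsigned int temp = 0;
    unsigned long long reasons = 0;

    if (nvmlInit() != NVML_SUCCESS) return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS) {
        nvmlDeviceGetTemperature(dev, NVML_TEMPERATURE_GPU, &temp);
        nvmlDeviceGetCurrentClocksThrottleReasons(dev, &reasons);
        std::printf("GPU temperature: %u C\n", temp);
        if (reasons & nvmlClocksThrottleReasonSwPowerCap)
            std::printf("Throttling: SW power cap\n");
        if (reasons & nvmlClocksThrottleReasonHwSlowdown)
            std::printf("Throttling: HW slowdown (thermal or power brake)\n");
    }
    nvmlShutdown();
    return 0;
}
```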

You might also want to check system logs (e.g. dmesg) to see whether any issues are reported, in particular related to the GPU.

Since we seem to be grasping at straws at this point (basically we are now at: random slowdowns affecting random portions of the app), you might want to try swapping in a different GPU to see whether that makes any difference.

Sorry, should’ve been clear - yes, these discrepancies are large enough to explain the overall differences.

Will check on (1). Re: (2) - nothing else is running on this machine except my ssh and screen sessions.

I do have access to a K40 I can test on freely, but don’t have the ability to swap the specific GPU in this system I have been using. I’m hoping to have access to some Volta hardware (so glad they brought FP64 back to the Titan line) or other P100s, but that wouldn’t be for several months at least.

Thanks for grasping at straws with me, it’s much appreciated. I will report back when I have more concrete data.

There has to be a rational reason that explains your observations, and by continuing with the elimination process we should eventually get to the bottom of this. The longest bug hunts I have participated in took about two weeks of full-time work on a single issue, but that was with full access to the hardware and software.

Thermals: the problem application, when running slowly, fluctuates rapidly between ~50 and ~90 watts, holds steady at 35 degrees C, and shows 97-100% volatile GPU utilization. My “normal” programs hold steady at ~145 watts and 95-98% utilization, climbing to a maximum of ~51C.

I’m just grateful that the bug doesn’t seem to be affecting the results; otherwise I would be freaking out. The slowdown is annoying, but I’m not currently scaling to anything large enough to care about it (and - I haven’t checked this recently, but I remember observing it in the past - I think the slowdowns went away when I went up to our “production” problem size).

nvidia-smi -q shows that no clock-throttle reason is active while the program runs, and the performance state is P0.

Okay, tentative discovery: by commenting out progressively less and less of my code, I zeroed in on what may be the source of the issue. The problem went away when I limited the number of concurrent kernels in one part of my code - eight total in two groups of four, each group accessing different segments of the same array. I have concurrent kernels in a similar style elsewhere, so it’s quite unclear to me what effect this could be having. Regardless, as has been evident, it’s been very difficult to make definitive statements in this testing since the behavior is so erratic and (seemingly) non-deterministic. I am hopeful that this is indeed “the” solution, but I am still worried it could be a fluke.
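
For context, the overlapping pattern looks roughly like this (a schematic sketch; the kernel, names, and sizes are placeholders, not my actual code):

```
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real per-segment work.
__global__ void processSegment(double *seg, size_t segLen)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < segLen) seg[i] += 1.0;
}

int main()
{
    const int    NSTREAMS = 8;        // two groups of four in my case
    const size_t SEGLEN   = 1 << 20;  // invented segment size
    const size_t N        = NSTREAMS * SEGLEN;

    double *d_data = nullptr;
    cudaMalloc(&d_data, N * sizeof(double));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamCreate(&streams[s]);

    dim3 threads(256);
    dim3 blocks((unsigned int)((SEGLEN + threads.x - 1) / threads.x));

    // Overlapping version: each launch gets its own stream and a disjoint
    // segment of the same array, so the eight kernels can run concurrently.
    for (int s = 0; s < NSTREAMS; ++s)
        processSegment<<<blocks, threads, 0, streams[s]>>>(d_data + s * SEGLEN, SEGLEN);

    // Less-concurrent variant being tested: the same launches into the
    // default stream, so they run back to back instead of overlapping.
    // for (int s = 0; s < NSTREAMS; ++s)
    //     processSegment<<<blocks, threads>>>(d_data + s * SEGLEN, SEGLEN);

    cudaDeviceSynchronize();
    for (int s = 0; s < NSTREAMS; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_data);
    return 0;
}
```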

Never mind. Slowdown persists when nvprof is not used.

Okay! Got rid of ALL overlapping kernel executions and there are no more slowdowns, with or without nvprof. Crossing my fingers that this persists…

Interesting observation / sleuthing. I have no idea what kind of cause-effect relationship tied to concurrent kernels would lead to these significant performance fluctuations.

Yep, very strange.