Performance Issue with GP100


I am having an issue with migrating from QuadroK4200 to QuadroGP100 and
would like to know if someone has any idea about what is happening.

I am debugging a huge project, it is hard to be very specific about it,
but basically there is a nested loop and I am using a mix of Thrust and
direct kernel calls.

The outter loop has a few vector copies and small kernels and
the inner loop has two kernels (it is a simulation, each one is a gauss
siedel solver). everything using thrust.

At first, the project was compiled with CUDA 8.0 in a machine with K4200.
When I first tried with GP100, the performance was much slower. (about
2x slower)
I recompiled with CUDA 9.1 and updated the driver. After that, it became
slower even on the K4200 (but still faster than GP100)

After profiling, I saw that the thrust calls were all being preceeded (and followed) by several drivers calls that I was unaware of, and their share in the performance was relatively big.

then, I rewrote the inner loops to call the kernels directly. The performance in both the machines improved, and the GP100 is slightly faster, but still something is odd…

when I profile the kernels, this what I get.


the outer loop is delimited by thin purple blocks.

So, what I can’t understand is why the GP100 has those periodic pauses. Each individual kernel is faster than the K4200 one, but the pause is big enought to affect the performance.

Also, I would like to know if is there any issues with Thrust and the new CUDA, or Pascal machines.

I know it is difficult to find a precise answer with just this info, but if someone
could give me directions to what might be the problem, I think I can pickup from there.

thank you.

What is the operating system platform being used? What are the actual performance numbers for the two GPUs? What is the limiting factor for performance according to the CUDA profiler?

Are the K4200 and the GP100 installed in the same (physically identical) system, i.e. this is a controlled experiment in which just the GPU is exchanged, and the entire system, hardware and software, is otherwise identical? If it is not a controlled experiment (only one variable is changed) I would suggest setting that up first to operate from a solid baseline.

With the little information provided the chances of pin-pointing a root cause remotely are likely very slim.

hi njuffa, thanks for your answer

we are running on windows 7 professional on both machines.
the gpus are installed on different machines with different specs, I can not change this… the project was developed for the K4200 and now we are testing the performance with the GP100 and will use the results in consideration to acquire the gpu.
With other softwares, there was no major problems, all of them showed some speed up in the order of 5x to 7x.
This algorithm I am testing was developed in house, it is still just a prototype and I was using thrust, but planning to move from it on later stages.

the limiting factor is memory access, but this was expected already. all of those short kernels are performing quick operations, but because it is not optimized yet, I am reading and writing the global memory repeatedly.

all the data is copied from host to device only once, the range I showed in the screenshots is a small portion after running for several steps. there is no host-device transfer (except for the kernel parameters)

I was suspecting of thrust, because before replacing the inner loops with cuda kernels, i was using thrust only and the GP100 was about 2 times slower!

these are the kind of operations I was doing before.

for (..nsubsteps..) 
 someDeviceVector1 = someOtherDeviceVector1;
 someDeviceVector2 = someOtherDeviceVector2;
 someDeviceVector3 = someOtherDeviceVector3;


and now:

for (...nsubsteps...)

the newer version runs very very slightly faster on the GP100, but far from the huge speed ups we are getting with other softwares.

what I cant figure out is why the gpu seems to be resting randomly during the execution. The profiler is catching nothing, although the runtime lane is always filled. It must be something silly I am missing here.

Each thrust call (even the vector copies) has a bunch of extra runtime calls before and after the main call. the cuda calls have only 2.

using thrust:

cudaLaunch ((actual kernel here))

and using simple cuda calls:

i think this is killing the performance, but i dont understand how that could be causing the long idle times… or maybe the reason is something entirely else.