I am having an issue migrating from a Quadro K4200 to a Quadro GP100 and
would like to know if anyone has an idea of what is happening.
I am debugging a huge project, so it is hard to be very specific about it,
but basically there is a nested loop, and I am using a mix of Thrust and
direct kernel calls.
The outer loop has a few vector copies and small kernels, and
the inner loop has two kernels (it is a simulation; each one is a Gauss-Seidel
solver). Everything uses Thrust.
At first, the project was compiled with CUDA 8.0 on a machine with a K4200.
When I first tried with GP100, the performance was much slower. (about
I recompiled with CUDA 9.1 and updated the driver. After that, it became
slower even on the K4200 (but still faster than on the GP100).
After profiling, I saw that the Thrust calls were all preceded (and followed) by several driver calls that I was unaware of, and their share of the run time was relatively large.
Then I rewrote the inner loops to call the kernels directly. The performance on both machines improved, and the GP100 is now slightly faster, but something is still odd…
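To illustrate the kind of change I mean, here is a simplified sketch, not the project code. `RelaxOp` and `relaxKernel` are hypothetical stand-ins for one solver sweep (a trivial element-wise update, not a real Gauss-Seidel step), and the sizes and iteration counts are arbitrary. The first loop goes through Thrust, the second launches an equivalent kernel directly on the same data.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>

// Hypothetical stand-in for one solver sweep: x[i] = 0.5 * (x[i] + b[i]).
struct RelaxOp {
    __host__ __device__ float operator()(float x, float b) const {
        return 0.5f * (x + b);
    }
};

// Direct-kernel version of the same element-wise update.
__global__ void relaxKernel(float* x, const float* b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = RelaxOp()(x[i], b[i]);
}

int main() {
    const int n = 1 << 20;      // arbitrary problem size
    const int innerIters = 10;  // arbitrary iteration count
    thrust::device_vector<float> x(n, 0.0f), b(n, 1.0f);

    // Variant 1: inner loop through Thrust. Each call is dispatched by the
    // library, which may issue extra runtime/driver calls around the kernel.
    for (int it = 0; it < innerIters; ++it)
        thrust::transform(x.begin(), x.end(), b.begin(), x.begin(), RelaxOp());

    // Variant 2: inner loop with direct launches on the raw device pointers.
    float* xp = thrust::raw_pointer_cast(x.data());
    const float* bp = thrust::raw_pointer_cast(b.data());
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    for (int it = 0; it < innerIters; ++it)
        relaxKernel<<<blocks, threads>>>(xp, bp, n);

    // Wait for the queued kernels and report any launch/runtime error.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    float x0 = x[0];  // copies a single element back to the host
    std::printf("x[0] after both loops: %f\n", x0);
    return 0;
}
```

In the real project the kernels are of course more involved; the point is only that the second variant puts nothing but the kernel launches themselves in the inner loop.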
When I profile the kernels, this is what I get.
The outer loop is delimited by thin purple blocks.
So, what I can’t understand is why the GP100 has those periodic pauses. Each individual kernel is faster than on the K4200, but the pauses are big enough to affect the overall performance.
Also, I would like to know whether there are any known issues with Thrust and the new CUDA version, or with Pascal GPUs.
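One way I could try to narrow down where the pauses come from is to time each outer iteration both on the host and with CUDA events around the inner loop. The following is only a sketch under assumptions: `dummySweep` is a hypothetical placeholder kernel, the device-to-device copy stands in for the outer-loop vector copies, and the sizes and counts are arbitrary.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder for one inner-loop kernel.
__global__ void dummySweep(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = 0.5f * x[i] + 1.0f;
}

// Number of blocks needed to cover n elements.
int blocksFor(int n, int threads) { return (n + threads - 1) / threads; }

int main() {
    const int n = 1 << 20, outerIters = 5, innerIters = 50;  // arbitrary
    const int threads = 256, blocks = blocksFor(n, threads);
    const size_t bytes = n * sizeof(float);

    float *x = nullptr, *backup = nullptr;
    if (cudaMalloc(&x, bytes) != cudaSuccess ||
        cudaMalloc(&backup, bytes) != cudaSuccess) {
        std::printf("allocation failed\n");
        return 1;
    }
    cudaMemset(x, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int o = 0; o < outerIters; ++o) {
        auto h0 = std::chrono::steady_clock::now();

        // Stand-in for the outer-loop vector copies.
        cudaMemcpy(backup, x, bytes, cudaMemcpyDeviceToDevice);

        // Events bracket only the inner loop on the default stream.
        cudaEventRecord(start);
        for (int i = 0; i < innerIters; ++i)
            dummySweep<<<blocks, threads>>>(x, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // block until the inner loop finished

        auto h1 = std::chrono::steady_clock::now();

        float gpuMs = 0.0f;
        cudaEventElapsedTime(&gpuMs, start, stop);
        double hostMs =
            std::chrono::duration<double, std::milli>(h1 - h0).count();
        std::printf("outer %d: host %.3f ms, inner loop (events) %.3f ms\n",
                    o, hostMs, gpuMs);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(backup);
    return 0;
}
```

The event time covers the whole inner loop, including any idle gaps between kernels, so comparing it with the sum of the individual kernel durations from the profiler should give a rough idea of how much of the pause sits inside the inner loop. The difference between the host time and the event time would then point at the outer-loop part.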
I know it is difficult to find a precise answer with just this info, but if someone
could point me toward what the problem might be, I think I can pick it up from there.