My code uses MPI and about 50 kernels that execute in a loop many times. Some of the kernels permit concurrent execution, so I launch them in different streams with appropriate cudaStreamSynchronize calls (a sketch of the launch pattern is below). Without special device flags I get the following behavior (single-process MPI run):
The kernels use varying amounts of local memory (~500-1800 bytes), which gets resized between launches; this serializes the kernels and adds overhead.
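For reference, here is a minimal sketch of the launch pattern; the kernel body, grid sizes, and buffers are placeholders, not my real code:

```cpp
// stream_demo.cu -- minimal sketch of the launch pattern (the kernel body,
// grid sizes, and buffers are placeholders, not my real code).
// Compile with: nvcc -arch=sm_20 stream_demo.cu
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // stand-in for real work
}

int main()
{
    const int n_streams = 12, n = 1 << 20;
    cudaStream_t streams[n_streams];
    float *buf[n_streams];

    for (int s = 0; s < n_streams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
    }

    for (int iter = 0; iter < 100; ++iter) {
        // independent kernels go to different streams so they may overlap
        for (int s = 0; s < n_streams; ++s)
            dummy_kernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
        // synchronize each stream only where a dependency requires it
        for (int s = 0; s < n_streams; ++s)
            cudaStreamSynchronize(streams[s]);
    }

    for (int s = 0; s < n_streams; ++s) {
        cudaFree(buf[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```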
With cudaSetDeviceFlags(cudaDeviceLmemResizeToMax) the kernels do run concurrently, but some of them show large random increases in execution time (marked in red in the profiler timeline). For example, predictor_gpu_w_up jumps from 1.3 ms to 97.4 ms. (How I set the flag is sketched below.)
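The flag is set once per process before the CUDA context is created, i.e. before the first runtime call that actually touches the device; a sketch:

```cpp
#include <cuda_runtime.h>

int main()
{
    cudaSetDevice(0);                               // select device; the context is not created yet
    cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);  // keep local memory at its high-water mark
    cudaFree(0);                                    // force context creation with the flag applied
    // ... rest of the application ...
    return 0;
}
```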
With CUDA_LAUNCH_BLOCKING=1 this increase disappears completely, but overall performance is poor (all launches become synchronous): predictor_gpu_w_up now always takes 0.6-0.7 ms.
When I run the task in multiple MPI processes without cudaDeviceLmemResizeToMax, or with CUDA_LAUNCH_BLOCKING=1, the code works and finishes correctly (with poor performance). But with cudaDeviceLmemResizeToMax and CUDA_LAUNCH_BLOCKING=0 I again get random time increases for some kernels, or the whole program even freezes (at a random iteration of the loop). In the latter case I see via nvidia-smi that one GPU sits at a permanent 99% load while the others show 0%.
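For context, each rank is supposed to use its own GPU; a sketch assuming a simple rank-modulo selection (my real selection code may differ):

```cpp
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n_devices = 0;
    cudaGetDeviceCount(&n_devices);
    cudaSetDevice(rank % n_devices);                // naive one-rank-per-GPU mapping
    cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);  // before the context is created

    // ... compute loop with concurrent kernel launches ...

    MPI_Finalize();
    return 0;
}
```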
I observe this problem on two different clusters with CUDA 5.0 and Intel MPI (4.0, 4.1) or Open MPI (1.5.5).
Cluster 1: Tesla C2050 GPUs, driver 310.40
Cluster 2: Tesla X2070 GPUs, driver 319.17
I tried reducing the number of streams from 12 to 4, but that didn't solve the problem.
When I run the same task several times without nvprof, the total execution time of the main compute loop varies by about 25%, and I still get random freezes in multi-process runs, so I assume the profiler is not the root of the problem.
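Kernel times can also be checked without the profiler using CUDA events recorded on the kernel's own stream; a minimal sketch, with a stub kernel standing in for predictor_gpu_w_up:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void predictor_stub(float *x, int n)  // placeholder for predictor_gpu_w_up
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s;
    cudaStreamCreate(&s);
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, s);
    predictor_stub<<<(n + 255) / 256, 256, 0, s>>>(d, n);
    cudaEventRecord(stop, s);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time between the two events
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}
```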
How can I fix this and get stable kernel timings with concurrent execution?