Random execution times and freezes with concurrent kernels


In my code I use MPI and about 50 kernels executing in a loop many times. Some of them permit concurrent execution, so I launch them in different streams with appropriate cudaStreamSynchronize calls. Without special device flags I get the following (1-process MPI program):
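The launch pattern looks roughly like this (a minimal sketch with hypothetical kernel names kernelA/kernelB/kernelC and array arguments; the real code has ~50 kernels and up to 12 streams): independent kernels go to separate streams, and cudaStreamSynchronize is inserted only where one kernel depends on another's result.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for the ~50 real ones.
__global__ void kernelA(float *a) { /* ... */ }
__global__ void kernelB(float *b) { /* ... */ }
__global__ void kernelC(float *a, const float *b) { /* ... */ }

void compute_loop(float *dA, float *dB, int n_iters, dim3 grid, dim3 block)
{
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) cudaStreamCreate(&s[i]);

    for (int iter = 0; iter < n_iters; ++iter) {
        // kernelA and kernelB are independent -> may run concurrently
        kernelA<<<grid, block, 0, s[0]>>>(dA);
        kernelB<<<grid, block, 0, s[1]>>>(dB);

        // kernelC consumes kernelB's output, so wait on that stream only
        cudaStreamSynchronize(s[1]);
        kernelC<<<grid, block, 0, s[0]>>>(dA, dB);
        cudaStreamSynchronize(s[0]);
    }

    for (int i = 0; i < 2; ++i) cudaStreamDestroy(s[i]);
}
```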

Kernels use varying amounts of local memory (~500-1800 bytes), and resizing the local-memory allocation between launches leads to serialization and additional overhead.

With cudaSetDeviceFlags(cudaDeviceLmemResizeToMax) the kernels run concurrently:
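For reference, this flag has to be set before the CUDA context is created, i.e. before the first runtime call that touches the device. A sketch of how I set it per MPI rank (n_devices and the rank-to-device mapping are specific to my setup):

```cuda
#include <cuda_runtime.h>

// Keep the local-memory allocation at its high-water mark instead of
// resizing (and serializing) when a kernel needs more lmem.
void init_device(int mpi_rank, int n_devices)
{
    cudaSetDeviceFlags(cudaDeviceLmemResizeToMax);
    cudaSetDevice(mpi_rank % n_devices);  // one device per rank on a node
}
```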

but some of them show large random increases in execution time (marked in red).
For example, predictor_gpu_w_up takes anywhere from 1.3 ms to 97.4 ms.

With CUDA_LAUNCH_BLOCKING=1 this increase disappears completely, but overall performance is poor:

Now predictor_gpu_w_up always takes 0.6-0.7 ms.

When I run the task in multiple MPI processes without cudaDeviceLmemResizeToMax, or with CUDA_LAUNCH_BLOCKING=1, the code works and finishes correctly (with poor performance). But with cudaDeviceLmemResizeToMax and CUDA_LAUNCH_BLOCKING=0 I again get random time increases for some kernels, or the whole program even freezes (at a random iteration of the loop). In the latter case I see (via nvidia-smi) that one GPU is under a permanent 99% load while the others are at 0%.

I observe this problem on two different clusters with CUDA 5.0 + Intel MPI (4.0, 4.1) or Open MPI (1.5.5).

Cluster1: GPUs - Tesla C2050, driver 310.40
Cluster2: GPUs - Tesla X2070, driver 319.17

I tried reducing the number of streams from 12 to 4, but it didn't solve the problem.
When I run the same task several times without nvprof, the total execution time of the main compute loop still varies by about 25% and I still get random freezes in multi-process launches, so I suppose the profiler is not the root of the problem.

How can I fix this and get stable kernel timings with concurrent execution?

Update: I reduced the number of streams to 2 (one for computations and another for D <-> H transfers), and the code now runs with stable execution times and without freezes for the multi-process task.
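The two-stream pattern I ended up with, sketched below (kernel names and buffers are hypothetical): all kernels are serialized in one stream, and only D <-> H copies go to the second stream so they can still overlap with computation.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float *a) { /* ... */ }
__global__ void kernelB(float *b) { /* ... */ }

void iterate(float *dA, float *dB, float *hOut, const float *dOut,
             size_t bytes, dim3 grid, dim3 block)
{
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    // All kernels go into one stream -> they execute sequentially...
    kernelA<<<grid, block, 0, compute>>>(dA);
    kernelB<<<grid, block, 0, compute>>>(dB);

    // ...while D->H transfers (of data from the previous iteration)
    // overlap with them in the second stream. hOut must be pinned
    // (cudaMallocHost) for the copy to be truly asynchronous.
    cudaMemcpyAsync(hOut, dOut, bytes, cudaMemcpyDeviceToHost, copy);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(copy);

    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}
```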

But efficiency drops, because the many small kernels (each with only ~1000 threads) now launch sequentially (as with CUDA_LAUNCH_BLOCKING=1).