Error when invoking a CUDA algorithm on two devices simultaneously

I have a CUDA algorithm which works fine when I invoke it either on my Maxwell GPU (GTX 960) or Kepler GPU (GTX 770).

But when I invoke the CUDA algorithm simultaneously on both devices (the first CPU thread calls the CUDA algorithm on GPU 0, the second CPU thread calls it on GPU 1), I get a launch error, always on GPU 0 (the GTX 960), after the first CUDA kernel call (a convert function I wrote myself, which converts a uint8 image to a float image).

What am I doing wrong, and why does this error occur only in this specific setting?

One guess of mine is that when several GPUs are involved, allocated images may get mapped into different GPU memory ranges, and perhaps my convert routine has trouble with images allocated in a certain memory range.
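For context, the launch pattern looks roughly like this (a simplified sketch, not my actual code; `runAlgorithmOnDevice` and the buffer size are placeholders). The key point is that the current device is a per-host-thread setting, so each CPU thread selects its own device first:

```cuda
#include <cuda_runtime.h>
#include <thread>

// Placeholder for the real algorithm; each CPU thread runs this on its own GPU.
void runAlgorithmOnDevice(int device)
{
    // cudaSetDevice is per host thread; if one thread forgets this,
    // both threads end up issuing work to device 0.
    cudaSetDevice(device);

    float* d_img = nullptr;
    cudaMalloc(&d_img, 1024 * 1024 * sizeof(float)); // illustrative size
    // ... launch the uint8-to-float convert kernel and the rest ...
    cudaDeviceSynchronize();
    cudaFree(d_img);
}

int main()
{
    std::thread t0(runAlgorithmOnDevice, 0); // GTX 960
    std::thread t1(runAlgorithmOnDevice, 1); // GTX 770
    t0.join();
    t1.join();
    return 0;
}
```

Each thread gets its own allocations on its own device; nothing is shared between the two GPUs.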

Running the program with the cuda-memcheck command-line tool gives me:

Program hit cudaErrorLaunchFailure (error 4) due to “unspecified launch failure” on CUDA API call to cudaDeviceS
===== Saved host backtrace up to driver entry point at error
===== Host Frame:C:\Windows\system32\nvcuda.dll (cuMemcpy2D_v2_ptds + 0xa3f99e) [0xa6360b]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\cudart64_70.dll (cudaDeviceSynchronize + 0xf9) [0x1a699]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv_Convert + 0x9e57) [0x1d887]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv::Convert + 0x6f) [0x4caf]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv_CreateLKPlan + 0xa45) [0x2afd5]

My system is Windows 7 (64-bit), CUDA Toolkit 7.0, and the latest GeForce drivers.

I remember having had similar inexplicable startup issues when launching work simultaneously on multiple GPUs (in my case it was crypto-coin mining with cudaminer).

My fix was to stagger the initialization phases of the GPU threads, so that CUDA context creation and the memory allocations of the different threads would not overlap.
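Roughly, the idea was something like this (sketch only; `initMutex` and `initDevice` are illustrative names, not from a real codebase):

```cuda
#include <cuda_runtime.h>
#include <mutex>

// Serializes per-thread CUDA initialization so that context creation
// and the initial allocations of different threads never overlap.
std::mutex initMutex;

void initDevice(int device)
{
    std::lock_guard<std::mutex> lock(initMutex); // one thread at a time
    cudaSetDevice(device);
    cudaFree(0); // forces lazy context creation on this device
    // ... do this thread's cudaMalloc calls here, still under the lock ...
}
```

After `initDevice` returns, the threads can run their kernels concurrently; only the setup phase is serialized.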


Hi, thanks; something like that was along my thoughts too, because the error occurs at the very first kernel call on that device (the GTX 960).
I have now added a 'GPU warmup' phase at the beginning of the application, where I force CUDA context creation by doing a cudaMalloc / cudaFree on each device.
Unfortunately, that did not help; I still get the same error.
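For completeness, the warmup I added looks approximately like this (simplified, error checking omitted; `warmupAllDevices` is an illustrative name):

```cuda
#include <cuda_runtime.h>

// Force context creation on every device before the worker threads start.
void warmupAllDevices()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int d = 0; d < deviceCount; ++d)
    {
        cudaSetDevice(d);
        void* p = nullptr;
        cudaMalloc(&p, 1); // first runtime call on a device creates its context
        cudaFree(p);
    }
    cudaSetDevice(0); // restore the default device for the main thread
}
```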

Update: I upgraded my multi-GPU system (GTX 960 / GTX 770) to Windows 10 and still get the same error. Furthermore, I tested the program on two other Windows multi-GPU workstations (a mixed Kepler/Maxwell system with one Titan Black and one Titan X, and a pure Kepler system with two Quadro K6000 cards). The executable failed on both workstations as well.
I filed a bug with NVIDIA.