I have a CUDA algorithm which works fine when I invoke it on either my Maxwell GPU (GTX 960) or my Kepler GPU (GTX 770) alone.
But when I invoke the CUDA algorithm simultaneously on both devices (the first CPU thread calls it on GPU 0, the second CPU thread calls it on GPU 1), I get a launch error, always on GPU 0 (GTX 960), after the first CUDA kernel call (a self-written convert function which converts a uint8 image to a float image).
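To make the setup concrete, here is a minimal sketch of the threading pattern described above. It is not the actual CudaCVCore code: `convertU8ToF32`, `runOnDevice` and `ok` are hypothetical names, and the sketch assumes that each host thread selects its GPU with `cudaSetDevice` before it allocates memory or launches anything, and that it only touches buffers it allocated itself.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the uint8 -> float convert kernel.
__global__ void convertU8ToF32(const unsigned char* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = static_cast<float>(src[i]);
}

// Reports the failing step and device; returns true on success.
static bool ok(cudaError_t err, const char* what, int device)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "GPU %d: %s failed: %s\n",
                 device, what, cudaGetErrorString(err));
    return false;
}

static void runOnDevice(int device, int n)
{
    // The current device is per-host-thread state, so every thread
    // selects its own GPU before any allocation or kernel launch.
    if (!ok(cudaSetDevice(device), "cudaSetDevice", device)) return;

    unsigned char* src = nullptr;
    float* dst = nullptr;
    bool good = ok(cudaMalloc(&src, n), "cudaMalloc(src)", device)
             && ok(cudaMalloc(&dst, n * sizeof(float)), "cudaMalloc(dst)", device)
             && ok(cudaMemset(src, 1, n), "cudaMemset", device);

    if (good) {
        const int block = 256;
        const int grid = (n + block - 1) / block;
        convertU8ToF32<<<grid, block>>>(src, dst, n);
        // cudaGetLastError catches launch-configuration errors,
        // cudaDeviceSynchronize catches errors during kernel execution.
        good = ok(cudaGetLastError(), "kernel launch", device)
            && ok(cudaDeviceSynchronize(), "kernel execution", device);
    }

    cudaFree(dst);
    cudaFree(src);
    if (good) std::printf("GPU %d: convert finished without error\n", device);
}

int main()
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::fprintf(stderr, "no CUDA device found\n");
        return 1;
    }

    // One host thread per GPU, at most two as in the setup above.
    std::vector<std::thread> workers;
    for (int d = 0; d < count && d < 2; ++d)
        workers.emplace_back(runOnDevice, d, 1 << 20);
    for (auto& t : workers) t.join();
    return 0;
}
```

If a sketch like this runs cleanly on both GPUs at once while the real code does not, the difference is probably in state shared between the two threads (for example a buffer, stream or plan created while the other device was current) rather than in the kernel itself.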
What am I doing wrong, and why does this error only occur in that particular setting?
My guess is that when several GPUs are involved, allocated images may get mapped into different GPU memory ranges, and my convert routine may have trouble with images allocated in a certain memory range.
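One way to test this guess would be to ask the runtime which device owns a buffer just before the convert kernel is launched on it. The following is only a sketch under assumptions: `ownedByCurrentDevice` is a hypothetical helper, and it assumes that `cudaPointerGetAttributes` can resolve the pointer on this system (if it cannot, the helper reports the error instead of a device).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns true if the runtime reports that ptr
// belongs to the device that is current in the calling host thread.
static bool ownedByCurrentDevice(const void* ptr, const char* name)
{
    int current = -1;
    if (cudaGetDevice(&current) != cudaSuccess) return false;

    cudaPointerAttributes attr;
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: cudaPointerGetAttributes failed: %s\n",
                     name, cudaGetErrorString(err));
        cudaGetLastError();  // reset the error state left by the failed query
        return false;
    }

    if (attr.device != current) {
        std::fprintf(stderr, "%s: allocated on GPU %d, but GPU %d is current\n",
                     name, attr.device, current);
        return false;
    }
    return true;
}

int main()
{
    float* image = nullptr;
    if (cudaSetDevice(0) != cudaSuccess ||
        cudaMalloc(&image, 1024 * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "setup on GPU 0 failed\n");
        return 1;
    }

    // In the real code this check would sit right before the kernel launch.
    bool same = ownedByCurrentDevice(image, "image");
    std::printf("image owned by current device: %s\n", same ? "yes" : "no");

    cudaFree(image);
    return 0;
}
```

A mismatch reported here would point to a buffer allocated while the other GPU was current, which is a different problem from a kernel that misbehaves for particular address ranges.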
Running the program with the cuda-memcheck command-line tool gives me:
Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
===== Saved host backtrace up to driver entry point at error
===== Host Frame:C:\Windows\system32\nvcuda.dll (cuMemcpy2D_v2_ptds + 0xa3f99e) [0xa6360b]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\cudart64_70.dll (cudaDeviceSynchronize + 0xf9) [0x1a699]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv_Convert + 0x9e57) [0x1d887]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv::Convert + 0x6f) [0x4caf]
===== Host Frame:K:\common\libsUsage\JRSPointTrackerTest1\bin\CudaCVCore3.1_w64_vc120d.dll (cucv_CreateLKPlan + 0xa45) [0x2afd5]
My system is Windows 7 (64-bit) with CUDA Toolkit 7.0 and the latest GeForce drivers.