nvcuda hangs when 3 EXEs copy 10M data from Tesla C1060 at the same moment

Hi

I run 3 identical EXEs that work with a Tesla C1060 simultaneously. From time to time, when all of them copy 10M of data from the device at the same moment, control gets stuck indefinitely somewhere inside nvcuda (see the stack frames below). Has anybody observed something similar? Is it a driver bug or a hardware problem? Windows 7 64-bit, driver 332.59, toolkit 6.


nvcuda.dll!000007fee95ea76c() 	
[Frames below may be incorrect and/or missing, no symbols loaded for nvcuda.dll]	
nvcuda.dll!000007fee962cd4f() 	
nvcuda.dll!000007fee95ccc70() 	
nvcuda.dll!000007fee95863b5() 	
nvcuda.dll!000007fee9586431() 	
nvcuda.dll!000007fee95c9b4a() 	
nvcuda.dll!000007fee95c9e41() 	
nvcuda.dll!000007fee95b1f15() 	
nvcuda.dll!000007fee95b2505() 	
nvcuda.dll!000007fee95af7d1() 	
nvcuda.dll!000007fee9576502() 	
nvcuda.dll!000007fee95780e7() 	
nvcuda.dll!000007fee95577dc() 	
cudart64_55.dll!000007feeb4d47cd() 	
cudart64_55.dll!000007feeb4ecc75() 	
ArNIGPU.exe!bGetNetworkActivityHistory()  Line 227 + 0x2b bytes	C++

ArNIGPU.exe!main(int ARGC=7, char * * ARGV=0x0000000000527f10) Line 873 + 0x5 bytes C++
ArNIGPU.exe!__tmainCRTStartup() Line 555 + 0x19 bytes C


ntdll.dll!RtlEnterCriticalSection()  + 0xc bytes	
nvcuda.dll!000007fee95c8ea0() 	
[Frames below may be incorrect and/or missing, no symbols loaded for nvcuda.dll]	
nvcuda.dll!000007fee95c99c9() 	
nvcuda.dll!000007fee95c9e41() 	
nvcuda.dll!000007fee95b1f15() 	
nvcuda.dll!000007fee95b2505() 	
nvcuda.dll!000007fee95af7d1() 	
nvcuda.dll!000007fee9576502() 	
nvcuda.dll!000007fee95780e7() 	
nvcuda.dll!000007fee95577dc() 	
cudart64_55.dll!000007feeb4d47cd() 	
cudart64_55.dll!000007feeb4ecc75() 	
ArNIGPU.exe!bGetNetworkActivityHistory()  Line 227 + 0x2b bytes	C++

ArNIGPU.exe!main(int ARGC=7, char * * ARGV=0x0000000000707f10) Line 873 + 0x5 bytes C++
ArNIGPU.exe!__tmainCRTStartup() Line 555 + 0x19 bytes C

I have not observed this problem with the GTX 680 installed in the same computer…

Each of the respective threads inside nvcuda uses 100% of one CPU core, so it is not a deadlock…

I determined that the problem is reproduced even when a single copy of the same EXE is run many times, without any concurrency. The bug is reproducible but non-deterministic. It is observed on the Tesla only; the code runs normally on the GTX 680. The CUDA memory checker does not show anything suspicious… It is very strange that cudaMemcpy does not report an error but simply hangs. Could anyone advise how to find out what’s wrong?

Some thoughts on it:

  1. Maybe nvcuda.dll isn’t completely multi-thread/multi-process safe.

or assuming it is:

  1. Maybe it’s using the same CUDA device context for all 3 EXEs.

So the first thing I would do if I were you is try to figure out whether it’s using the same device context.

So my question to you would be basically:

Do the EXEs use the runtime library/API or the driver library/API version of CUDA?

Perhaps some more information about how the copies are done would help.

Is it from device to device, device to CPU, or CPU to device? What are the memory ranges, etc.?

Thank you for your attention.

I can reproduce the problem with a single process, so concurrency is not relevant.

This EXE uses version 6 of the CUDA runtime. Data are copied from device to host using cudaMemcpy.

Wise people told me that it may be caused by the presence of two very different architectures (compute capability 1.3 and 3.0) served by the same driver, so I have simplified the setup: I removed the GTX 680, leaving only the Tesla, and am running the same test…
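A blocking device-to-host cudaMemcpy also waits for kernels issued earlier on the default stream, so a hang reported inside the copy may really be a kernel that never finishes. That is only one possibility and is not confirmed anywhere in this thread. A minimal sketch of how the two cases could be told apart, where the helper names cudaOk and checkedCopyToHost are made up for illustration:

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Returns true on success; otherwise prints which step failed and why.
static bool cudaOk(cudaError_t err, const char* step) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", step, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Hypothetical helper (not from the original program): copies `bytes`
// from device to host, but checks the preceding kernel separately first.
bool checkedCopyToHost(void* hostDst, const void* devSrc, std::size_t bytes) {
    // Launch-time errors from the most recent kernel launch.
    if (!cudaOk(cudaGetLastError(), "kernel launch")) return false;

    // Blocks until all previously issued device work has completed.
    // If the hang moves here, a kernel is not finishing and the copy
    // itself is not the culprit.
    if (!cudaOk(cudaDeviceSynchronize(), "cudaDeviceSynchronize")) return false;

    // Only the transfer itself remains at this point.
    return cudaOk(cudaMemcpy(hostDst, devSrc, bytes, cudaMemcpyDeviceToHost),
                  "cudaMemcpy D2H");
}
```

Replacing the cudaMemcpy call in bGetNetworkActivityHistory with such a helper and breaking into the debugger during a hang would show which of the two calls is on the stack.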

I will report the result…

After removing the GTX 680 the problem remains…

I have no idea what may be wrong. Anyway, even if I made some programming error, it is bad behavior for a runtime function to hang instead of reporting an error…
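One way to get an error back instead of an endless wait is to poll an event with a deadline before calling the blocking copy. The sketch below assumes the work is issued on the default stream; WaitResult and waitForDevice are hypothetical names, and the approach only detects a stall, it does not explain or cure it:

```cuda
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

enum class WaitResult { Done, TimedOut, Failed };

// Hypothetical helper: waits for everything queued so far on the default
// stream, but gives up after timeoutMs instead of blocking forever.
WaitResult waitForDevice(int timeoutMs) {
    cudaEvent_t ev;
    if (cudaEventCreateWithFlags(&ev, cudaEventDisableTiming) != cudaSuccess)
        return WaitResult::Failed;
    if (cudaEventRecord(ev, 0) != cudaSuccess) {
        cudaEventDestroy(ev);
        return WaitResult::Failed;
    }

    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::milliseconds(timeoutMs);
    WaitResult result = WaitResult::TimedOut;
    for (;;) {
        // cudaEventQuery never blocks: it returns cudaSuccess once the
        // event has completed and cudaErrorNotReady while work is pending.
        const cudaError_t st = cudaEventQuery(ev);
        if (st == cudaSuccess) { result = WaitResult::Done; break; }
        if (st != cudaErrorNotReady) { result = WaitResult::Failed; break; }
        if (std::chrono::steady_clock::now() >= deadline) break;
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    cudaEventDestroy(ev);
    return result;
}
```

Calling waitForDevice(10000) right before the cudaMemcpy and logging a TimedOut result would at least let the program report the stall and exit cleanly, and it avoids spinning a CPU core at 100% while waiting.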