I am trying to launch several CUDA processes at the same time (I know they will not actually execute concurrently on the GPU).
What I observe is that when I launch many processes at once (for example, 32), 25 of them run correctly and return the correct result, but the other 7 finish almost immediately and return a wrong result.
I do not expect the 32 CUDA processes to run concurrently or quickly, but why do 7 of the 32 fail to run correctly? Every time I launch the 32 processes concurrently, it is always 7 of them that do not run correctly.
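The launch pattern described above can be sketched roughly as follows. This is only an illustration under assumptions: the helper names `run_once` and `launch_concurrently` are hypothetical, and the command passed in stands for whatever CUDA executable is being tested. It starts `n` copies of a command at once and records each copy's exit code, wall-clock time, and output, so the processes that end early with a wrong result can be told apart from the others.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run one process; return (exit_code, elapsed_seconds, stdout_text)."""
    start = time.monotonic()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, time.monotonic() - start, proc.stdout.strip()


def launch_concurrently(cmd, n):
    """Start n copies of cmd at the same time, one worker thread per process."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves submission order in the returned list.
        return list(pool.map(lambda _: run_once(cmd), range(n)))
```

For example, `launch_concurrently(["./cuda_app"], 32)` (with `./cuda_app` as a placeholder name for the CUDA binary) would return 32 tuples. A non-zero exit code or an unusually short elapsed time would point at the processes that did not run correctly, assuming the program reports CUDA errors through its exit code or output.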
Could you please explain why?
Thanks in advance!