I’ve developed an algorithm using CUDA. It works fine when only one process invokes it.
But when multiple processes use it concurrently, performance drops noticeably, even though GPU occupancy remains low.
I noticed that in CUDA 3.0, kernels from different contexts cannot run concurrently. I think that’s why the performance drops. Is that correct?
So how can I prevent the performance drop with multiple processes? Since the algorithm uses only a few blocks, it would be great if different kernels could run concurrently.
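One possible workaround (a sketch, not a drop-in fix for your setup): funnel the GPU work from all processes into a single process that owns one context, and launch each request into its own stream. Kernels from different streams of the same context can overlap on hardware that supports concurrent kernel execution (compute capability 2.0 and up), whereas kernels from different contexts are serialized. The kernel, buffer sizes, and stream count below are hypothetical stand-ins for your algorithm:

```cuda
#include <cstdio>

// Hypothetical small kernel standing in for the real algorithm's work.
__global__ void smallKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 16;
    const int numStreams = 4;  // e.g. one stream per client request

    float *d_buf[numStreams];
    cudaStream_t streams[numStreams];

    for (int s = 0; s < numStreams; ++s) {
        cudaMalloc(&d_buf[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
    }

    // Kernels launched into different streams of the SAME context may
    // run concurrently on devices of compute capability >= 2.0;
    // kernels from different contexts are time-sliced by the driver.
    for (int s = 0; s < numStreams; ++s) {
        smallKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_buf[s], n);
    }
    cudaThreadSynchronize();  // CUDA 3.0-era name; later renamed cudaDeviceSynchronize

    for (int s = 0; s < numStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_buf[s]);
    }
    return 0;
}
```

The design cost is that the other processes must ship their inputs to this one server process (e.g. over IPC), but in exchange all kernels share one context and small-grid kernels get a chance to overlap.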
PS: why can’t kernels from different contexts run concurrently? Is it a limitation of the hardware design?
Thanks in advance.