I’m working with legacy C code, and I would like to offload the compute-expensive routines/code blocks to the GPU using CUDA. The GPU sits in a workstation with 2 dual-core CPUs. The legacy code is multi-process: its routines run concurrently as separate processes on all 4 cores, and every one of those processes wants to access the GPU. Is there a way to schedule tasks from all 4 cores to run concurrently on the GPU, or otherwise use the GPU optimally?
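To make it concrete, here is a minimal sketch of the setup I have in mind; the kernel, sizes, and grid shape are placeholders for my real routines:

```cuda
// Minimal sketch: four forked workers, one per CPU core, each using the GPU.
// Each worker gets its own CUDA context on device 0; without some
// context-sharing mechanism, I expect their kernels to be time-sliced.
#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

__global__ void busy_work(float *data, int n)   // placeholder for a real routine
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}

static void worker(int rank)
{
    const int n = 1 << 20;
    float *d_data;
    cudaSetDevice(0);                            // all workers share the one GPU
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));
    busy_work<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();
    printf("worker %d done: %s\n", rank, cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_data);
}

int main(void)
{
    for (int rank = 0; rank < 4; ++rank) {       // one process per CPU core
        pid_t pid = fork();
        if (pid == 0) { worker(rank); _exit(0); }
    }
    while (wait(NULL) > 0)                       // reap all four workers
        ;
    return 0;
}
```

I fork first and only touch CUDA inside the children, since as far as I know a CUDA context cannot be carried across a fork(); each worker therefore ends up with its own context on device 0.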
Also worth noting: the CUDA contexts of different threads/processes live in separate address spaces, so data cannot be shared between them on the GPU.
Can I launch the same kernel into different streams from (i) separate processes and (ii) separate threads? Could streams be a way to overlap data transfer and compute across separate processes and/or separate threads?
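Within one process, my understanding of streams is roughly the sketch below: split the data into chunks, issue each chunk’s copy and kernel into its own stream, and let chunk 1’s host-to-device copy overlap chunk 0’s compute. This assumes pinned host memory and a GPU whose copy engine can run alongside kernels; the scale kernel is again a placeholder:

```cuda
// Sketch: overlapping H2D copies with kernel execution using two streams.
// Requires pinned (page-locked) host memory for the async copies to overlap.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale(float *data, int n, float k)  // placeholder kernel
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= k;
}

int main(void)
{
    const int n = 1 << 20, half = n / 2;
    float *h_data, *d_data;
    cudaStream_t s[2];

    cudaMallocHost(&h_data, n * sizeof(float));     // pinned host buffer
    cudaMalloc(&d_data, n * sizeof(float));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    for (int c = 0; c < 2; ++c) {                   // chunk c: copy, then
        float *h = h_data + c * half;               // compute, both in stream c;
        float *d = d_data + c * half;               // chunk 1's copy can overlap
        cudaMemcpyAsync(d, h, half * sizeof(float), // chunk 0's kernel
                        cudaMemcpyHostToDevice, s[c]);
        scale<<<(half + 255) / 256, 256, 0, s[c]>>>(d, half, 2.0f);
    }
    cudaDeviceSynchronize();
    printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

Since a stream belongs to a CUDA context, I assume this overlap is only guaranteed within one context; whether separate processes, each with its own context, can achieve the same overlap is exactly what I am unsure about.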