Multi-GPU job launching - what's a good strategy?

We have a multi-GPU framework in which one can specify ‘jobs’; each job also specifies the GPU on which it is to be executed.
Currently, on startup of the framework, we create one ‘worker thread’ per GPU, which then waits for jobs to process. Specifically, we use the ‘GPUWorker’ class from https://devtalk.nvidia.com/search/more/sitecommentsearch/GPUworker/
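
For readers unfamiliar with the pattern, the one-worker-thread-per-GPU approach can be sketched roughly as below. This is not the actual ‘GPUWorker’ code; the class name ‘DeviceWorker’ and its ‘submit’ method are made up for illustration, and the CUDA call is only indicated in a comment so that the sketch compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical per-device worker: one long-lived thread serves one GPU.
class DeviceWorker {
public:
    explicit DeviceWorker(int device)
        : device_(device), thread_([this] { run(); }) {
        assert(device >= 0);
    }

    // Lets the worker finish all queued jobs, then joins it.
    ~DeviceWorker() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        cv_.notify_one();
        thread_.join();
    }

    // Queues a job; the job receives the device index it runs on.
    void submit(std::function<void(int)> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        cv_.notify_one();
    }

private:
    void run() {
        // Real code would call cudaSetDevice(device_) once here, so the
        // context is set up a single time for the lifetime of the worker.
        for (;;) {
            std::function<void(int)> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                cv_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job(device_);  // jobs on one device run strictly one after another
        }
    }

    int device_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::function<void(int)>> jobs_;
    bool stopping_ = false;
    std::thread thread_;  // declared last so it starts after the other members exist
};
```

Because each worker runs its jobs strictly in sequence, a device is effectively ‘locked’ for the full duration of every job, which is the behaviour described in this post.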

It works nicely so far, but has some serious performance-related disadvantages:

  • In our framework, a specific GPU is locked for the whole duration of a ‘job’, even if the GPU is actually used for only 50% of that time. Note that the jobs have a very coarse granularity, e.g. ‘do optical flow calculation’, which can take 50 to 100 milliseconds.
  • One cannot specify ‘asynchronous’ jobs (e.g. an asynchronous host-to-device copy) that do not lock the GPU.

So I am now thinking about ‘better’ strategies for that problem.
My idea is as follows: for each new job that is ‘launched’, I create a new ‘temporary’ CPU thread. The CPU thread then sets the device number (via ‘cudaSetDevice’) of the GPU on which the work shall be done. I suppose that at this point a CUDA context is also created (transparently for me). After setting the correct device, the CPU thread executes the job's ‘doWork’ function. Depending on whether the job is to run synchronously or asynchronously, a ‘join’ is done (waiting for the CPU thread to complete) or not.
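
To make the idea concrete, here is a minimal sketch of the thread-per-job launch. The names ‘launchJob’ and ‘bindDevice’ are hypothetical; ‘bindDevice’ is an empty stand-in for the place where ‘cudaSetDevice’ would be called, so the sketch runs without a GPU.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <utility>

// Hypothetical stand-in: real code would call cudaSetDevice(device) here.
inline void bindDevice(int device) { (void)device; }

// Runs doWork on a fresh temporary thread bound to `device`.
// Synchronous jobs are joined before returning, so the returned thread is
// no longer joinable. Asynchronous jobs return a joinable thread that the
// caller must join later.
std::thread launchJob(int device, std::function<void(int)> doWork, bool synchronous) {
    assert(device >= 0);
    std::thread t([device, work = std::move(doWork)] {
        bindDevice(device);  // select the GPU for this thread
        work(device);        // the job's doWork function
    });
    if (synchronous) {
        t.join();
    }
    return t;
}
```

Returning the thread handle instead of detaching it is a design choice of this sketch: the caller of an asynchronous job can still wait for it before touching the results.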

I now have several questions:

  • Is that a ‘good’ strategy, or does somebody know of a better way to handle this? Of course, it must be a thread-safe strategy.
  • In my proposed strategy, what is the typical overhead (in milliseconds) of creating the new CPU thread and of the (hidden) creation of the CUDA context? Furthermore, if e.g. the creation of the CUDA context is significant, is there a way (e.g. using the CUDA device API and some sort of ‘context migration’) to reduce this overhead?
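
The thread-creation part of that overhead is easy to measure on the target machine. The helper below (the name ‘averageThreadSpawnMicros’ is made up) is only a rough sketch: it times creating and joining an empty thread and does not include any CUDA context creation. That part would have to be timed separately, for example around ‘cudaSetDevice’ followed by a first runtime call such as ‘cudaFree(0)’.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Average cost, in microseconds, of creating and joining one empty thread.
double averageThreadSpawnMicros(int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        std::thread t([] {});  // thread body does nothing
        t.join();
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::micro>(elapsed).count() / iterations;
}
```

Comparing the number this returns with the 50 to 100 millisecond job duration should show whether the per-job thread itself matters.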

Many thanks for any help on this.

Mapping one CPU thread to one GPU context is the best model, and the input data needs to be independent among the GPUs.

The data from all GPUs then has to be gathered and summed to get the final result.
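
As a rough illustration of that model, the sketch below gives each device its own thread and its own independent slice of the input, then gathers and sums the partial results on the host. The function name ‘sumAcrossDevices’ is hypothetical, and the per-device work is done on the CPU here; in real code each thread would call ‘cudaSetDevice(d)’, copy its slice to the GPU and run a reduction kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// One thread per device; each reduces its own contiguous slice of `data`.
double sumAcrossDevices(const std::vector<double>& data, int deviceCount) {
    assert(deviceCount > 0);
    std::vector<double> partial(deviceCount, 0.0);
    std::vector<std::thread> threads;
    // Slice length, rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + deviceCount - 1) / deviceCount;
    for (int d = 0; d < deviceCount; ++d) {
        threads.emplace_back([&data, &partial, chunk, d] {
            // Real code: cudaSetDevice(d), upload the slice, launch the kernel.
            const std::size_t begin = std::min(data.size(), d * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            // Each thread writes only its own slot, so no locking is needed.
            partial[d] = std::accumulate(data.data() + begin, data.data() + end, 0.0);
        });
    }
    for (auto& t : threads) {
        t.join();  // gather: wait for every device to finish
    }
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

The slices do not overlap, which is the ‘independent input data’ requirement in practice.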