Multiple concurrent device processes using multiple concurrent host threads

I read through 4.5.1.1 of CUDA Programming Guide which says “Several host threads can execute device code on the same device”.

I was thinking: can multiple host threads be used to invoke concurrent kernels on a device? Could this be a way to invoke concurrent "processes" on a GPU (I'm not sure whether kernels invoked by different host threads will actually be executed concurrently)? Has anyone tried something like this?

What are the possible ways of executing multiple concurrent "processes" on a GPU? I know that there is no straightforward CUDA support for invoking concurrent kernels (at least on mainstream GPUs).

Please share your experiences.

Thanks,

Aditi

The driver serializes multiple kernel calls on the same GPU (on currently available hardware), whether they come from multiple host threads or not.

Future hardware may allow this.
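To make the serialization concrete, here is a minimal sketch (the `busy` kernel and all sizes are hypothetical, not from the original posts): two kernels are launched into two different streams, which is the most "concurrent" launch pattern the 2.0 API offers. On current hardware the driver still issues them to the GPU back to back; streams mainly let you overlap asynchronous memcpys with kernel execution, not kernel with kernel.

```cuda
#include <cuda_runtime.h>

// Hypothetical busy-work kernel, just to give the GPU something to do.
__global__ void busy(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = out[i];
        for (int k = 0; k < 1000; ++k)
            x = x * 1.0001f + 0.0001f;
        out[i] = x;
    }
}

int main()
{
    const int n = 1 << 20;
    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, n * sizeof(float));
    cudaMalloc((void **)&d_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Launched into different streams (or equivalently from different
    // host threads), but on current hardware the driver still executes
    // these two kernels one after the other, not concurrently.
    busy<<<256, 256, 0, s0>>>(d_a, n);
    busy<<<256, 256, 0, s1>>>(d_b, n);

    cudaThreadSynchronize();  // CUDA 2.x-era device-wide sync

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```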

I was reading through 4.5.1.5 of the Programming Guide 2.0 (Asynchronous Concurrent Execution using "Streams"). If I create multiple streams and each of them invokes kernels on the device, will these kernels execute in parallel, execute sequentially, or be scheduled in some other way?

What if multiple pthreads on the host simultaneously invoke multiple kernels on the device?

Another question:

I have 16 SMs and 16x8 = 128 SPs in my device. If I invoke a kernel (say, with at least 128 blocks), my understanding is that all the SMs will be involved and one block will be executed per SP. My question is: can I control how many SMs the device uses (say, only 8 SMs, which would mean two blocks scheduled over each SP)?

ps: SM = Streaming Multiprocessor and SP = Streaming (or Thread) Processor.

Thanks,

Aditi

This is incorrect. Blocks are assigned to streaming multiprocessors, not streaming processors. The threads within a block are distributed over the streaming processors within the multiprocessor. An SM can time-slice between multiple blocks, if there are sufficient register and shared-memory resources to do so.

There is no trickery you can do here to get truly concurrent execution of different kernels.
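A small sketch of the mapping described above (the `scale` kernel and all numbers are made up for illustration): whole blocks land on SMs, the threads of each block are spread over that SM's SPs, and there is no API for pinning work to particular SMs.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each block is assigned to one SM as a unit, and
// its threads are distributed over the SPs of that SM.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 128 * 256;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 128 blocks of 256 threads. The hardware hands blocks to SMs as
    // they free up; how many blocks an SM keeps resident at once is
    // limited by its register and shared-memory budget, and there is
    // no way to restrict execution to a subset of the SMs.
    scale<<<128, 256>>>(d_data, 2.0f, n);
    cudaThreadSynchronize();  // CUDA 2.x-era device-wide sync

    cudaFree(d_data);
    return 0;
}
```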