Concurrent kernel execution using streams in multiple CPU threads


I’m new to the concurrent kernel execution technique (supported by Fermi GPUs). I need to use this technique in my project to get a performance gain. The CUDA programming guide mentions that up to 16 kernels belonging to the same context can run concurrently.

Can anyone please clarify my questions regarding concurrent kernel execution?

  1. I need to run the same kernel on different batches of input data, so I’m planning to launch kernels concurrently, each one working on a batch of input data. I understand that the number of kernels that can run concurrently depends on how many resources each kernel uses. I want to know how kernels run concurrently on the same GPU. How does the hardware handle different kernels with different thread configurations running at the same time? Does anyone know how it works internally? Do 16 kernels run concurrently (overlap in the true sense) on the GPU hardware? Practically, how many kernels can execute in parallel?

  2. Second question: my project has multiple host threads. I’m planning to associate a stream with each CPU thread, and each stream will run the kernel on a small batch of the input. The stream associated with the second thread runs on the second batch of the same input, and likewise the kernels in the other streams run on their respective batches. All the host threads and their respective streams use the same context and the same GPU. Is it possible to associate each host thread with a stream ID and run its respective kernel?

It is very important for me to understand this before I use this technique for my project.

It’ll be really helpful if someone can clarify my questions.


  1. At any given time, only one warp of threads is executed concurrently per SM/SMX

  2. Yes

  1. I understand that at a given time, only one warp of threads runs concurrently per SM. I’m asking how multiple kernels launched concurrently run on the same GPU. We know Fermi and Kepler support up to 16 concurrent kernels, and each kernel will have its own thread configuration. How are they scheduled internally on the hardware when they are launched simultaneously?

  2. Are you sure this works, with the same context and the same GPU used by all the CPU threads?

host thread #1 -> stream #1 -> kernel<<<…, stream 1>>>()

host thread #2 -> stream #2 -> kernel<<<…, stream 2>>>()

host thread #3 -> stream #3 -> kernel<<<…, stream 3>>>()

Does this work?

Thanks for your help.

Fermi GPUs can concurrently run kernels with different launch configurations and parameters (gridDim, blockDim, shared memory per block, kernel parameters, …). Some state changes, such as the cache configuration (PREFERRED_SHARED -> PREFERRED_L1), will serialize kernels.

You can easily modify the concurrentKernels SDK sample to show 16 concurrent kernels.
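The pattern in the concurrentKernels sample can be sketched roughly as follows. This is a minimal illustration, not the sample itself: the spin kernel, cycle count, and stream count are all assumptions chosen so that tiny one-block launches have a chance to overlap on Fermi-class hardware.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that busy-waits so concurrent launches can overlap.
__global__ void spinKernel(clock_t cycles)
{
    clock_t start = clock();
    while (clock() - start < cycles) { }
}

int main()
{
    const int NSTREAMS = 16;            // Fermi supports up to 16 concurrent kernels
    cudaStream_t streams[NSTREAMS];

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    // One small launch per stream; each uses only one block, leaving
    // SM resources free so the kernels can run at the same time.
    for (int i = 0; i < NSTREAMS; ++i)
        spinKernel<<<1, 32, 0, streams[i]>>>((clock_t)100000);

    cudaDeviceSynchronize();            // wait for all streams to finish

    for (int i = 0; i < NSTREAMS; ++i)
        cudaStreamDestroy(streams[i]);
    return 0;
}
```

Whether the launches actually overlap can be confirmed with the profiler tools mentioned below; if each kernel used enough blocks to fill the GPU, the launches would serialize in practice even though they sit in different streams.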

There is only a single communication channel per CUDA context to the GPU, so work will execute in the order in which it is launched. The work distributor will distribute all thread blocks from the first kernel before distributing work from the next kernel. Kernel thread blocks will be distributed in the order that they were launched, but the thread blocks can complete out of order.

For more details see the webinar CUDA C/C++ Streams and Concurrency.

This is supported in CUDA 4.0 and above. It was possible to do this pre-4.0 using the CUDA Driver API, but the interface was not as clean.
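A sketch of the host-thread-per-stream pattern described in the question, assuming CUDA 4.0+ where all host threads share the same context by default. The kernel, batch size, and pthread plumbing are illustrative assumptions, not code from the SDK samples.

```cuda
#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical per-batch kernel: each host thread processes one batch.
__global__ void batchKernel(float *batch, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        batch[i] *= 2.0f;               // placeholder per-element work
}

struct BatchArg {
    float *d_batch;                     // device pointer to this thread's batch
    int    n;
};

// Each host thread creates its own stream in the shared context and
// launches the kernel on its own batch of input.
static void *worker(void *p)
{
    BatchArg *arg = (BatchArg *)p;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    int threads = 256;
    int blocks  = (arg->n + threads - 1) / threads;
    batchKernel<<<blocks, threads, 0, stream>>>(arg->d_batch, arg->n);

    cudaStreamSynchronize(stream);      // wait only for this thread's work
    cudaStreamDestroy(stream);
    return NULL;
}

int main()
{
    const int NTHREADS = 3;
    const int N        = 1 << 20;       // illustrative batch size
    pthread_t threads[NTHREADS];
    BatchArg  args[NTHREADS];

    for (int t = 0; t < NTHREADS; ++t) {
        cudaMalloc((void **)&args[t].d_batch, N * sizeof(float));
        args[t].n = N;
        pthread_create(&threads[t], NULL, worker, &args[t]);
    }
    for (int t = 0; t < NTHREADS; ++t)
        pthread_join(threads[t], NULL);

    for (int t = 0; t < NTHREADS; ++t)
        cudaFree(args[t].d_batch);
    return 0;
}
```

Note that because there is a single channel to the GPU per context, the kernels from the three threads are still distributed in launch order; whether they overlap depends on how much of the GPU each launch occupies.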

Parallel Nsight 2.0 and above, Nsight Visual Studio Edition 2.2, and the Visual Profiler 5.0 preview support Fermi concurrent kernel trace as well as displaying API call trace from multiple host threads. I recommend that you run the multiThreaded and concurrentKernels SDK samples in these tools to get a better understanding of what is supported.

I hope you find this information useful.

This statement is not true on either Fermi or Kepler.

Then what is true now? Two warps for Fermi and four warps for Kepler?

For streams please watch this webinar

I didn’t answer the question as it was not relevant to your primary questions.

CC 2.x (Fermi) devices have two warp schedulers per SM.

On CC 2.0 devices each scheduler will pick an eligible warp on each cycle and issue 1 instruction on the warp.

On CC 2.1 devices each scheduler will pick an eligible warp on each cycle and issue 1 or 2 instructions on the warp.

CC 3.x (Kepler) devices have 4 warp schedulers per SM.

On CC 3.x devices each scheduler will pick an eligible warp on each cycle and issue 1 or 2 instructions on the warp.

The instruction issue rate depends on the number of eligible warps, the instruction mix of the kernel, and the amount of instruction level parallelism. The maximum IPC, eligible warps, active warps, and stall reasons are available in the Nsight Visual Studio Edition CUDA Profiler. Some of these stats are available in the CUDA profiler.

The SM datapaths are pipelined allowing a large number of instructions and memory operations to be in flight at any time.

Thank you for the info. It was really helpful.