Concurrent kernel execution using streams in multiple CPU threads

Hi,

I’m new to the concurrent kernel execution technique (supported by Fermi GPUs). I need to use this technique in my project to get some performance gain. The CUDA programming guide mentions that up to 16 kernels belonging to the same context can run concurrently.

Can anyone please clarify my questions regarding concurrent kernel execution?

  1. I need to run the same kernel on different batches of input data, so I’m planning to launch kernels concurrently, each one working on a batch of input data. I understand the number of kernels that can run concurrently depends on how much of the GPU’s resources each kernel uses. I want to know how kernels run concurrently on the same GPU. How does the hardware handle different kernels with different thread configurations running at the same time? Does anyone know how it works internally? Do 16 kernels really run concurrently (overlap in the true sense) on the GPU hardware? Practically, how many kernels can execute in parallel?

  2. Second question: my project has multiple host threads. I’m planning to associate a stream with each CPU thread, and each stream will run the kernel on a small batch of the input. The stream associated with the second thread runs on the second batch of the same input, and likewise the kernels in the other streams run on their respective batches. All the host threads and their respective streams use the same context and the same GPU. Is it possible to associate each host thread with a stream ID and run its respective kernel?

It is very important for me to understand this before I use this technique for my project.

It’ll be really helpful if someone can clarify my questions.

Thanks

  1. At any given time, only one warp of threads is executed concurrently per SM/SMX.

  2. Yes

  1. I understand that at a given time, only one warp of threads runs concurrently per SM. I’m asking how multiple kernels launched concurrently run on the same GPU. We know Fermi and Kepler support up to 16 concurrent kernels, and each kernel will have its own thread configuration. How are they scheduled internally on the hardware when they are launched simultaneously?

  2. Are you sure this works? The same context and the same GPU used by all the CPU threads:

host thread #1 → stream #1 → kernel<<<…, stream 1>>>()

host thread #2 → stream #2 → kernel<<<…, stream 2>>>()

host thread #3 → stream #3 → kernel<<<…, stream 3>>>()

Does this work?

Thanks for your help.

Fermi GPUs can concurrently run launches with different launch configurations and parameters (gridDim, blockDim, shared memory per block, kernel parameters, …). Some state changes, such as the cache configuration (PREFERRED_SHARED → PREFERRED_L1), will serialize kernels.

You can easily modify the concurrentKernels SDK sample to show 16 concurrent kernels.

There is only a single communication channel per CUDA context to the GPU so work will execute in the order that it is launched. The work distributor will distribute all thread blocks from the first kernel before distributing work from the next kernel. Kernel thread blocks will be distributed in the order that they were launched but the thread blocks can complete out of order.
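To make that concrete, here is a minimal sketch along the lines of the concurrentKernels sample: the same kernel is launched into 16 streams so the work distributor can overlap the launches, resources permitting. batchKernel and the sizes here are placeholders of my own, not code from the SDK sample.

```cpp
#include <cuda_runtime.h>

// Placeholder kernel (my own, for illustration); any independent,
// low-resource kernel could stand in here.
__global__ void batchKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int nStreams = 16;       // Fermi supports up to 16 concurrent kernels
    const int n        = 1 << 16;  // elements per batch (illustrative)

    cudaStream_t streams[nStreams];
    float *d_data[nStreams];

    for (int s = 0; s < nStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&d_data[s], n * sizeof(float));
    }

    // One launch per stream; each launch is asynchronous, and the work
    // distributor may overlap them if enough resources are free.
    for (int s = 0; s < nStreams; ++s)
        batchKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);

    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) {
        cudaFree(d_data[s]);
        cudaStreamDestroy(streams[s]);
    }
    return 0;
}
```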

For more details see the webinar CUDA C/C++ Streams and Concurrency.

This is supported in CUDA 4.0 and above. It was possible to do this before CUDA 4.0 using the CUDA Driver API, but the interface was not as clean.
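As a rough sketch of the thread-per-stream pattern asked about above (not a definitive implementation): each host thread sets the device, creates its own stream, and launches its kernel into it. batchKernel and worker are placeholder names of my own, and std::thread is just one way to spawn the host threads; with the CUDA 4.0+ runtime, all of these threads share the same context on device 0.

```cpp
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel (my own, for illustration): fills one batch with a value.
__global__ void batchKernel(float *data, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = value;
}

// Each host thread creates its own stream and launches its kernel into it.
// With the CUDA 4.0+ runtime, all host threads share one context per device,
// so every stream here belongs to the same context on the same GPU.
void worker(int batch, int n)
{
    cudaSetDevice(0);                  // all threads use the same GPU

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_batch;
    cudaMalloc(&d_batch, n * sizeof(float));

    batchKernel<<<(n + 255) / 256, 256, 0, stream>>>(d_batch, n, (float)batch);

    cudaStreamSynchronize(stream);
    cudaFree(d_batch);
    cudaStreamDestroy(stream);
}

int main()
{
    const int nThreads = 3;
    const int n        = 1 << 16;

    std::vector<std::thread> threads;
    for (int t = 0; t < nThreads; ++t)
        threads.emplace_back(worker, t, n);
    for (auto &th : threads)
        th.join();
    return 0;
}
```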

Parallel Nsight 2.0 and above, Nsight Visual Studio Edition 2.2, and the Visual Profiler 5.0 preview support Fermi concurrent kernel trace as well as displaying the API call trace from multiple host threads. I recommend that you run the multiThreaded and concurrentKernels SDK samples in these tools to get a better understanding of what is supported.

I hope you find this information useful.

This statement is not true on either Fermi or Kepler.

Then what is true now? Two warps for Fermi and four warps for Kepler??

For streams, please watch this webinar.

I didn’t answer the question as it was not relevant to your primary questions.

CC 2.x (Fermi) devices have two warp schedulers per SM.

On CC 2.0 devices, each scheduler picks an eligible warp on each cycle and issues 1 instruction for that warp.

On CC 2.1 devices, each scheduler picks an eligible warp on each cycle and issues 1 or 2 instructions for that warp.

CC 3.x (Kepler) devices have four warp schedulers per SM.

On CC 3.x devices, each scheduler picks an eligible warp on each cycle and issues 1 or 2 instructions for that warp.

The instruction issue rate depends on the number of eligible warps, the instruction mix of the kernel, and the amount of instruction-level parallelism. The maximum IPC, eligible warps, active warps, and stall reasons are available in the Nsight Visual Studio Edition CUDA Profiler. Some of these statistics are available in the CUDA profiler.

The SM datapaths are pipelined allowing a large number of instructions and memory operations to be in flight at any time.
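If it helps, a quick device-properties query (sketched below, using cudaGetDeviceProperties) will report the compute capability, SM count, warp size, and whether the device supports concurrent kernels, so you can see which of the cases above applies to your GPU.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device: %s\n", prop.name);
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("Multiprocessors (SMs): %d\n", prop.multiProcessorCount);
    printf("Warp size: %d\n", prop.warpSize);
    printf("Concurrent kernels supported: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```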

Thank you for the info. It was really helpful.