Is it possible to execute two kernels concurrently?

Hi Experts,

      I am calling two kernels from two different host threads. Assume that neither kernel uses all the cores on the GPU. If the two kernels are launched at the same time, what happens? Will they execute concurrently, or will one kernel wait for the other to complete?

Only GF100 devices can run more than one kernel at once, and only when those kernels are in the same CUDA context.

If you’re using the default runtime calling method, then two host threads will create two different CUDA contexts, and kernels from different contexts cannot run simultaneously.

Rajannaaaa…,

i.e. the compute capability of your device needs to be 2.0

1.3 - TESLA 10 series
2.0 - TESLA 20 series (aka FERMI)

You can run the “deviceQuery” executable that comes along with CUDA Toolkit to query your device capability.

Or you can use the “cudaGetDeviceProperties” CUDA runtime API to get that info. HTH
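For reference, a minimal sketch of the cudaGetDeviceProperties approach (querying device index 0 is my assumption; the rest is the standard runtime API):

```cuda
// Sketch: query compute capability and the concurrentKernels flag
// for device 0 via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("concurrentKernels: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```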

No, if you call them from different threads (different contexts), they will always run one after the other, never at the same time.

With Fermi, if you launch them from the same context using different streams, they can run at the same time (up to 14 kernels, if memory serves).

With older devices they will always run one after the other.
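To make the “same context, different streams” point concrete, here is a hedged sketch (busyKernel and the sizes are illustrative, not from this thread). On GF100-class hardware the launches below may overlap; on older devices they serialize:

```cuda
// Sketch: launch the same kernel into several streams from one context.
// busyKernel is an illustrative stand-in for any short-running kernel.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)   // artificial work
            v = v * 1.0000001f + 0.0001f;
        data[i] = v;
    }
}

int main() {
    const int nStreams = 8, n = 1 << 16;
    float *d_data[nStreams];
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&d_data[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
    }
    // One launch per stream, back to back, so the hardware is free
    // to overlap them if it supports concurrent kernels.
    for (int s = 0; s < nStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_data[s]);
    }
    return 0;
}
```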

K. Thank you guys… :)

The limit is 4 concurrent kernels with CUDA 3.0 and 16 with the CUDA 3.1 beta.

With CUDA 3.1 (under Mac OS X 10.6.4) you can run kernels in parallel even on non-GF100 hardware.
I use a GeForce 9600M GT in my laptop, and it could run up to 4 kernels simultaneously; you can check this with the concurrentKernels example in the SDK.

[i][concurrentKernels] - Starting…

CUDA Device GeForce 9600M GT has 4 Multi-Processors
CUDA Device GeForce 9600M GT is capable of concurrent kernel execution

All 8 kernels together took 0.753s
(~0.094s per kernel * 8 kernels = ~0.751s if no concurrent execution)

Cleaning up…[/i]

*** ADDED ***
On the GeForce 9400M in the same computer, kernel execution time is far worse, as if concurrent execution weren’t really supported by the hardware, even though concurrentKernels states that it is:

[i][concurrentKernels] - Starting…

CUDA Device GeForce 9400M has 2 Multi-Processors
CUDA Device GeForce 9400M is capable of concurrent kernel execution

All 8 kernels together took 1.635s
(~0.104s per kernel * 8 kernels = ~0.828s if no concurrent execution)

Cleaning up…[/i]

I have to investigate the concurrentKernels code further, because launching concurrent kernels on the GPU is a hot topic for me :)

Sounds to me like it’s incorrectly reporting concurrent kernel support. Concurrency should only be possible with Fermi due to hardware limitations. My GT240, for example, reports no concurrent kernel support.

Hmm… this sounds like a driver reporting bug. Concurrent kernels are a Fermi feature, as others have said on this thread.

As for how to know whether you’re actually getting concurrency or not, look at the timings…

0.753s is how long it took 8 kernels to run back-to-back. 0.094s is how long it took to run a single kernel. 0.094 * 8 == 0.751, and 0.753 is pretty close to 0.751, meaning that the kernels must have run one after the other rather than concurrently. And since GeForce 9600M does not support concurrent kernels (the incorrect reporting notwithstanding), I can guarantee that this is exactly what’s happening: no concurrency.

If you were actually getting concurrency, the time for all 8 kernels together should be LESS than the per-kernel time multiplied by the number of kernels.

–Cliff

There’s no indication in the output of how many threads (or warps) are launched for these kernels, so how can you tell whether these are 8 kernels running sequentially, or 8 kernels running in parallel, each one launching multiple warps on each SM?

I would have to read the source code, and it’s troublesome not to have a direct, simple message saying “sequential run” or “parallel run”…

Because I wrote the test and know how it works. :) Of course this also means that any blame for its lack of helpfulness falls on me…

You don’t need to know how many warps are launched for each kernel – just know that they’re relatively short-running kernels that could hypothetically benefit from some concurrency if such concurrency is supported by the hardware and driver.

Going back to the output lines I quoted earlier:

As I mentioned before, the “All 8 kernels together” line is reporting the elapsed time of something like this pseudocode (call it “Test A”):

start timer
for (i = 1 to 8) {
    run kernel in stream i
}
sync all
stop timer

The second line, “X s per kernel” is (separately) measuring a single run of the kernel like this (call it “Test B”):

start timer
run kernel
sync all
stop timer

If the 8 kernels in Test A had executed sequentially, the time it would have taken all 8 to run would be approximately equal to the time for Test B multiplied by 8. Hence 0.753 would be approximately equal to 0.751, which it is. So your Test A kernels ran sequentially. (The reason the times aren’t exactly equal is the small amount of driver overhead between kernel launches in Test A that isn’t accounted for by Test B. In that case the Test A time should be slightly higher than 8× the Test B time, which it is.)

If the 8 kernels in Test A had been able to run with any degree of concurrency, what we’d see instead is that the Test A time would be substantially less than the Test B time multiplied by 8, meaning that you can run 8 kernels in separate streams faster than you can run 8 kernels sequentially – which can only happen if the kernels in Test A showed some degree of concurrency.

Does that help?

It’s a little tricky to detect this with certainty due to slight timing variations – but I guess I could give it a try. I’ll make myself a note to work on that for a future version of the CUDA SDK.

–Cliff
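The sequential-vs-concurrent check described above could be sketched like this (an illustrative sketch, not actual SDK code: the spin kernel, the cycle count, and the 10% margin are my own assumptions):

```cuda
// Sketch: time one kernel alone (Test B), then N kernels in separate
// streams (Test A), and compare with a tolerance to classify the run.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(clock_t cycles) {
    clock_t start = clock();
    while (clock() - start < cycles) {}  // busy-wait on the GPU
}

// Time nStreams launches of `spin`, one per stream, using CUDA events.
static float timeKernels(int nStreams, clock_t cycles) {
    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    for (int s = 0; s < nStreams; ++s)
        spin<<<1, 1, 0, streams[s]>>>(cycles);
    cudaEventRecord(t1, 0);          // legacy stream 0: waits for all streams
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    delete[] streams;
    return ms / 1000.0f;             // seconds
}

int main() {
    const int n = 8;
    const clock_t cycles = 100000000;   // roughly 0.1 s of spinning
    float tB = timeKernels(1, cycles);  // Test B: one kernel
    float tA = timeKernels(n, cycles);  // Test A: n kernels, one per stream
    // Sequential execution: tA ~= n * tB (slightly above, due to launch
    // overhead). Real overlap pulls tA well below n * tB; the 0.9 factor
    // is an arbitrary margin to absorb timing noise.
    if (tA < 0.9f * n * tB)
        printf("concurrent run: %.3fs < %d * %.3fs\n", tA, n, tB);
    else
        printf("sequential run: %.3fs ~= %d * %.3fs\n", tA, n, tB);
    return 0;
}
```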

Thanks Cliff, it’s always interesting and enlightening to get information from the developer himself!

So if these kernels taking the same total time under a concurrent launch means they are executing sequentially, how do you explain that on the GeForce 9400M the time roughly doubled? This is what actually troubles me…
The time should have been the same or similar on the 9400M across the two runs, especially if it DOESN’T SUPPORT parallel execution. A 2X increase in execution time looks more like the effect of kernels running in parallel and struggling for resources (diminishing efficiency), such as memory (in this case, DDR3 main memory).

Anyway, it’s really interesting to know for sure that even when the driver reports concurrent kernel execution, only GT200 and later offer true support.

Not GT200… GF100 (aka Fermi).

This part I actually can’t explain off the top of my head. I’d have to run it through the Visual Profiler to figure out what’s going on there. It’s on my to-do list.

–Cliff

Thanks Cliff!

I am working on a little open-source CUDA benchmark and plan to include a concurrent kernel execution test :shifty: