Is it possible to execute two kernels concurrently?

Hi Experts,

      I am calling two kernels from two different host threads. Assume that neither kernel uses all the cores on the GPU. If the two kernels are launched at the same time, what happens? Will they execute concurrently, or will one kernel wait for the other to complete?

Only GF100 devices can run more than one kernel at once, and only when those kernels are in the same CUDA context.

If you’re using the default runtime calling method, then two host threads will create two different CUDA contexts, and kernels from different contexts cannot run simultaneously.

Rajannaaaa…,

i.e. the compute capability of your device needs to be 2.0

1.3 - TESLA 10 series
2.0 - TESLA 20 series (aka FERMI)

You can run the “deviceQuery” executable that comes along with CUDA Toolkit to query your device capability.

Or you can use the “cudaGetDeviceProperties” CUDA runtime API to get that info. HTH
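For reference, a minimal sketch of the cudaGetDeviceProperties approach (querying device index 0 is my assumption; the rest is the standard runtime API):

```cuda
// Sketch: query compute capability and the concurrentKernels flag
// for device 0 via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("Device: %s, compute capability %d.%d\n",
           prop.name, prop.major, prop.minor);
    printf("concurrentKernels: %s\n",
           prop.concurrentKernels ? "yes" : "no");
    return 0;
}
```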

No, if you call them from different threads (different contexts), they will always run one after the other, never at the same time.

With Fermi, if you launch them from the same context using different streams, they can run at the same time (up to 14 kernels, if memory serves).

With older devices they will always run one after the other.
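To make the “same context, different streams” point concrete, here is a hedged sketch (busyKernel and the sizes are illustrative, not from this thread). On GF100-class hardware the launches below may overlap; on older devices they serialize:

```cuda
// Sketch: launch the same kernel into several streams from one context.
// busyKernel is an illustrative stand-in for any short-running kernel.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k)   // artificial work
            v = v * 1.0000001f + 0.0001f;
        data[i] = v;
    }
}

int main() {
    const int nStreams = 8, n = 1 << 16;
    float *d_data[nStreams];
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) {
        cudaMalloc(&d_data[s], n * sizeof(float));
        cudaStreamCreate(&streams[s]);
    }
    // One launch per stream, back to back, so the hardware is free
    // to overlap them if it supports concurrent kernels.
    for (int s = 0; s < nStreams; ++s)
        busyKernel<<<(n + 255) / 256, 256, 0, streams[s]>>>(d_data[s], n);
    cudaDeviceSynchronize();
    for (int s = 0; s < nStreams; ++s) {
        cudaStreamDestroy(streams[s]);
        cudaFree(d_data[s]);
    }
    return 0;
}
```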

K. Thank you guys… :)

The limit is 4 concurrent kernels with CUDA 3.0 and 16 with the CUDA 3.1 beta.

With CUDA 3.1 (under Mac OS X 10.6.4) you can run kernels in parallel even on non-GF100 hardware.
I use a GeForce 9600M GT in my laptop, and it could run up to 4 kernels simultaneously; you can check this with the concurrentKernels example in the SDK.

[i][concurrentKernels] - Starting…

CUDA Device GeForce 9600M GT has 4 Multi-Processors
CUDA Device GeForce 9600M GT is capable of concurrent kernel execution

All 8 kernels together took 0.753s
(~0.094s per kernel * 8 kernels = ~0.751s if no concurrent execution)

Cleaning up…[/i]

*** ADDED ***
On the GeForce 9400M in the same computer, kernel execution time is far worse, as if concurrent execution weren’t really supported by the hardware, even though concurrentKernels states that it is:

[i][concurrentKernels] - Starting…

CUDA Device GeForce 9400M has 2 Multi-Processors
CUDA Device GeForce 9400M is capable of concurrent kernel execution

All 8 kernels together took 1.635s
(~0.104s per kernel * 8 kernels = ~0.828s if no concurrent execution)

Cleaning up…[/i]

I have to investigate the concurrentKernels code further, because launching concurrent kernels on the GPU is a hot topic for me :)

Sounds to me like it’s incorrectly reporting concurrent kernel support. Concurrency should only be possible with Fermi due to hardware limitations. My GT240, for example, reports no concurrent kernel support.

Hmm… this sounds like a driver reporting bug. Concurrent kernels are a Fermi feature, as others have said on this thread.

As for how to know whether you’re actually getting concurrency or not, look at the timings…

0.753s is how long it took 8 kernels to run back-to-back. 0.094s is how long it took to run a single kernel. 0.094 * 8 == 0.751, and 0.753 is pretty close to 0.751, meaning that the kernels must have run one after the other rather than concurrently. And since GeForce 9600M does not support concurrent kernels (the incorrect reporting notwithstanding), I can guarantee that this is exactly what’s happening: no concurrency.

If you were actually getting concurrency, the time for all 8 kernels together should be LESS than the per-kernel time multiplied by the number of kernels.

–Cliff

There’s no indication in the output of how many threads (or warps) are launched for these kernels, so how can you tell whether these are 8 kernels running sequentially, or 8 kernels running in parallel, each one launching multiple warps on each SM?

I would have to read the source code, and it’s troublesome not to have a direct, simple message saying “sequential run” or “parallel run”…

Because I wrote the test and know how it works. :) Of course this also means that any blame for its lack of helpfulness falls on me…

You don’t need to know how many warps are launched for each kernel – just know that they’re relatively short-running kernels that could hypothetically benefit from some concurrency if such concurrency is supported by the hardware and driver.

Going back to the output lines I quoted earlier:

As I mentioned before, the “All 8 kernels together” line is reporting the elapsed time of something like this pseudocode (call it “Test A”):

start timer
for (i = 1 to 8) {
    run kernel in stream i
}
sync all
stop timer

The second line, “X s per kernel” is (separately) measuring a single run of the kernel like this (call it “Test B”):

start timer
run kernel
sync all
stop timer

If the 8 kernels in Test A had executed sequentially, the time it would have taken all 8 to run would be approximately equal to the time for Test B multiplied by 8. Hence 0.753 would be approximately equal to 0.751, which it is. So your Test A kernels ran sequentially. (The reason the times aren’t exactly equal is the small amount of driver overhead between kernel launches in Test A that isn’t accounted for by Test B. In that case the Test A time should be slightly higher than 8× the Test B time, which it is.)

If the 8 kernels in Test A had been able to run with any degree of concurrency, what we’d see instead is that the Test A time would be substantially less than the Test B time multiplied by 8, meaning that you can run 8 kernels in separate streams faster than you can run 8 kernels sequentially – which can only happen if the kernels in Test A showed some degree of concurrency.

Does that help?

It’s a little tricky to detect this with certainty due to slight timing variations – but I guess I could give it a try. I’ll make myself a note to work on that for a future version of the CUDA SDK.

–Cliff
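The sequential-vs-concurrent check described above could be sketched like this (an illustrative sketch, not actual SDK code: the spin kernel, the cycle count, and the 10% margin are my own assumptions):

```cuda
// Sketch: time one kernel alone (Test B), then N kernels in separate
// streams (Test A), and compare with a tolerance to classify the run.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void spin(clock_t cycles) {
    clock_t start = clock();
    while (clock() - start < cycles) {}  // busy-wait on the GPU
}

// Time nStreams launches of `spin`, one per stream, using CUDA events.
static float timeKernels(int nStreams, clock_t cycles) {
    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0, 0);
    for (int s = 0; s < nStreams; ++s)
        spin<<<1, 1, 0, streams[s]>>>(cycles);
    cudaEventRecord(t1, 0);          // legacy stream 0: waits for all streams
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaEventDestroy(t0); cudaEventDestroy(t1);
    delete[] streams;
    return ms / 1000.0f;             // seconds
}

int main() {
    const int n = 8;
    const clock_t cycles = 100000000;   // roughly 0.1 s of spinning
    float tB = timeKernels(1, cycles);  // Test B: one kernel
    float tA = timeKernels(n, cycles);  // Test A: n kernels, one per stream
    // Sequential execution: tA ~= n * tB (slightly above, due to launch
    // overhead). Real overlap pulls tA well below n * tB; the 0.9 factor
    // is an arbitrary margin to absorb timing noise.
    if (tA < 0.9f * n * tB)
        printf("concurrent run: %.3fs < %d * %.3fs\n", tA, n, tB);
    else
        printf("sequential run: %.3fs ~= %d * %.3fs\n", tA, n, tB);
    return 0;
}
```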

Thanks Cliff, it’s always interesting and enlightening to get information from the developer himself!

So if these kernels taking the same total time under a concurrent launch means they are executing sequentially, how do you explain that on the GeForce 9400M the time roughly doubled? This is what actually troubles me…
The time should have been the same or similar on the 9400M across the two runs, especially if it DOESN’T SUPPORT parallel execution. A 2X increase in execution time looks more like the effect of kernels running in parallel and struggling for resources (diminishing efficiency), such as memory (in this case, DDR3 main memory).

Anyway, it’s really interesting to know for sure that even when the driver reports concurrent kernel execution, only GT200 and later offer true support.

Not GT200… GF100 (aka Fermi).

This part I actually can’t explain off the top of my head. I’d have to run it through the Visual Profiler to figure out what’s going on there. It’s on my to-do list.

–Cliff

Thanks Cliff!

I am working on a little open-source CUDA benchmark and plan to include a concurrent kernel execution test :shifty: