Is it possible to execute kernels in parallel?

pramodsub · January 5, 2008, 9:21am

Hi everybody,

The CUDA documentation says: “Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.” Now my question is this: If I have two streams, will the kernel executions on the GPU from each stream be serialized?

The CUDA documentation clearly says that data transfer and kernel execution can happen in parallel on some GPUs. But I want to know if 2 kernels can execute in parallel, or are they always serialized? My guess is that they are serialized, but I’d appreciate insight on this from somebody more knowledgeable than me.

Thanks,
Pramod

nwilt · January 5, 2008, 11:51am

Yes, currently they are serialized.

pramodsub · January 5, 2008, 2:18pm

Thanks for the reply.

I’m curious to know - is this something that is likely to change in the near future?

I am working on a library that will make multiple kernel calls. Right now, the implementation serializes the invocations by using a task queue. One concern is whether this is likely to cause a performance hit if in the future the GPU actually becomes capable of running two kernels in parallel. My opinion is that it shouldn’t matter, because only the invocation is serialized by the queue and there is still much scope for the kernels to execute in parallel, but then again, I am a biased party because I wrote the queue and don’t want to get rid of it. :)

Do you have any thoughts on this?

Thanks again,

Pramod

nwilt · January 6, 2008, 2:11am

I wouldn’t lose sleep over it - there is a difference between a “performance hit” and “performance left on the table.” A “performance hit” is when a new release goes slower than its predecessor - this tends to be bad for adoption (see early press on Pentium Pro due to performance regressions running 16-bit code… Intel fixed this in the Pentium II). “Performance left on the table” is an opportunity for you to deliver performance improvements in a software release :-).

If we enable kernels to execute in parallel, it would be through the streams abstraction - kernels launched in separate streams may execute in parallel on a future driver release. Currently they are executed sequentially in the order they are presented to the context. So if you want to future-proof your code, architect it so you can launch the kernels in separate streams.

It is very difficult even for the most well-meaning developers to future-proof their apps against this type of behavior change, so for compatibility we may have to make apps opt into new concurrency behaviors. (One possible use of the flags word in cuStreamCreate.)
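A minimal sketch of the "separate streams" layout described above, assuming the CUDA runtime API rather than the driver API (cuStreamCreate) mentioned in this post. The kernels scaleKernel and offsetKernel and the helper launchInSeparateStreams are hypothetical names used only for illustration. Whether the two launches actually overlap depends on the hardware and driver; the code only makes overlap possible.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernels standing in for two independent pieces of work.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

__global__ void offsetKernel(float *data, int n, float offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offset;
}

// Launches the two kernels in separate streams on separate device buffers.
// Assumes n > 0. Returns true if every checked CUDA call succeeded.
bool launchInSeparateStreams(float *d_a, float *d_b, int n) {
    cudaStream_t s0, s1;
    if (cudaStreamCreate(&s0) != cudaSuccess) return false;
    if (cudaStreamCreate(&s1) != cudaSuccess) {
        cudaStreamDestroy(s0);
        return false;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;

    // Third launch parameter: dynamic shared memory bytes (none here).
    // Fourth launch parameter: the stream the kernel is queued on.
    scaleKernel<<<blocks, threads, 0, s0>>>(d_a, n, 2.0f);
    offsetKernel<<<blocks, threads, 0, s1>>>(d_b, n, 1.0f);

    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaStreamSynchronize(s0) == cudaSuccess) && ok;
    ok = (cudaStreamSynchronize(s1) == cudaSuccess) && ok;

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return ok;
}
```

The two kernels touch different buffers, so the result is the same whether the launches are serialized or overlapped. That independence is what keeps code structured this way safe if the execution behavior changes.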

pramodsub · January 6, 2008, 5:03am

Thanks a lot! This is very useful advice. :)

Sarnath · January 7, 2008, 6:20am

Another idea, suggested by MisterAnderson a long time ago, is to execute different code based on “blockIdx”. For example:

if (blockIdx.x > gridDim.x/2)
{
    ....... perform some operation .....
} else {
    ...... perform something else .......
}

This gives the effect of kernels executing concurrently.

An even better idea would be to branch on odd and even block numbers.
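A rough sketch of the odd/even variant, under stated assumptions: the two tasks (doubling one array, negating another) and the names fusedOddEven and launchFusedOddEven are hypothetical placeholders, not code from this thread. Because the branch depends only on blockIdx, every thread in a block takes the same path, so the branch causes no divergence within a warp.

```cuda
#include <cuda_runtime.h>

// Even-numbered blocks run task A (double a[i]);
// odd-numbered blocks run task B (negate b[i]).
__global__ void fusedOddEven(float *a, int nA, float *b, int nB) {
    int task = blockIdx.x % 2;          // 0 = task A, 1 = task B
    int logicalBlock = blockIdx.x / 2;  // block index within its own task
    int i = logicalBlock * blockDim.x + threadIdx.x;

    if (task == 0) {
        if (i < nA) a[i] *= 2.0f;
    } else {
        if (i < nB) b[i] = -b[i];
    }
}

// Launches enough blocks for the larger task and interleaves both tasks.
// Assumes at least one of nA, nB is greater than zero.
bool launchFusedOddEven(float *d_a, int nA, float *d_b, int nB) {
    int threads = 128;
    int blocksA = (nA + threads - 1) / threads;
    int blocksB = (nB + threads - 1) / threads;
    int perTask = blocksA > blocksB ? blocksA : blocksB;

    fusedOddEven<<<2 * perTask, threads>>>(d_a, nA, d_b, nB);
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

Interleaving by parity spreads both tasks across the whole grid, whereas splitting at gridDim.x/2 puts all of one task's blocks first. If the tasks differ in size, the smaller task's surplus blocks simply fail the bounds check and exit.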

pramodsub · January 8, 2008, 5:01pm

The big problem I see with this approach is that the kernels are now very tightly coupled. Having said that, I do think it is a very interesting idea and I am going to look for opportunities to play this trick. :)

Sarnath · January 9, 2008, 2:56am

True. But you can write each of your kernels as a separate non-global device function and call them appropriately from your global kernel.

CUDA inlines function calls. Thus, this will at least separate the kernel functions, making them more readable. I don’t know if you can declare shared variables in these non-global functions. Hopefully you should be able to.

BUT your shared-memory usage will become the SUM of the shared-memory usage of both kernels. You may NOT want this. There is a solution, though: the shared-memory variables can be made a UNION of the shared-memory requirements of the different kernels that you use. It is very tricky. You can definitely do it, but readability would be in shambles. :-) If you go the UNION route, you need to declare the big UNION inside the GLOBAL kernel and pass the appropriate structure to the non-global kernel function as an argument. That would probably improve readability. If you do this at all, I would recommend going over your PTX code to see what code the compiler has generated.

The same applies to your register usage. Unfortunately, I don’t think you can avoid that, unless CUDA compilers are smart enough.

The shared-memory (if you don’t use unions) and register constraints can bring down your CUDA occupancy and MAYBE your efficiency too.
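The union approach can be sketched roughly as follows. Everything here is an illustrative assumption rather than code from this thread: the two tasks (a per-block sum and a per-block reversal), the type names SumShared, ReverseShared and TaskShared, and the simplification that both inputs hold exactly blocksPerTask * BLOCK elements.

```cuda
#include <cuda_runtime.h>

#define BLOCK 128  // threads per block; must be a power of two for the reduction

// Hypothetical per-task shared-memory layouts.
struct SumShared     { float partial[BLOCK]; };
struct ReverseShared { int tile[BLOCK]; };

// The union takes the size of the larger member, not the sum of both.
union TaskShared {
    SumShared sum;
    ReverseShared rev;
};

// Task A: sum one BLOCK-sized chunk of `in` into out[block].
__device__ void blockSum(SumShared &sh, const float *in, float *out, int block) {
    int t = threadIdx.x;
    sh.partial[t] = in[block * BLOCK + t];
    __syncthreads();
    for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (t < stride) sh.partial[t] += sh.partial[t + stride];
        __syncthreads();
    }
    if (t == 0) out[block] = sh.partial[0];
}

// Task B: reverse one BLOCK-sized chunk of `data` in place.
__device__ void blockReverse(ReverseShared &sh, int *data, int block) {
    int t = threadIdx.x;
    sh.tile[t] = data[block * BLOCK + t];
    __syncthreads();
    data[block * BLOCK + t] = sh.tile[BLOCK - 1 - t];
}

// The union is declared once in the __global__ kernel and the relevant
// member is passed to whichever device function this block runs.
__global__ void fusedUnionKernel(const float *in, float *sums, int *data) {
    __shared__ TaskShared sh;
    int block = blockIdx.x / 2;
    if (blockIdx.x % 2 == 0) {
        blockSum(sh.sum, in, sums, block);
    } else {
        blockReverse(sh.rev, data, block);
    }
}

// Assumes blocksPerTask > 0 and inputs sized blocksPerTask * BLOCK.
bool launchFusedUnion(const float *d_in, float *d_sums, int *d_data,
                      int blocksPerTask) {
    fusedUnionKernel<<<2 * blocksPerTask, BLOCK>>>(d_in, d_sums, d_data);
    return cudaDeviceSynchronize() == cudaSuccess;
}
```

The __syncthreads() calls inside the device functions are safe here only because the branch is uniform across each block. Register usage is not helped by the union: the fused kernel is still compiled as a single kernel, so the register concern raised above remains.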

Anyway, Good Luck!

alexao · February 5, 2009, 2:36pm

I am trying to work on the same kind of problem: I want to be able to give different SMs different tasks to work on. The only way I have found to do this is to assign different thread blocks to different device functions that are called from one big kernel, as Sarnath suggested.

pramodsub: I am wondering if you have been able to make any progress? So far I have only tested very static launches of kernels and compared them to linear launches; the linear launches always seem to perform faster.

pramodsub · February 6, 2009, 11:22am

I was asking out of curiosity rather than a need to actually execute kernels in parallel. So, I haven’t made any progress. However, a colleague of mine has implemented a system where they’re software-pipelining algorithms using the trick that Sarnath suggested. It’s not quite what you’re looking for, but might be a useful trick to know …
