The CUDA documentation says: “Different streams, on the other hand, may execute their operations out of order with respect to one another or concurrently.” Now my question is this: If I have two streams, will the kernel executions on the GPU from each stream be serialized?
The CUDA documentation clearly says that data transfer and kernel execution can happen in parallel on some GPUs. But I want to know if 2 kernels can execute in parallel, or are they always serialized? My guess is that they are serialized, but I’d appreciate insight on this from somebody more knowledgeable than me.
I’m curious to know - is this something that is likely to change in the near future?
I am working on a library that will make multiple kernel calls. Right now, the implementation serializes the invocations by using a task queue. One concern is whether this is likely to cause a performance hit if, in the future, the GPU actually becomes capable of running two kernels in parallel. My opinion is that it shouldn’t matter, because only the invocation is serialized by the queue and there is still plenty of scope for the kernels to execute in parallel, but then again, I am a biased party because I wrote the queue and don’t want to get rid of it. :)
I wouldn’t lose sleep over it - there is a difference between a “performance hit” and “performance left on the table.” A “performance hit” is when a new release goes slower than its predecessor - this tends to be bad for adoption (see early press on Pentium Pro due to performance regressions running 16-bit code… Intel fixed this in the Pentium II). “Performance left on the table” is an opportunity for you to deliver performance improvements in a software release :-).
If we enable kernels to execute in parallel, it would be through the streams abstraction - kernels launched in separate streams may execute in parallel on a future driver release. Currently they are executed sequentially in the order they are presented to the context. So if you want to future-proof your code, architect it so you can launch the kernels in separate streams.
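A minimal sketch of what "launch the kernels in separate streams" could look like. It uses the runtime API rather than the driver API, and the two kernels (scaleKernel, offsetKernel) are hypothetical stand-ins for a library's real work. Whether the two launches actually overlap is up to the driver and hardware; the code only avoids forcing them into a single stream.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernels; they touch different buffers, so they are independent.
__global__ void scaleKernel(float *data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

__global__ void offsetKernel(float *data, int n, float offset) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offset;
}

int main() {
    const int n = 1 << 20;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    // One stream per independent piece of work.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // The fourth launch parameter selects the stream. Operations within one
    // stream stay ordered; no ordering is requested between s0 and s1.
    scaleKernel<<<blocks, threads, 0, s0>>>(a, n, 2.0f);
    offsetKernel<<<blocks, threads, 0, s1>>>(b, n, 1.0f);

    // Wait for both streams before using the results.
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

A task queue on the host can still feed this: the queue decides the order of the launch calls, while the stream argument tells the driver which launches are independent of each other.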
It is very difficult even for the most well-meaning developers to future-proof their apps against this type of behavior change, so for compatibility we may have to make apps opt into new concurrency behaviors. (One possible use of the flags word in cuStreamCreate.)
The big problem I see with this approach is that the kernels are now very tightly coupled. Having said that, I do think it is a very interesting idea and I am going to look for opportunities to play this trick. :)
True. But you can write each of your kernels as a separate non-global device function and call them appropriately from your global kernel.
CUDA inlines function calls. Thus, this will at least separate the kernel functions, making them more readable. I don’t know if you can declare shared variables inside these non-global functions. Hopefully you should be able to.
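A rough sketch of this trick, assuming two independent tasks and a split of the grid chosen by the caller. All names here (taskA, taskB, fusedKernel, blocksForA) are made up for illustration. Each task lives in its own __device__ function, and the __global__ kernel picks one based on the block index, so every thread in a block takes the same branch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical task A: doubles every element, striding over its share of blocks.
__device__ void taskA(float *x, int n, int block, int numBlocks) {
    for (int i = block * blockDim.x + threadIdx.x; i < n; i += numBlocks * blockDim.x)
        x[i] *= 2.0f;
}

// Hypothetical task B: adds one to every element, same striding scheme.
__device__ void taskB(float *y, int n, int block, int numBlocks) {
    for (int i = block * blockDim.x + threadIdx.x; i < n; i += numBlocks * blockDim.x)
        y[i] += 1.0f;
}

// The first blocksForA blocks run task A, the remaining blocks run task B.
// Block indices are renumbered so each task sees blocks 0..numBlocks-1.
__global__ void fusedKernel(float *a, int na, float *b, int nb, int blocksForA) {
    if (blockIdx.x < blocksForA)
        taskA(a, na, blockIdx.x, blocksForA);
    else
        taskB(b, nb, blockIdx.x - blocksForA, gridDim.x - blocksForA);
}

int main() {
    const int n = 1 << 16;
    float *a = nullptr, *b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(b, 0, n * sizeof(float));

    const int blocksForA = 4, blocksForB = 4;
    fusedKernel<<<blocksForA + blocksForB, 256>>>(a, n, b, n, blocksForA);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Which SM a given block lands on is still decided by the hardware scheduler; the split only controls which blocks do which task.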
BUT your shared-memory usage will become the SUM of the shared-memory usage of both kernels. You may NOT want this. There is a solution, though: the shared-memory variables can be made a UNION of the shared-memory requirements of the different kernels that you use. It is very tricky. You can definitely make it work, but readability would be in shambles. :-) If you go the UNION route, you need to declare the big UNION inside the GLOBAL kernel and pass the appropriate member to each non-global kernel function as an argument. That would probably improve readability. If you do this at all, I would recommend going over your PTX code to see what code the compiler has generated.
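Here is one way the UNION idea might look, as a sketch only. The scratch structs, task functions, and sizes are invented for illustration. The union is declared once in the global kernel, so the block reserves only as much shared memory as the larger member needs, and each device function receives just the member it works with.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 128

// Hypothetical per-task shared-memory layouts of different sizes.
struct ScratchA { float vals[THREADS]; };
struct ScratchB { int vals[2 * THREADS]; };

// The union is as large as its largest member, not the sum of both.
union Scratch { ScratchA a; ScratchB b; };
static_assert(sizeof(Scratch) == sizeof(ScratchB), "union takes the larger size");

// Task A: reverses THREADS floats through shared memory.
__device__ void taskA(ScratchA &s, const float *in, float *out) {
    s.vals[threadIdx.x] = in[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = s.vals[THREADS - 1 - threadIdx.x];
}

// Task B: loads 2*THREADS ints and writes the sum of each adjacent pair.
__device__ void taskB(ScratchB &s, const int *in, int *out) {
    s.vals[threadIdx.x] = in[threadIdx.x];
    s.vals[threadIdx.x + THREADS] = in[threadIdx.x + THREADS];
    __syncthreads();
    out[threadIdx.x] = s.vals[2 * threadIdx.x] + s.vals[2 * threadIdx.x + 1];
}

// Block 0 runs task A, block 1 runs task B. The branch is uniform within a
// block, so the __syncthreads() calls inside the tasks are reached by all
// threads of that block.
__global__ void fusedKernel(const float *inA, float *outA, const int *inB, int *outB) {
    __shared__ Scratch scratch;
    if (blockIdx.x == 0)
        taskA(scratch.a, inA, outA);
    else
        taskB(scratch.b, inB, outB);
}

int main() {
    float *inA = nullptr, *outA = nullptr;
    int *inB = nullptr, *outB = nullptr;
    cudaMalloc(&inA, THREADS * sizeof(float));
    cudaMalloc(&outA, THREADS * sizeof(float));
    cudaMalloc(&inB, 2 * THREADS * sizeof(int));
    cudaMalloc(&outB, THREADS * sizeof(int));
    cudaMemset(inA, 0, THREADS * sizeof(float));
    cudaMemset(inB, 0, 2 * THREADS * sizeof(int));

    fusedKernel<<<2, THREADS>>>(inA, outA, inB, outB);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(inA); cudaFree(outA); cudaFree(inB); cudaFree(outB);
    return 0;
}
```

Compiling with nvcc's --ptxas-options=-v flag reports the shared-memory and register usage per kernel, which is a quicker first check than reading the PTX.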
The same goes for your register usage. Unfortunately, I don’t think you can avoid that one, unless the CUDA compilers are smart enough.
The shared-memory (if you don’t use unions) and register constraints can bring down your CUDA occupancy and maybe your efficiency too.
I am trying to work on the same kind of problem: I want to be able to give different SMs different tasks to work on. The only way I have found to do this is to hand different thread blocks to different device functions that are called from one big kernel, as Sarnath suggested.
pramodsub: So I am wondering if you have been able to make any progress? So far I have only tested very static launches of kernels and compared them to linear launches; the linear launches always seem to perform faster.
I was asking out of curiosity rather than a need to actually execute kernels in parallel. So, I haven’t made any progress. However, a colleague of mine has implemented a system where they’re software-pipelining algorithms using the trick that Sarnath suggested. It’s not quite what you’re looking for, but might be a useful trick to know …