In the docs for cuda::memcpy_async there are versions that “issue the copy in the current thread” and versions that “cooperatively issue the copy across all threads in a group.” I’d like to understand why both exist. When would one want to use one or the other? Thanks
Do you know what it means when threads cooperate?
I could have a single thread fill an array with a value.
I could also have 32 threads cooperate to fill an array with a value.
On a GPU, the 32 threads will do that job quicker than the single thread will.
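As a hypothetical sketch of that fill example (the array size, value, and variable names are all made up for illustration), the single-thread and 32-thread versions might look like this inside a kernel:

```cuda
__shared__ int buf[1024];

// Single thread does all the work: one thread walks the whole array.
if (threadIdx.x == 0) {
    for (int i = 0; i < 1024; ++i) buf[i] = 42;
}
__syncthreads();

// 32 threads (one warp) cooperate: each thread handles a strided slice,
// so the 1024 stores are spread across the warp.
if (threadIdx.x < 32) {
    for (int i = threadIdx.x; i < 1024; i += 32) buf[i] = 42;
}
__syncthreads();
```

The cooperative version finishes sooner because the GPU can issue the warp's 32 stores per step, rather than one store per step from a lone thread.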
I think it is probable that the cooperative version of that function would be more efficient, in some cases, although the hardware async mechanism (cc8.0 and above) is perhaps not well enough described to immediately make that determination.
But my kernel design may not always have 32 threads available to work together. If I need to get something done, and I only have a single thread available to do it, then having only the (perhaps more efficient) cooperative version would prevent me from taking advantage of anything else that function offers.
You could imagine a producer-consumer model, where one warp is the producer, and one or several warps are the consumer.
You could also imagine a producer-consumer model where one thread is the producer, and one or several threads are the consumer.
In that latter case, having only the cooperative, (perhaps more efficient) version would be a limiting factor; I might not be able to use it.
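To make the single-thread-producer case concrete, here is a hedged sketch (the kernel name, sizes, and the doubling "consume" step are invented for illustration) of the per-thread `cuda::memcpy_async` overload paired with a `cuda::barrier`, which is the pattern the libcu++ docs describe for this variant:

```cuda
#include <cuda/barrier>

// Hypothetical sketch: one thread acts as the producer, issuing the
// async copy; every thread in the block is a consumer.
__global__ void producer_consumer(const int* global_in, int* global_out) {
    __shared__ int smem[256];
    __shared__ cuda::barrier<cuda::thread_scope_block> bar;

    if (threadIdx.x == 0) {
        init(&bar, blockDim.x);   // all block threads participate in the barrier
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        // Per-thread variant: a single producer thread issues the copy.
        // Completion of the copy is tied to the barrier phase.
        cuda::memcpy_async(smem, global_in, sizeof(smem), bar);
    }
    bar.arrive_and_wait();        // consumers block until the copy has landed

    // Consume the staged data (placeholder work).
    global_out[threadIdx.x] = smem[threadIdx.x] * 2;
}
```

The cooperative overload instead takes a group argument (e.g. `cooperative_groups::this_thread_block()`) as its first parameter, so every thread in the group helps issue the copy; in the one-producer design above, that overload would not fit.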
Could a factor be the distinction between TMA (which is issued by a single thread) and older asynchronous copy features?
It seems like it. TMA applies to cc9.0 and above, so that might flip any of my previous statements around, depending on which architecture you are running on. TMA might be more efficient than the other alternatives. Such things are difficult to make blanket statements about; it often depends on what else is going on in your code.