Is there a way to put an eligible warp into a stalled/sleeping/suspended state? By eligible, I mean a warp that can issue an instruction.
I have an application where the warps need to wait for a flag (which is unique for each warp) before they can proceed. One naive solution is to busy-wait on the flag, but this wastes GPU resources that could otherwise be used by other warps for useful work. Thus, I am looking for a way to stall the warp, re-checking periodically, until its flag is set.
Thanks allanmac! Looks like an interesting way for inter-warp synchronization.
Unfortunately, I forgot to mention that the flags are set by the host. Under this assumption, do you still believe that I can use the “bar”? Isn’t bar more suitable for inter-warp synchronization?
Still, I think I may have a simple idea for using the bars to implement synchronization between the CPU and the GPU. The idea is that warp1 waits on an event (using __bar_sync) that will be triggered by warp2. This way, warp2 is responsible for checking warp1’s flag and “awakening” it (using __bar_arrive) when warp1’s flag is set by the host.
I have not put much thought into this though, and I must admit that I have not read the paper yet. But I would appreciate your opinion!
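The warp1/warp2 idea above could be sketched roughly as follows. This is only an illustration under several assumptions: the wrappers `named_bar_sync` and `named_bar_arrive` are hypothetical names around the PTX `bar.sync` and `bar.arrive` instructions (CUDA does not provide them as intrinsics), the flag lives in mapped page-locked memory, the block has exactly two warps, and `__shfl_sync` needs CUDA 9 or later (older toolkits would use `__shfl`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrappers around the PTX named-barrier instructions.
__device__ __forceinline__ void named_bar_sync(unsigned id, unsigned nthreads) {
    asm volatile("bar.sync %0, %1;" :: "r"(id), "r"(nthreads) : "memory");
}
__device__ __forceinline__ void named_bar_arrive(unsigned id, unsigned nthreads) {
    asm volatile("bar.arrive %0, %1;" :: "r"(id), "r"(nthreads) : "memory");
}

__global__ void worker_and_watcher(volatile int *host_flag, int *out) {
    const unsigned warp = threadIdx.x / 32;
    const unsigned lane = threadIdx.x % 32;
    const unsigned kBarrier = 1;   // barrier 0 is the one __syncthreads() uses
    const unsigned kThreads = 64;  // 32 waiting threads + 32 arriving threads

    if (warp == 1) {
        // Watcher warp: lane 0 polls the host-written flag for the whole warp.
        int ready = 0;
        do {
            if (lane == 0) ready = *host_flag;
            ready = __shfl_sync(0xffffffffu, ready, 0);
        } while (ready == 0);
        // Mark arrival without blocking; this releases the waiting warp.
        named_bar_arrive(kBarrier, kThreads);
    } else {
        // Worker warp: blocked at the barrier, so it issues no instructions
        // until the watcher warp arrives.
        named_bar_sync(kBarrier, kThreads);
        if (lane == 0) *out = 42;  // stand-in for the real work
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    int *flag_h = nullptr, *out_h = nullptr;
    cudaHostAlloc((void **)&flag_h, sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&out_h, sizeof(int), cudaHostAllocMapped);
    *flag_h = 0;
    *out_h = 0;

    int *flag_d = nullptr, *out_d = nullptr;
    cudaHostGetDevicePointer((void **)&flag_d, flag_h, 0);
    cudaHostGetDevicePointer((void **)&out_d, out_h, 0);

    worker_and_watcher<<<1, 64>>>(flag_d, out_d);  // two warps in one block
    *(volatile int *)flag_h = 1;                   // host sets the flag
    cudaDeviceSynchronize();
    printf("out = %d\n", *out_h);

    cudaFreeHost(flag_h);
    cudaFreeHost(out_h);
    return 0;
}
```

Note that the watcher warp still busy-waits in this sketch; the scheme only moves the polling out of the worker warp, so whether it pays off depends on how many workers one watcher can serve.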
If your host is driving processing then I would first determine if a simple kernel launch is good enough for your application. Kernel launch overhead and latency are quite low (for most purposes).
Were you thinking of keeping your kernels resident and having them monitor page-locked or PCIe-mapped memory? If so, then you should take a look at __threadfence_system().
You should convince yourself that a non-terminating kernel is actually a win over a more mundane approach.
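For the resident-kernel case mentioned above, here is a minimal sketch of where __threadfence_system() would typically go when a kernel exchanges flags and data with the host through mapped page-locked memory. The `Mailbox` struct, its field names, and the doubling "work" are all made up for illustration. The host-side spin assumes the kernel really starts running right after the launch, which may not hold on every driver model.

```cuda
#include <atomic>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical mailbox placed in mapped page-locked host memory.
struct Mailbox {
    int request;  // host -> device: 1 when `input` is valid
    int done;     // device -> host: 1 when `result` is valid
    int input;
    int result;
};

__global__ void resident_kernel(Mailbox *mb) {
    volatile int *request = &mb->request;
    volatile int *done = &mb->done;

    while (*request == 0) { }    // poll the flag written by the host
    __threadfence_system();      // keep the read of `input` after the flag read

    mb->result = mb->input * 2;  // stand-in for real packet processing

    __threadfence_system();      // make `result` visible before `done`
    *done = 1;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    Mailbox *mb_h = nullptr, *mb_d = nullptr;
    cudaHostAlloc((void **)&mb_h, sizeof(Mailbox), cudaHostAllocMapped);
    *mb_h = Mailbox{0, 0, 0, 0};
    cudaHostGetDevicePointer((void **)&mb_d, mb_h, 0);

    resident_kernel<<<1, 1>>>(mb_d);
    cudaStreamQuery(0);  // nudge drivers that batch launches to submit now

    // Host publishes the input first, then raises the request flag.
    mb_h->input = 21;
    std::atomic_thread_fence(std::memory_order_release);
    *(volatile int *)&mb_h->request = 1;

    // Host waits for the device's completion flag, then reads the result.
    while (*(volatile int *)&mb_h->done == 0) { }
    std::atomic_thread_fence(std::memory_order_acquire);
    printf("result = %d\n", mb_h->result);

    cudaDeviceSynchronize();
    cudaFreeHost(mb_h);
    return 0;
}
```

The fence does not make the warp wait for the flag; it only orders the surrounding memory accesses, so a polling loop (or some other wait mechanism) is still needed.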
Any busy loop that includes a device or system memory access is going to spend most of its time waiting on one of the internal hardware barriers. Other warps not in that busy loop will be free to continue processing. This is how spinlocks are designed to operate with the atomic.cas/exch instructions.
Yes allanmac, I indeed want to make my kernel non-terminating! And the reason is that the kernel launch latency is too high for my application, which is a type of packet processing application.
Also, as far as I understand __threadfence_system(), it is used to avoid race conditions, i.e., to ensure that all memory operations are completed before the thread continues its execution. However, in my case, I don’t need this kind of memory synchronization between the threads. What I need is a barrier on a specific memory location (the flag) that needs to be set before the warp can proceed - which obviously does not exist.
And BTW, my application targets a Jetson dev kit with an integrated GPU and unified CPU-GPU memory, and hence I use zero-copy memory.
scottgray, do you mean that busy-waiting on a memory access is not that costly after all, since most of the time the warp is stalled waiting for the memory operation to complete?
That’s exactly right. If possible, you want to minimize the bandwidth consumed, so have only one thread make the request for the warp. And you can of course make that one request also serve as the signal to exit the loop.
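A minimal sketch of that polling loop, under some assumptions: the per-warp flags sit in mapped (zero-copy) page-locked memory, a nonzero flag value means "go", and the helper name `wait_for_flag` and the constant `kWarps` are made up for illustration. `__shfl_sync` requires CUDA 9 or later; older toolkits would use `__shfl` instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarps = 4;  // illustrative; few enough that all warps are resident

// Only lane 0 issues the memory request. The loaded value is broadcast to the
// other lanes and doubles as the loop-exit signal.
__device__ int wait_for_flag(volatile int *flag) {
    const unsigned lane = threadIdx.x % 32;
    int value = 0;
    do {
        if (lane == 0) value = *flag;                // one request per warp
        value = __shfl_sync(0xffffffffu, value, 0);  // share it with all lanes
    } while (value == 0);
    return value;
}

__global__ void per_warp_wait(volatile int *flags, int *out) {
    const int warp = threadIdx.x / 32;
    const int token = wait_for_flag(&flags[warp]);  // each warp has its own flag
    if (threadIdx.x % 32 == 0) out[warp] = token;   // stand-in for real work
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);
    int *flags_h = nullptr, *out_h = nullptr;
    cudaHostAlloc((void **)&flags_h, kWarps * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&out_h, kWarps * sizeof(int), cudaHostAllocMapped);
    for (int w = 0; w < kWarps; ++w) { flags_h[w] = 0; out_h[w] = 0; }

    int *flags_d = nullptr, *out_d = nullptr;
    cudaHostGetDevicePointer((void **)&flags_d, flags_h, 0);
    cudaHostGetDevicePointer((void **)&out_d, out_h, 0);

    per_warp_wait<<<1, kWarps * 32>>>(flags_d, out_d);

    // Host releases the warps one at a time, in reverse order.
    volatile int *flags_v = flags_h;
    for (int w = kWarps - 1; w >= 0; --w) flags_v[w] = 100 + w;

    cudaDeviceSynchronize();
    for (int w = 0; w < kWarps; ++w) printf("warp %d saw %d\n", w, out_h[w]);

    cudaFreeHost(flags_h);
    cudaFreeHost(out_h);
    return 0;
}
```

While lane 0's load is outstanding, the warp is not eligible to issue, so the scheduler can pick other warps in the meantime, which is the effect described above.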