Stalling/Suspending a warp

Is there a way to put an eligible warp into a stalled/sleeping/suspended state? By eligible, I mean a warp that is ready to issue its next instruction.

I have an application where each warp needs to wait for a flag (unique to that warp) before it can proceed. One naive solution is to busy-wait on the flag, but this wastes GPU resources that could otherwise be used by other warps for useful work. Thus, I am looking for a way to stall a warp until its flag is set.

If you’re on an sm_20+ device, you can create up to 16 unique warp synchronization barriers in a block.

I recently used the bar.arrive/sync instructions and they work as described.

The PTX manual has the details on the “bar” instruction:

8.7.12.1. Parallel Synchronization and Communication Instructions: bar

Here are the routines that I used:

// Assumed definitions for the helper macro and type used below
// (add them if your project does not already define them):
#define DEVICE_STATIC_INTRINSIC_QUALIFIERS  static __device__ __forceinline__
typedef unsigned int u32;

// Signal arrival at named barrier 'id' without waiting.
// 'threads' is the total number of threads participating in the barrier
// and must be a multiple of the warp size.
DEVICE_STATIC_INTRINSIC_QUALIFIERS
void
__bar_arrive(const u32 id, const u32 threads)
{
#if __CUDA_ARCH__ >= 200
  asm volatile ("bar.arrive %0, %1;" : : "r"(id), "r"(threads));
#endif
}

// Wait at named barrier 'id' until 'threads' threads have arrived.
DEVICE_STATIC_INTRINSIC_QUALIFIERS
void
__bar_sync(const u32 id, const u32 threads)
{
#if __CUDA_ARCH__ >= 200
  asm volatile ("bar.sync %0, %1;" : : "r"(id), "r"(threads));
#endif
}
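
To make the calling convention concrete, here is a small producer/consumer sketch of my own (assuming a 1-D block of at least 64 threads): warp 0 publishes data and arrives at named barrier 1 without waiting, while warp 1 syncs on the same barrier with a matching participant count of 64.

__global__ void producer_consumer_demo(int *data)
{
  const unsigned int warp_id = threadIdx.x / 32;

  if (warp_id == 0)                    // producer warp
  {
    data[threadIdx.x] = threadIdx.x;   // produce something
    __threadfence_block();             // make the write visible to warp 1
    __bar_arrive(1, 64);               // signal arrival, do not wait
  }
  else if (warp_id == 1)               // consumer warp
  {
    __bar_sync(1, 64);                 // wait until 64 threads have arrived
    int v = data[threadIdx.x - 32];    // safe to read warp 0's data now
    (void)v;
  }
}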

There are also alternatives to using warp barriers:

Fine-Grained Synchronizations and Dataflow Programming on GPUs

One caveat… I found that debugging these named barriers in Nsight was not a smooth process.

Thanks allanmac! Looks like an interesting way to do inter-warp synchronization.

Unfortunately, I forgot to mention that the flags are set by the host. Under this assumption, do you still believe that I can use the “bar”? Isn’t bar more suitable for inter-warp synchronization?

Still, I think I may have a simple idea for using the bars to implement synchronization between the CPU and the GPU. The idea is that warp1 waits on an event (using __bar_sync) that will be triggered by warp2. That is, warp2 is responsible for checking warp1’s flag and “awakening” warp1 (using __bar_arrive) once the flag has been set by the host. A rough sketch of what I have in mind is below.
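
Roughly something like this (just a sketch under my assumptions: the names are made up, the flag lives in host-mapped zero-copy memory, and the block has at least 64 threads):

__global__ void persistent_kernel(volatile unsigned int *flag)  // host sets *flag to 1
{
  const unsigned int warp_id = threadIdx.x / 32;

  if (warp_id == 0)              // "warp1": the warp that must wait
  {
    __bar_sync(1, 64);           // stall at named barrier 1 until woken
    // ... do the work that was gated on the flag ...
  }
  else if (warp_id == 1)         // "warp2": the warp that watches the flag
  {
    while (*flag == 0)           // naive spin on the host-set flag
      ;
    __bar_arrive(1, 64);         // wake warp1 without waiting ourselves
  }
}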

I have not put much thought into this yet, and I must admit I have not read the paper either, but I would appreciate your opinion!

Got it…

If your host is driving processing then I would first determine if a simple kernel launch is good enough for your application. Kernel launch overhead and latency are quite low (for most purposes).

Were you thinking of making your kernels stay resident and monitoring page-locked or PCIe mapped memory? If so, then you should take a look at __threadfence_system().
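
For example, a minimal sketch of that pattern (the names here are placeholders): a resident thread writes a result to mapped memory and only then raises a flag, with __threadfence_system() ensuring the host sees the result before it sees the flag.

__global__ void publish_to_host(volatile int *result, volatile int *done)
{
  if (threadIdx.x == 0 && blockIdx.x == 0)
  {
    *result = 42;              // write the payload to mapped memory
    __threadfence_system();    // order the write before the flag, system-wide
    *done = 1;                 // host polls this flag
  }
}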

You should convince yourself that a non-terminating kernel is actually a win over a more mundane approach.

Any busy loop that includes a device or system memory access is going to spend most of its time waiting on one of the internal hardware barriers. Other warps not in that busy loop will be free to continue processing. This is how spinlocks are designed to operate with the atomic.cas/exch instructions.
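
For instance, a textbook device-side spinlock built on those instructions looks like this (a generic sketch, not code from this thread); the looping warp mostly sits stalled on the atomic's memory latency rather than burning issue slots:

__device__ void lock(int *mutex)
{
  // atomicCAS maps to atom.cas; spin until we flip 0 -> 1.
  // Note: have only one thread per warp contend for the lock to avoid
  // intra-warp livelock on pre-Volta hardware.
  while (atomicCAS(mutex, 0, 1) != 0)
    ;
}

__device__ void unlock(int *mutex)
{
  __threadfence();             // make protected writes visible first
  atomicExch(mutex, 0);        // atom.exch releases the lock
}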

Yes allanmac, I indeed want to make my kernel non-terminating! The reason is that the kernel launch latency is too high for my application, which is a packet-processing application.

Also, as far as I know, __threadfence_system() is used to avoid race conditions, i.e., to ensure that a thread’s memory operations are visible before it continues execution. In my case, however, I don’t need that kind of memory ordering between threads. What I need is a way for a warp to block on a specific memory location (the flag) until it is set - and that primitive obviously does not exist.

And BTW, my application targets the Jetson dev kit, which features an integrated GPU with unified CPU-GPU memory, and hence I use zero-copy memory.

scottgray, do you mean that busy-waiting on memory is not that costly after all, since most of the time the warp is stalled waiting for the memory operation to complete?

That’s exactly right. If possible, you want to minimize the bandwidth consumed, so have only one thread per warp make the request. And you can of course make that one request also serve as the signal to exit the loop.
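
A sketch of that pattern (my own, assuming a 1-D block and CUDA 9+ for __shfl_sync): lane 0 issues the only load and broadcasts the value, so the whole warp exits the loop off a single memory request.

__device__ unsigned int warp_wait_for_flag(volatile unsigned int *flag)
{
  unsigned int value = 0;

  do
  {
    if (threadIdx.x % 32 == 0)                  // lane 0 issues the only load
      value = *flag;                            // warp stalls here on memory latency
    value = __shfl_sync(0xffffffff, value, 0);  // broadcast to the other lanes
  }
  while (value == 0);                           // the same load doubles as the exit signal

  return value;
}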

Though I should point out that if you have any __syncthreads() calls after the busy loop, you could end up blocking the whole block.