while (global_counter > 0) {
    if (global_array[thread_id] == 1) {
        // 1. Do some work (might enable some other threads)
        // 2. Update global_counter atomically
    }
}
What I fear is that some threads could keep spinning until they become enabled or until global_counter reaches zero. I already have a solution that doesn’t work this way, but I’m trying to avoid unnecessary kernel launches. Would these indefinitely looping threads cause a problem?
Having certain threads spin until their work becomes available is almost universally the wrong approach in CUDA.
In your case I see the problem that threads might exit if global_counter temporarily drops to zero. Their work would then never get done, and the kernel deadlocks.
Furthermore, the approach seems quite inefficient: in most cases warps would probably run with only a single thread enabled, wasting 31/32 (or 96.875%) of the resources.
One possible solution would be to keep a list of items to do (using atomic operations) and have a warp work on them as soon as the list has at least 32 entries. However, this still seems overly complicated, and I’d guess that better solutions can be found for concrete problems.
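For illustration, here is a minimal sketch of what I mean. All names (push_item, grab_warp_batch, LIST_CAP, …) are made up, and the list is assumed large enough that it never wraps around. A producer reserves a slot with atomicAdd and publishes it via a flag; lane 0 of a warp claims 32 entries in one go, but only once that many have been reserved:

#define LIST_CAP 65536                         // assumed large enough; no wraparound

__device__ int          work_list[LIST_CAP];   // task payloads (plain ints here)
__device__ volatile int item_ready[LIST_CAP];  // 1 once work_list[i] is written
__device__ int          list_head;             // next entry to consume
__device__ int          list_tail;             // next entry to reserve

// Producer: reserve a slot atomically, write the item, then publish it.
__device__ void push_item(int item)
{
    int slot = atomicAdd(&list_tail, 1);
    work_list[slot] = item;
    __threadfence();                           // payload must be visible first
    item_ready[slot] = 1;
}

// Consumer: lane 0 claims 32 entries at once, but only when at least
// 32 have been reserved; the base index is then broadcast to the warp.
__device__ bool grab_warp_batch(int lane, int *item)
{
    int base = -1;
    if (lane == 0) {
        int cur = atomicAdd(&list_head, 0);    // atomic read of list_head
        if (atomicAdd(&list_tail, 0) - cur >= 32 &&
            atomicCAS(&list_head, cur, cur + 32) == cur)
            base = cur;                        // this warp owns [cur, cur+32)
    }
    base = __shfl_sync(0xffffffff, base, 0);   // broadcast lane 0's result
    if (base < 0)
        return false;                          // no full batch available yet
    while (item_ready[base + lane] == 0)       // the slot is already reserved,
        ;                                      // so its producer will publish soon
    *item = work_list[base + lane];
    return true;
}

The atomicCAS is what guarantees that no two warps ever claim the same batch of 32 entries.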
In my particular application global_counter cannot temporarily drop to zero; if it does, it means my task is done. The optimization you suggested is on my list of things to do, but right now my main concern is whether a couple of spinning threads would badly affect the execution (scheduling) of the other threads. If they don’t, then I believe I can squeeze a significant performance gain out of my GPU.
On top of that I can easily implement the optimization you suggested and get even more out of it!
There is a certain difference in philosophy - don’t assign tasks to threads (which would force the thread to wait if the task is not ready yet). Instead, have idle threads grab the next task available, regardless of any thread ID.
This sounds good! I have two questions, though:
1. How is such a “grab” operation normally implemented in GPU code, while making sure that no two threads grab the same task? Atomic operations?
2. What if one task can dynamically spawn several other tasks? In that case we cannot deploy new threads on the GPU, and we cannot have anything like thread pools either.
Google “CUDA persistent threads” for some more info. One of the first fields where the concept was employed was raytracing, where a single ray hitting an object could trigger multiple secondary rays (or none at all). Looking at your description again, yours sounds quite similar.
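To give you something concrete for both questions, here is a rough sketch of the persistent-threads pattern. All names are invented, and the queue is assumed large enough that it never overflows. A task is “spawned” simply by appending it to a global queue (question 2), “grabbing” is a CAS race on the queue head so no two threads ever claim the same entry (question 1), and a pending counter, essentially your global_counter, tells the resident threads when everything is done:

#define QUEUE_CAP 65536                       // assumed to hold all tasks ever spawned

__device__ int          queue[QUEUE_CAP];     // task payloads (plain ints here)
__device__ volatile int slot_ready[QUEUE_CAP];// 1 once queue[i] is written
__device__ int          head;                 // next entry to claim
__device__ int          tail;                 // next entry to reserve
__device__ int          pending;              // tasks created but not yet completed

// Question 2: spawning a task is just an atomic append -- a task may push
// any number of children before it retires, no new threads needed.
__device__ void spawn(int task)
{
    atomicAdd(&pending, 1);                   // count it before it becomes visible
    int slot = atomicAdd(&tail, 1);           // reserve a slot
    queue[slot] = task;
    __threadfence();                          // publish payload before the flag
    slot_ready[slot] = 1;
}

// Question 1: the CAS race on 'head' guarantees that no two threads
// ever grab the same entry.
__device__ bool try_grab(int *task)
{
    int cur = atomicAdd(&head, 0);            // atomic read of head
    while (cur < atomicAdd(&tail, 0)) {       // at least one entry is reserved
        int old = atomicCAS(&head, cur, cur + 1);
        if (old == cur) {                     // we won slot 'cur'
            while (slot_ready[cur] == 0)      // its producer already reserved it,
                ;                             // so this wait is short
            *task = queue[cur];
            return true;
        }
        cur = old;                            // lost the race; retry at new head
    }
    return false;                             // queue is empty right now
}

__device__ void process(int task)             // placeholder for the real work
{
    if (task > 1) {                           // dummy rule: a task may spawn
        spawn(task / 2);                      // further tasks dynamically
        spawn(task - task / 2);
    }
}

// Launched once, with roughly as many threads as the GPU keeps resident.
__global__ void persistent_worker()
{
    int task;
    // Exit only when every task, including dynamically spawned ones, is done.
    // Children are counted in 'pending' before their parent retires, so the
    // counter never drops to zero prematurely.
    while (atomicAdd(&pending, 0) > 0) {
        if (try_grab(&task)) {
            process(task);
            atomicSub(&pending, 1);           // retire the parent last
        }
    }
}

On the host side you would set pending to the number of seed tasks, write the seeds into the queue (e.g. with cudaMemcpyToSymbol), and launch persistent_worker once with enough blocks to fill the device. The busy-waits rely on producing warps making progress, which is the usual caveat of persistent threads on older hardware.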