I’m doing some experiments with persistent threads to do task scheduling on the GPU, and I’m seeing hangs in the short CUDA program below that I don’t understand.
In the simplified test case below, the idea is that the 0th lane of each warp will get a task number by doing an atomic increment on a counter in global memory, broadcast that task id to the rest of the threads in the warp, and then all of the threads will do some work based on it. After the work is done, the 0th lane in the warp updates a data structure with some statistics about the work they all did, again using an atomic to allocate a slot in the statistics array. Once the task count hits the total number of jobs to run, the warps exit.
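(TaskStats is just a small per-task statistics struct in my real code; for this stripped-down repro you can assume it's something like the following.)

struct TaskStats {
    int warpNum;   // which warp processed the task; the real struct has more fields
};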
#define WARPS_PER_SM 16
#define WARP_NUM (threadIdx.y + blockIdx.x * blockDim.y)
#define N_TASKS 512
static __device__ unsigned int d_taskIndex;
static __device__ TaskStats *d_taskStats;
static __device__ int d_taskStatsOffset;
static __constant__ int d_numTaskStats;
__global__ void run() {
    while (true) {
        // Have the first thread of the warp get a task id; all the threads
        // in the warp will work on this task together.
        __shared__ volatile int tn[WARPS_PER_SM];
        if (threadIdx.x == 0)
            tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);

        // Broadcast the task id to the other threads in the warp. Note
        // that the break statement below shouldn't cause any divergence;
        // all lanes in the warp should go the same way here.
        int taskId = tn[threadIdx.y];
        if (taskId >= N_TASKS) break;

        // Normally, we'd do a bunch of work here, keyed off of taskId....

        // After the work is all done, we want to write out some statistics
        // for the work done in this warp; again just have the first thread
        // in the warp do this.
        if (threadIdx.x == 0) {
            int offset = atomicAdd(&d_taskStatsOffset, 1);
            if (offset < d_numTaskStats)
                d_taskStats[offset].warpNum = WARP_NUM;
        }

        // If this is commented out, the GPU seems to go into an infinite
        // loop and the program has to be killed. If it's present, then
        // the code runs as expected.
        // __threadfence_block();
    }
}
The code seems to hang or go into an infinite loop as written, and I don’t understand why. It runs fine if I compile with ‘-g -G’ to try to debug it, and it also runs fine if I add the commented-out __threadfence_block() call. (It even runs fine if I write the second ‘do this once per warp’ test as (threadIdx.x == 1) ?!)
My understanding is that because I’m only communicating between lanes in a warp, I don’t need to do any kind of synchronization (and I assume, by extension, memory fences), and that declaring the shared memory I use to communicate as volatile should be enough for it to run as expected.
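To make that assumption concrete, here’s a minimal sketch of the pattern I’m treating as warp-safe: lane 0 writes to volatile shared memory and the other lanes of the warp read it back with no explicit sync or fence. (This is a toy kernel just to illustrate the assumption, not code I’ve actually run; ‘out’ is a hypothetical output buffer with one slot per thread.)

__global__ void broadcastTest(int *out) {
    __shared__ volatile int v[WARPS_PER_SM];
    if (threadIdx.x == 0)
        v[threadIdx.y] = 42 + WARP_NUM;   // producer: lane 0 of each warp
    // No __syncthreads() or fence here; relying on the warp executing in
    // lockstep and on 'volatile' to keep the load from being optimized away.
    int x = v[threadIdx.y];               // consumers: all 32 lanes
    out[WARP_NUM * 32 + threadIdx.x] = x;
}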
Is that understanding correct? Am I misunderstanding something else about the execution model that requires the __threadfence_block()? Or is there another issue with the code, and the __threadfence_block() is just masking the real problem?
Thanks for any guidance!
-matt
(This is with CUDA 3.2.9 on Linux x64 (Ubuntu 10.04), the 260.19.12 development driver, and running on a GTX460. Since the 460 has 7 SMs, I’m launching 7 blocks with dimension 32x16 to (sort-of) fill the machine with threads.)
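In case it’s relevant, the host-side setup is roughly the following (reconstructed from memory, so treat it as a sketch; the real program also copies the stats back and checks them):

int main() {
    // Zero the task counter and the stats-slot counter.
    unsigned int zero = 0;
    cudaMemcpyToSymbol(d_taskIndex, &zero, sizeof(zero));
    int izero = 0;
    cudaMemcpyToSymbol(d_taskStatsOffset, &izero, sizeof(izero));

    // Allocate the stats array and publish its pointer and size.
    int numStats = N_TASKS;
    cudaMemcpyToSymbol(d_numTaskStats, &numStats, sizeof(numStats));
    TaskStats *stats;
    cudaMalloc((void **)&stats, numStats * sizeof(TaskStats));
    cudaMemcpyToSymbol(d_taskStats, &stats, sizeof(stats));

    // 7 blocks (one per SM on the GTX 460), 32x16 threads per block.
    run<<<7, dim3(32, 16)>>>();
    cudaThreadSynchronize();   // CUDA 3.2-era API

    cudaFree(stats);
    return 0;
}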