inter-warp synchronization troubles with persistent threads (__threadfence_block() ?)

I’m seeing some behavior I don’t understand in the short CUDA program below. I’m experimenting with persistent threads to do task scheduling on the GPU and am hitting hangs that I can’t explain.

In the simplified test case below, the idea is that the 0th lane of each warp gets a task number by doing an atomic increment on a counter in device (global) memory, broadcasts that task id to the rest of the threads in the warp, and then all the threads do some work based on it. After the work is done, the 0th lane in the warp updates a data structure with some statistics about the work they all did, again using an atomic to allocate a slot in the statistics array. Once the task count hits the total number of jobs to run, the warps exit.

#define WARPS_PER_SM 16
#define WARP_NUM (threadIdx.y + blockIdx.x * blockDim.y)
#define N_TASKS 512

static __device__ unsigned int d_taskIndex;
static __device__ TaskStats *d_taskStats;
static __device__ int d_taskStatsOffset;
static __constant__ int d_numTaskStats;

__global__ void run() {
    while (true) {
        // Have the first thread of the warp get a task id; all the threads
        // in the warp will work on this task together.
        __shared__ volatile int tn[WARPS_PER_SM];
        if (threadIdx.x == 0)
            tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);

        // Broadcast the task id to the other threads in the warp.  Note
        // that the break statement below shouldn't cause any divergence;
        // all lanes in the warp should go the same way here.
        int taskId = tn[threadIdx.y];
        if (taskId >= N_TASKS) break;

        // Normally, we'd do a bunch of work here, keyed off of taskId....

        // After the work is all done, we want to write out some statistics
        // for the work done in this warp; again just have the first thread
        // in the warp do this.
        if (threadIdx.x == 0) {
            int offset = atomicAdd(&d_taskStatsOffset, 1);
            if (offset < d_numTaskStats)
                d_taskStats[offset].warpNum = WARP_NUM;
        }

        // If this is commented out, the GPU seems to go into an infinite
        // loop and the program has to be killed.  If it's present, then
        // the code runs as expected.
//        __threadfence_block();
    }
}

The code seems to hang or go into an infinite loop as written, but I don’t understand why. It runs fine if I compile with ‘-g -G’ to try to debug it, and it also runs fine if I uncomment the __threadfence_block() call. (It even runs fine if I write the second ‘do this once per warp’ test as (threadIdx.x == 1) ?!)

My understanding is that because I’m only communicating between lanes in a warp, I don’t need to do any kind of synchronization (and, by extension, I assume no memory fences), and that declaring the shared memory I use to communicate as volatile should be enough for it to run as expected.
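
(For context, the kind of warp-synchronous idiom I’m assuming is legal here is the classic volatile shared-memory reduction from the SDK samples; a minimal sketch, not code from this test case:)

// Warp-synchronous tail of a block reduction: no __syncthreads() between steps,
// relying on the warp executing in lockstep, with 'volatile' forcing each access
// to actually go to shared memory.  Assumes tid < 32 and at least 64 elements.
__device__ void warpReduce(volatile float *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}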

Is that understanding correct? Am I misunderstanding something else about the execution model that requires the __threadfence_block()? Or is there another issue with the code, and the __threadfence_block() is just masking the real problem?

Thanks for any guidance!

-matt

(This is with CUDA 3.2.9 on Linux x64 (Ubuntu 10.04), the 260.19.12 development driver, and running on a GTX460. Since the 460 has 7 SMs, I’m launching 7 blocks with dimension 32x16 to (sort-of) fill the machine with threads.)
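
(For reference, the host side of the launch looks roughly like this; a minimal sketch with the setup abbreviated:)

// d_taskIndex and d_taskStatsOffset are zeroed and d_taskStats is allocated,
// with its pointer copied to the device symbol, before the launch (omitted here).
dim3 block(32, 16);       // x = lanes within a warp, y = warps per block
dim3 grid(7);             // one block per SM on the GTX 460
run<<<grid, block>>>();
cudaThreadSynchronize();  // CUDA 3.2-era synchronization call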

Hi,

Your block dimensions are 32x16… This means you have 32 warps per block, and the threads will have ids in the range [0 - 511] within a block.

If you want threads in a warp to do the same thing, you need to change the logic.

Threads 0-15 make one warp and execute each instruction simultaneously; threads 16-31 together; threads 32-47 together; and so on.

You can have something like this:

if (threadIdx.x % WARPS_PER_SM == 0)
    tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);

Let me know if this works…!!!

Unfortunately that’s not it. What I meant is that the x dimension of the blocks is 32 and the y dimension is 16 (I just didn’t explain it clearly). And I have verified the thread numbering via printfs as well…

Thanks, though!

-matt

I’m sorry for jumping into an old thread, but I’m highly curious whether a solution for persistent threads exists. (I’m struggling with them in OpenCL. I found the article “Understanding the Efficiency of Ray Traversal on GPUs” by NVIDIA researchers T. Aila and S. Laine cited quite widely, so persistent threads do seem to be promising and functional…)

However, I still think there should be a memory fence at least before
int taskId = tn[threadIdx.y];
since, according to NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf, volatile just ensures that the variable is actually read from memory each time it is accessed. The other threads could still read it before the 0th thread has changed its value. (I reckon only 32 threads are physically concurrent on one multiprocessor, so that seems possible to me.)
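
Something like this is what I mean (a sketch of the placement, using the kernel from the first post):

        if (threadIdx.x == 0)
            tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);
        // Fence here, so lane 0's write to tn is visible to the rest of
        // the warp before they read it below.
        __threadfence_block();
        int taskId = tn[threadIdx.y];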

Please let me know how far you get with persistent threads.

Edit: I finally figured out how to make persistent threads work. The local block size must be equal to the warp size (32), and updating the driver to 260.19.21 helped as well. Hope this will be useful for someone.
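
In CUDA terms, applied to the kernel above, the configuration that worked looks roughly like this (a sketch; the grid size is only an example):

// One warp per block, so all block-local communication is warp-synchronous.
// With blockDim.y == 1, threadIdx.y is always 0 and tn[0] is the only slot used.
dim3 block(32, 1);    // block size == warp size
dim3 grid(7 * 16);    // e.g. 16 blocks per SM on a 7-SM GPU
run<<<grid, block>>>();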