inter-warp synchronization troubles with persistent threads (__threadfence_block() ?)

I’m seeing some behavior I don’t understand in the short CUDA program below. I’m experimenting with persistent threads to do task scheduling on the GPU and am hitting hangs that I can’t explain.

In the simplified test case below, the idea is that the 0th lane of each warp gets a task number by doing an atomic increment on a counter in device (global) memory, broadcasts that task id to the rest of the threads in the warp, and then all the threads do some work based on it. After the work is done, the 0th lane in the warp updates a data structure with some statistics about the work they all did, again using an atomic to allocate a slot in the statistics array. Once the task count hits the total number of jobs to run, the warps exit.

#define WARPS_PER_SM 16
#define WARP_NUM (threadIdx.y + blockIdx.x * blockDim.y)
#define N_TASKS 512

static __device__ unsigned int d_taskIndex;
static __device__ TaskStats *d_taskStats;
static __device__ int d_taskStatsOffset;
static __constant__ int d_numTaskStats;

__global__ void run() {
    while (true) {
        // Have the first thread of the warp get a task id; all the threads
        // in the warp will work on this task together.
        __shared__ volatile int tn[WARPS_PER_SM];
        if (threadIdx.x == 0)
            tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);

        // Broadcast the task id to the other threads in the warp.  Note
        // that the break statement below shouldn't cause any divergence;
        // all lanes in the warp should go the same way here.
        int taskId = tn[threadIdx.y];
        if (taskId >= N_TASKS) break;

        // Normally, we'd do a bunch of work here, keyed off of taskId....

        // After the work is all done, we want to write out some statistics
        // for the work done in this warp; again just have the first thread
        // in the warp do this.
        if (threadIdx.x == 0) {
            int offset = atomicAdd(&d_taskStatsOffset, 1);
            if (offset < d_numTaskStats)
                d_taskStats[offset].warpNum = WARP_NUM;
        }

        // If this is commented out, the GPU seems to go into an infinite
        // loop and the program has to be killed.  If it's present, then
        // the code runs as expected.
//        __threadfence_block();
    }
}

The code seems to hang or go into an infinite loop as written, but I don’t understand why. It runs fine if I compile with ‘-g -G’ to try to debug it, and it also runs fine if I uncomment the __threadfence_block() call. (It even runs fine if I write the second ‘do this once per warp’ test as (threadIdx.x == 1) ?!)

My understanding is that because I’m only communicating between lanes in a warp, I don’t need to do any kind of synchronization (and, by extension, I assume no memory fences), and that declaring the shared memory I use to communicate as volatile should be enough for it to run as expected.
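
(For context, the kind of warp-synchronous idiom I’m assuming is legal here is the classic volatile shared-memory reduction from the SDK samples; a minimal sketch, not code from this test case:)

// Warp-synchronous tail of a block reduction: no __syncthreads() between steps,
// relying on the warp executing in lockstep, with 'volatile' forcing each access
// to actually go to shared memory.  Assumes tid < 32 and at least 64 elements.
__device__ void warpReduce(volatile float *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}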

Is that understanding correct? Am I misunderstanding something else about the execution model that requires the __threadfence_block()? Or is there another issue with the code, and the __threadfence_block() is just masking the real problem?

Thanks for any guidance!

-matt

(This is with CUDA 3.2.9 on Linux x64 (Ubuntu 10.04), the 260.19.12 development driver, and running on a GTX460. Since the 460 has 7 SMs, I’m launching 7 blocks with dimension 32x16 to (sort-of) fill the machine with threads.)
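
(For reference, the host side of the launch looks roughly like this; a minimal sketch with the setup abbreviated:)

// d_taskIndex and d_taskStatsOffset are zeroed and d_taskStats is allocated,
// with its pointer copied to the device symbol, before the launch (omitted here).
dim3 block(32, 16);       // x = lanes within a warp, y = warps per block
dim3 grid(7);             // one block per SM on the GTX 460
run<<<grid, block>>>();
cudaThreadSynchronize();  // CUDA 3.2-era synchronization call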

Hi,

Your block dimensions are 32x16… This means you have 32 warps per block, and the threads will have ids in the range [0 - 511] within a block.

If you want threads in a warp to do the same thing, you need to change the logic.

Threads 0-15 make one warp and execute each instruction simultaneously; threads 16-31 together; threads 32-47 together; and so on.

You can have something like this:

if (threadIdx.x % WARPS_PER_SM == 0)
    tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);

Let me know if this works…!!!

Unfortunately that’s not it. What I meant is that the x dimension of the blocks is 32 and the y dimension is 16 (I just didn’t explain it clearly). And I have verified the thread numbering via printfs as well…

Thanks, though!

-matt

I’m sorry for jumping into an old thread, but I’m highly curious whether a solution for persistent threads exists. (I’m struggling with them in OpenCL. I found the article “Understanding the Efficiency of Ray Traversal on GPUs” by NVIDIA researchers T. Aila and S. Laine cited quite widely, so persistent threads do seem to be promising and functional…)

However, I still think there should be a memory fence at least before
int taskId = tn[threadIdx.y];
since, according to NVIDIA_CUDA_C_ProgrammingGuide_3.1.pdf, volatile just ensures that the variable is actually read from memory each time it is accessed. The other threads could still read it before the 0th thread has changed its value. (I reckon only 32 threads are physically concurrent on one multiprocessor, so that seems possible to me.)
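
Something like this is what I mean (a sketch of the placement, using the kernel from the first post):

        if (threadIdx.x == 0)
            tn[threadIdx.y] = atomicAdd(&d_taskIndex, 1);
        // Fence here, so lane 0's write to tn is visible to the rest of
        // the warp before they read it below.
        __threadfence_block();
        int taskId = tn[threadIdx.y];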

Please let me know how far you get with persistent threads.

Edit: I finally figured out how to make persistent threads work. The local block size must be equal to the warp size (32), and updating the driver to 260.19.21 helped as well. Hope this will be useful for someone.
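
In CUDA terms, applied to the kernel above, the configuration that worked looks roughly like this (a sketch; the grid size is only an example):

// One warp per block, so all block-local communication is warp-synchronous.
// With blockDim.y == 1, threadIdx.y is always 0 and tn[0] is the only slot used.
dim3 block(32, 1);    // block size == warp size
dim3 grid(7 * 16);    // e.g. 16 blocks per SM on a 7-SM GPU
run<<<grid, block>>>();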