Global thread barrier

A variation on the theme, taking advantage of the fact that __threadfence() slows down execution: spinning on it actually increases efficiency by hammering the memory controller less, which gives newly completed blocks a fair chance to write out their results and signal that they are done:

[codebox]__device__ unsigned int count = 0;   // global barrier counter, must start at 0

__device__ void global_sync(int blocks, unsigned int* count)
{
    __syncthreads();
    __threadfence();   // make this block's global writes visible before signalling

    if (threadIdx.x == 0)
    {
        // The last block to arrive wraps the counter back to 0 and falls through;
        // every other block spins (gently, via __threadfence) until that happens.
        if (atomicInc(count, blocks - 1) != (blocks - 1))
            do { __threadfence(); } while (atomicOr(count, 0));
    }

    __syncthreads();
}
[/codebox]
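
For reference, a minimal sketch of how such a barrier might be used (not from the original post): a hypothetical kernel that runs several iterations entirely on the device, launched with no more blocks than can run concurrently, e.g. one block per multiprocessor as in the GT220 case below:

[codebox]// Hypothetical usage sketch: iterations separated by global_sync(),
// reusing the __device__ counter declared above.
__global__ void iterate(float* data, int iterations)
{
    for (int it = 0; it < iterations; ++it)
    {
        // ... one iteration's worth of work on data ...
        global_sync(gridDim.x, &count);   // every block must reach this point
    }
}

// Host side: launch no more blocks than can run concurrently,
// for example one block per multiprocessor:
//   cudaDeviceProp prop;
//   cudaGetDeviceProperties(&prop, 0);
//   iterate<<<prop.multiProcessorCount, 256>>>(d_data, 10);
[/codebox]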

A guesstimate - based on how much compute-bound code I can add (or have to omit) in order not to break a realtime kernel - is that the call adds about one microsecond to the longest code path (perhaps a little less) on a GT220 running one block on each of its 6 multiprocessors.

Yes, if you’re using 4 devices to increase the amount of data to compute and no task division is possible. I recently tried the following test, timing the kernel call overhead. For one device I got about 8 microseconds, and as bad as 11 if three uint4* and two uint* were passed as arguments.
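
The exact test isn't shown above, but a launch-overhead measurement of that kind could look roughly like this (a sketch, assuming a hypothetical empty kernel and timing with CUDA events):

[codebox]// Hypothetical timing sketch: measure kernel launch overhead with CUDA events.
#include <cstdio>

__global__ void empty_kernel(uint4* a, uint4* b, uint4* c,
                             unsigned int* d, unsigned int* e) {}

void time_launch_overhead()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int N = 1000;                      // average over many launches
    cudaEventRecord(start, 0);
    for (int i = 0; i < N; ++i)
        empty_kernel<<<1, 1>>>(0, 0, 0, 0, 0);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // milliseconds for N launches
    printf("launch overhead: %f us per call\n", 1000.0f * ms / N);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
[/codebox]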

Then I programmed an OpenMP version to time kernel calls on four devices, ensuring mutual exclusion between kernel calls in different host threads. What I got was a negligible overhead on account of the mutual exclusion, and 41 microseconds of kernel call overhead (per device!!!). Furthermore, the tests I’ve been running suggest that, on multi-GPU, calls to the runtime API must be made in mutual exclusion, but the mutex locking and unlocking overhead doesn’t explain the extra cost of having each host thread drive a different device, regardless of whether the host threads are on different cores or not.
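
A rough sketch of that kind of OpenMP setup (one host thread per device, with the CUDA runtime calls wrapped in a critical section); the kernel, launch configuration and buffers are placeholders, not the actual code from the test above:

[codebox]// Hypothetical sketch of an OpenMP multi-GPU driver with the CUDA runtime
// calls serialized by a critical section, as described in the post above.
#include <omp.h>

__global__ void my_kernel(float* d) { /* real work goes here */ }   // placeholder

void run_on_four_devices(float* h_data[4], float* d_data[4], size_t bytes)
{
    #pragma omp parallel num_threads(4)
    {
        int dev = omp_get_thread_num();
        cudaSetDevice(dev);                 // bind this host thread to one device

        #pragma omp critical                // mutual exclusion around runtime calls
        {
            cudaMemcpy(d_data[dev], h_data[dev], bytes, cudaMemcpyHostToDevice);
            my_kernel<<<30, 256>>>(d_data[dev]);   // placeholder launch configuration
        }

        cudaThreadSynchronize();            // wait for this device's kernel to finish

        #pragma omp critical
        {
            cudaMemcpy(h_data[dev], d_data[dev], bytes, cudaMemcpyDeviceToHost);
        }
    }
}
[/codebox]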

All that said, I obtain better performance programming a single device than I obtain programming more than one. This happens because my kernel is iterative and needs to be called from the host side several times, and each iteration takes about 3 ms to complete. With a global thread barrier, which incidentally is just __threadfence_system() in devices of compute capability 2.0, I could call the kernel once for a given number of iterations and have it return to the host thread much later than 3 ms. That way the host would be able to send further data to other devices and call their respective kernels, without interference while launching kernels or doing cudaMemcpy. As I’m not trying to speed up a 30-hour application to 5 minutes, but rather to take 2.5 seconds in Matlab to at least 90 MB/s of throughput (not overall, just the amount of data the LDPC decoding kernel decodes), the penalties suffered from host-thread interference on CUDA calls are too heavy for that 90 MB/s requirement.

__threadfence_system() isn’t a global barrier…

Sorry, but I don’t think __threadfence_system() is a global barrier; it just makes sure that the writes to host memory are completed as well. You will still need to do the trick of blocking all the threads from progressing until all the blocks reach that line. And make sure you don’t have more blocks than can be run concurrently on the hardware.

pg 113, B.5, CUDA Programming Guide version 3.0:

void __threadfence_system();

waits until all global and shared memory accesses made by the calling thread prior to __threadfence_system() are visible to:

All threads in the thread block for shared memory accesses,
All threads in the device for global memory accesses,
Host threads for page-locked host memory accesses.

Maybe you’re right about the need for a grid with only as many blocks as can run concurrently, but __threadfence_system() will do the same as the grid sync that is achieved implicitly with a kernel call, or explicitly with cudaThreadSynchronize(), but on the device side (edit: concerning data coherency, that is, but hey, that’s a memory fence).

EDIT: Sorry, I was confusing the concept of a barrier with a memory fence (hard to imagine). It is no barrier at all, yes. It seems I’ve been talking rubbish in some posts.

With a couple more tricks it can be turned into one, no? But having a memory-fence instruction covering all blocks in the grid enables an iterative kernel to be fully programmed on the device side, with the host not calling the kernel several times, as data coherency is ensured by __threadfence_system(). Am I wrong?

The results for one persistent block running an MGVF algorithm are presented in something called The Art of Performance Tuning for CUDA and Manycore Architectures. The thing is, if your application’s running time is compute-bound and takes a long time in a non-parallel implementation, then 2) is still a good option. It only occupies a 30th of the GPU’s SMs, but a good speedup is still expected.

But if the application’s running time on the GPU is still compute-bound, wouldn’t increasing the amount of data by a factor of 30 to occupy the entire GPU (assuming 1 block per SM due to shared memory or register constraints) be a nice and feasible approach in the absence of a specific grid-sync instruction on the device?

I’m struggling with an iterative algorithm too…

Not in theory. In practice? Good luck. You are looking down a very perilous path.

What is your task? Maybe it could be done in a more CUDA-natural form than an iterative algorithm. Or use shared memory, etc. If you need to check neighbouring cells of a data array, you could place a larger array in shared memory and complete many iterations there, and afterwards merge the data back into global memory.

I’m not sure why you find that you need mutual exclusion for multi-GPU, maybe due to the way that you are utilizing OpenMP? The CUDA runtime is most certainly capable of driving N GPUs from N completely independent host threads without any mutual exclusion between them. With a completely decoupled problem that requires no communication between the GPUs in between iterations, one can achieve perfectly linear speedups with the number of GPUs. Problems that utilize a lot of PCI-e bandwidth to load/unload the device can see performance degradation with multiple threads writing to multiple GPUs, depending on the memory/PCI-e hierarchy of the host node.

Fermi is supposed to decrease the context switching overhead, and the launch overhead too, I think. I will have to benchmark it when my card arrives tomorrow. However, my original statement still stands. For a 1 ms kernel launch, the 10 us launch overhead is only 1%, and that percentage goes down as the computation time of the kernel increases, since the 10 us is a fixed cost. There are a lot of easier ways to boost performance by 1% than working with extremely hard to use, complex (if not impossible), performance-costly global synchronizations. Additionally, if you have many iterations you can just queue them all up with numerous async kernel launches in a row and amortize that 10 us overhead.
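
A small sketch of that “queue them all up” approach, assuming a hypothetical iterate_once() kernel; launches are asynchronous, so the per-launch overhead overlaps with the previous kernel’s execution:

[codebox]// Hypothetical sketch: amortizing launch overhead by queuing many
// asynchronous kernel launches back to back.
__global__ void iterate_once(float* data) { /* one iteration of work */ }

void run_iterations(float* d_data, int iterations)
{
    for (int it = 0; it < iterations; ++it)
        iterate_once<<<30, 256>>>(d_data);   // returns immediately, launches queue up

    cudaThreadSynchronize();                 // block until all queued kernels finish
}
[/codebox]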

How much of your overhead is in the cudaMemcpy()? That typically becomes the bottleneck when accelerating only a portion of a larger calculation.

peastman, andradx

I think I found a good and fast solution to your iteration problem.

__device__ unsigned int blocknum = 0;
__device__ unsigned int completedblockquantity = 0;

At the start of the block:

__shared__ unsigned int current_block_num;
__shared__ unsigned int currentiteration, blockindex;

if (threadIdx.x == 0)
{
    // grab a virtual block number in arrival order
    current_block_num = atomicAdd(&blocknum, 1);

    currentiteration = current_block_num / blocks_per_iteration;
    blockindex       = current_block_num % blocks_per_iteration;

    // wait until all blocks of the previous iterations have completed
    while (atomicAdd(&completedblockquantity, 0) < currentiteration * blocks_per_iteration)
        __threadfence();
}

__syncthreads();

data = datarray[blockindex * blocksize + threadIdx.x];

// ... and so on ...

At the end of the block:

__threadfence();   // make this block's results visible before signalling completion
if (threadIdx.x == 0)
    atomicAdd(&completedblockquantity, 1);

At kernel launch you need to size the grid according to (data size) * (iteration count), i.e. launch blocks_per_iteration * iterations blocks in total.
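
For example, on the host side it could look like this (a sketch; iterate_all stands for a kernel built from the fragments above, and data_size, d_datarray are placeholders):

[codebox]// Hypothetical launch: one block per data chunk per iteration,
// all queued in a single kernel call.
int blocksize            = 256;
int blocks_per_iteration = data_size / blocksize;   // one block per chunk of data
int iterations           = 10;

iterate_all<<<blocks_per_iteration * iterations, blocksize>>>(d_datarray, blocks_per_iteration);
cudaThreadSynchronize();
[/codebox]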

For a single device I get 2.9 ms of cudaMemcpy (~1 ms for cudaMemcpyHostToDevice and the rest for cudaMemcpyDeviceToHost), and the kernel time for one iteration is 3.2 ms. One iteration is nothing; my purpose was to increase the number of iterations to about 10 while maintaining a good throughput (with respect to the portion of the data I send to the device).

OpenMP itself is not the problem; mutual exclusion or not, the cudaMemcpy time on four devices increases to 9-15 ms per device, whether with pthreads or OpenMP.

Besides, as the kernel execution time is close to the cudaMemcpy time, if cudaMemcpy is executed in mutual exclusion there will always be another device launching its kernel for another iteration at about the same point in the timeline, so every other device becomes a nuisance to the one at hand.

Task-dividing my problem would create the need to synchronize the four devices every half iteration (each iteration runs two different algorithms; it is an LDPC decoder running the min-sum algorithm), and that overhead would kill performance. So the four devices were meant to increase the amount of data by a factor of four and to overlap processing time with cudaMemcpy, just like a stream-based approach would on a single device.
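
For reference, the stream-based single-device overlap mentioned above might look roughly like this (a sketch assuming page-locked host buffers and a hypothetical decode_kernel; chunk sizes, grid dimensions and names are placeholders):

[codebox]// Hypothetical sketch: overlapping transfers and computation with two streams
// on a single device. Requires page-locked (cudaMallocHost) host buffers.
__global__ void decode_kernel(float* d) { /* one decode pass over d */ }   // placeholder

void decode_in_streams(float* h_in, float* h_out, float* d_buf[2],
                       size_t chunk_bytes, int nchunks)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    size_t chunk_elems = chunk_bytes / sizeof(float);

    for (int i = 0; i < nchunks; ++i)
    {
        int s = i % 2;                       // alternate between the two streams
        cudaMemcpyAsync(d_buf[s], h_in + i * chunk_elems,
                        chunk_bytes, cudaMemcpyHostToDevice, stream[s]);
        decode_kernel<<<30, 256, 0, stream[s]>>>(d_buf[s]);   // placeholder grid
        cudaMemcpyAsync(h_out + i * chunk_elems, d_buf[s],
                        chunk_bytes, cudaMemcpyDeviceToHost, stream[s]);
    }

    cudaThreadSynchronize();                 // wait for both streams to drain
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}
[/codebox]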
