__syncblocks 101: Primitives for Inter-block Synchronization

[ __syncblocks 101: Updated for M Blocks, N Threads (no limits) ]

Further to the thread on inter-block communication:
http://forums.nvidia.com/index.php?showtop…28&#entry241028

Here are synchronization C primitives that work well on my GPU (8800 GTX). A regression test for 16 blocks, 8 threads (16B,8T) was run and completed without a single error.

The code has also been tried for (80B,16T); it seems to work, but no regression tests have been done yet (I will benchmark soon).

DEFINITIONS OF S AND G:

S: a global integer passed when launching the kernel from the CPU.
Initial value: -1.

G: any “useless” global memory passed when launching the kernel from the CPU, used for “dummy” scattering.
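
For reference, a minimal host-side setup sketch for S and G (it mirrors the test code later in this thread; the kernel name is just a placeholder):

int *S, *G;
int s = -1;
cudaMalloc((void**)&S, sizeof(int));                    // semaphore
cudaMemcpy(S, &s, sizeof(int), cudaMemcpyHostToDevice); // initial value -1
cudaMalloc((void**)&G, sizeof(int));                    // dummy scatter target
my_kernel<<<dimgrid, dimblock>>>(S, G /*, ... */);      // hypothetical kernel taking S and G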

/////////////////////////////cuSignal.h////////////////////////////////////////////////

// Spin until it is this block's turn; the dummy write to G stops the compiler
// from hoisting the read of *S out of the loop. (blockIdx.x is unsigned, so for
// block 0 the comparison wraps around and matches the initial *S of -1.)
#define WAIT(S,G) while((*S)!=blockIdx.x-1){G[0] = 2;}

// Thread 0 advances the semaphore to the next block, wrapping back to -1
// once the last block has signalled.
#define SIGNAL(S) if(threadIdx.x==0){if((*S)==gridDim.x-2){(*S) = -1; }else{(*S)++;} } __syncthreads();

///////////////////////////////////////////////////////////////////////////////////////

[1] A typical critical-region usage:

WAIT(S,G)

/* Critical section code here. Different blocks pass through this region
serially, which is vital for algorithms where data for the next kernel can be
collated on the GPU rather than shipped back to the CPU in large memory
transfers. */

SIGNAL(S)

[2] __syncblocks()

__device__ int iBarrier;

//Initialize barrier(s)
WAIT(S,G)

if(blockIdx.x==0 && threadIdx.x==0)
{
iBarrier = 0;
}

SIGNAL(S)

//Do work in parallel

WAIT(S,G)
iBarrier++; //Signal work is over
SIGNAL(S)

while(iBarrier<gridDim.x); //Spin until every block has signalled

Use these constructs carefully so as not to lose the spirit of data parallelism!

Curious if this code works on other hardware variants.

This code will not always work. It assumes that all blocks are running on the card at once, which is not the case.

Here is some sample code that breaks it:

The code fills a global array with the values 0 to BLOCKS-1, each block reading the entry written by the previous block. If the synchronization fails, the program prints a message, but it should not deadlock.

With BLOCKS set to anything higher than 12436, the code deadlocks.

// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>

// includes, project
#include <cutil.h>
#include <assert.h>

// Same primitives as above; the dummy write to G keeps the compiler re-reading *S
#define WAIT(S,G) while((*S)!=blockIdx.x-1){G[0] = 2;}
#define SIGNAL(S) if(threadIdx.x==0){if((*S)==gridDim.x-2){(*S) = -1; }else{(*S)++;} } __syncthreads();

#define BLOCKS 12436 // this does work
//#define BLOCKS 12437 // this does not work

__global__ void device_code(int *S, int *G, int *output)
{
    WAIT(S,G);
    if (blockIdx.x == 0) {
        output[blockIdx.x] = 0;
    } else {
        // Depends on the previous block having already written its entry
        output[blockIdx.x] = output[blockIdx.x-1] + 1;
    }
    SIGNAL(S);
}

int main()
{
    dim3 dimgrid(BLOCKS);
    dim3 dimblock(8);

    int *output;
    cudaMalloc((void**)&output, sizeof(int)*BLOCKS);

    int *S;
    int s = -1;
    cudaMalloc((void**)&S, sizeof(int));
    cudaMemcpy(S, &s, sizeof(int), cudaMemcpyHostToDevice);

    int *G;
    cudaMalloc((void**)&G, sizeof(int));

    int outputCPU[BLOCKS];

    printf("start\n");
    device_code<<<dimgrid,dimblock>>>(S, G, output);
    cudaThreadSynchronize();
    cudaMemcpy(outputCPU, output, BLOCKS*sizeof(int), cudaMemcpyDeviceToHost);
    printf("done with CUDA\n");

    int x;
    for (x = 0; x < BLOCKS; x++) {
        if (outputCPU[x] != x) printf("BAD!: %d = %d\n", x, outputCPU[x]);
    }
    printf("end\n");
    return 0;
}

Thanks Jeff…

It seems the primitives work comfortably for 16 blocks and 8 threads (a hardware restriction: with only 16 multiprocessors, one block runs per multiprocessor). I was under the impression this would work only for 16 blocks.

But this still does not explain why the primitives work for block counts all the way up to 12436.

Very puzzling indeed.

It is now possible to handle problems like fragmenting odd/even arrays without a round trip to the CPU (yes, it is fast too).

Comments from Nvidia hardware folks?

Neeraj

It also randomly breaks for 4 blocks, 256 threads (4B,256T).

Neeraj, very nice work. What I really like is the G[0]=2 inside the spin-wait to trick the compiler into rereading *S on each iteration. I’m working on a similar problem, and if you were to code it as:

#define WAIT(S,G) while((*S)!=blockIdx.x-1); // do-nothing

then the compiler creates an infinite loop that never re-reads *S.

Have you found anything else to put in the while-loop body to trick the compiler into re-reading *S? I’d like something more efficient than a global memory access if possible.

To trick the compiler into re-reading memory, one may declare S as volatile.
To trick the compiler into thinking a loop body is not empty, one can use if(threadIdx.x<0)__syncthreads();
For an inter-block sync, %physid may be more reliable.
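
For example, a volatile-qualified spin-wait needs no dummy write at all. A minimal sketch (WAIT_V is just an illustrative name):

#define WAIT_V(S) while(*((volatile int*)(S)) != blockIdx.x-1){} // the volatile cast forces a fresh read of *S on every iteration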

This has been discussed a number of times before. We don’t encourage such attempts, due to the high dependence of such code on hardware details. While it is technically possible to achieve some type of inter-block synchronization under very controlled conditions on specific hardware (via volatile global variables, spin-loops, etc.), those conditions change for different chips (for example, 16 multiprocessors on a G80, fewer on a G84). So, something that runs on a G80 can very well deadlock on a G84. Basically, it’s not very portable and a good way to shoot yourself in the foot.

Paulius

To me, inter-block synchronization is mostly a way to get around G80’s lack of atomic operations. It doesn’t have to work on any other chip.
Admittedly, it’s so slow and quirky that I’ve never used it up to now, but it may remain useful as long as G80 continues to be used.

[ __syncblocks(): Updated for N Threads, M Blocks (no limits?) ]

jpape,

The trick behind G[0] = 2 is to issue a global write.

However, this memory is allocated by the host on the GPU.

Technically you could substitute a message-passing structure here to create a global barrier.

Though most people would frown on this hack, it works for the following scenarios:

WAIT(S,G) is not used in loops (that hangs my machine; still working on it, though).

For parallel computations and serial writes.

For parallel computations and serial reads after writes.

Note that I have changed a few definitions:

#define CREATE_PID int PID=threadIdx.x + (blockIdx.x*blockDim.x); int PTOTAL = blockDim.x*gridDim.x

#define MASTER 0

#define MASTER_REGION if(PID==MASTER)

#define CREATE_SEMAPHORE(x) __device__ int x; MASTER_REGION{x=MASTER-1;}

//#define SIGNAL(S) if(threadIdx.x==0){ (*S)++; } __syncthreads();

#define CREATE_BARRIER(B) __device__ int B; MASTER_REGION{B=MASTER;}

#define WAIT_BARRIER(B) B++;do{}while(B!=PTOTAL);

//Critical here

#define WAIT(S,G) while((*S)!=blockIdx.x-1){G[0] = 2;}

#define SIGNAL(S) if(threadIdx.x==0){if((*S)==gridDim.x-2){(*S) = -1; }else{(*S)++;} } //__syncthreads();

//Note: the hardcoded blockIdx.x == 15 assumes a 16-block grid (the last block raises the flag)
#define __syncblocks(S,B,G)  WAIT(S,G);if(blockIdx.x == 15){if(threadIdx.x == 0){B[0] = 1;}}SIGNAL(S);WAIT(S,G);if(blockIdx.x==0){__syncthreads();while(B[0]==0){G[0] = 1;}if(threadIdx.x==0){B[0] = 1;}}SIGNAL(S);

Here is a trivial example of the implementation:

#define SERIES_THREADS 100
#define SERIES_BLOCKS 16

//Kernel demonstrating summation of a series
//using block synchronization
//Parameters=======================================
// S       : Semaphore variable initialized to -1
// G       : Dummy global memory for the spin-wait
// fSeries : Series whose sum is to be computed
// fAns    : Final summed-up answer (must be initialized to 0 by the host)
// iLen    : Length of the series
//==================================================
__global__ void cuAddSeries(int *S,char *G,float *fSeries,float *fAns,int iLen)
{
int iStart;        //Start offset of thread payload
int iEnd;          //End offset of thread payload
int iBlockPayload; //Number of elements computed by a block (MP)
int iThreadPayload;//Number of elements computed by a thread (TP)
int iE;            //Iterator

//Calculate payloads and offsets
//===============================================
iBlockPayload = iLen / gridDim.x;
iThreadPayload = iBlockPayload / blockDim.x;
iStart = iBlockPayload * blockIdx.x;
iStart += iThreadPayload * threadIdx.x;
iEnd = iStart + iThreadPayload;
//===============================================

__shared__ float fLocalSum[SERIES_THREADS]; //Reduced sums computed by threads
__shared__ float fBlockSum;                 //Reduced sum calculated by the block

fLocalSum[threadIdx.x] = 0.0f; //Initialize sum to 0

//Each thread sums its slice of the series
for(iE=iStart;iE<iEnd;iE++)
{
  fLocalSum[threadIdx.x] += fSeries[iE];
}

__syncthreads(); //Local barrier

//Master thread of the block sums the per-thread results
if(threadIdx.x==0)
{
  fBlockSum = 0.0f;
  for(iE=0;iE<SERIES_THREADS;iE++)
  {
    fBlockSum += fLocalSum[iE];
  }
}

WAIT(S,G); //Start critical section
//Master threads of each block write their sums one block at a time
if(threadIdx.x==0)
{
  (*fAns) = (*fAns) + fBlockSum;
}
SIGNAL(S); //End critical section
}
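
For completeness, a minimal host-side launch sketch for this kernel (variable names are illustrative; S starts at -1 and fAns at 0, as required above):

int *S; char *G; float *fSeries, *fAns;
int s = -1; float fZero = 0.0f;
int iLen = SERIES_BLOCKS * SERIES_THREADS * 4;  // any multiple of blocks*threads
cudaMalloc((void**)&S, sizeof(int));
cudaMemcpy(S, &s, sizeof(int), cudaMemcpyHostToDevice);          // semaphore = -1
cudaMalloc((void**)&G, sizeof(char));                            // dummy scatter target
cudaMalloc((void**)&fSeries, iLen * sizeof(float));              // ...then fill with the series data
cudaMalloc((void**)&fAns, sizeof(float));
cudaMemcpy(fAns, &fZero, sizeof(float), cudaMemcpyHostToDevice); // running sum = 0
cuAddSeries<<<SERIES_BLOCKS, SERIES_THREADS>>>(S, G, fSeries, fAns, iLen);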

Though usually one would use scan-driven techniques here, I observed a speed-up of over 30x compared to an E6600 overclocked to 2.5 GHz.

I would treat this as a brute-force tool rather than a strict method for working around general-purpose problems.

The hardware should have incorporated a state machine for implementing this, but I assume that does not fit the streaming programming model.

Cheers,

Neeraj

The code is not guaranteed to run correctly even with #threadblocks == #multiprocessors, since your “synchronization primitives”, updating the same memory address, are not atomic, i.e. they take multiple instructions. You might try using the atomic*() functions in order to achieve what you want in a reliable way, though atomics are a compute capability 1.1 feature only.

Once again, such a programming approach is not encouraged since, just as you mentioned, it contradicts the programming model.
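
For instance, the SIGNAL update could be made atomic with a compare-and-swap loop. A sketch only, requiring compute capability 1.1 (signal_atomic is an illustrative name):

__device__ void signal_atomic(int *S)
{
    if (threadIdx.x == 0) {
        int old = *S, assumed;
        do {
            assumed = old;
            // wrap back to -1 after the last block, otherwise advance by one
            int next = (assumed == (int)gridDim.x - 2) ? -1 : assumed + 1;
            old = atomicCAS(S, assumed, next);
        } while (old != assumed); // retry if another thread raced us
    }
    __syncthreads();
}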

Let us for a minute assume the spin loop works on all hardware variants of the G80.

Now, notice the loop in the definition of WAIT(S,G):

while( (*S) != blockIdx.x-1 )
{
  G[0] = 2;
}

The value I am updating is a dummy write; the value of variable G is of no consequence. I am assuming this introduces a latency of 300-400 clock cycles within the loop (correct me if the compiler is optimizing it away).

So even though it is not an atomic write, I safely assume that only one block will enter the critical region, provided I introduce the fixed-time latency.

When leaving the critical region, only one multiprocessor updates the value of the semaphore S (ironically, an atomic operation).

Again, what I am really looking for is not atomic operations, but hardware support for inter-processor synchronization. Adding this in future hardware variants would save a lot of round trips to the CPU for critical real-time kernels.

volatile was exactly what I needed for my purposes. Thanks.

Hi,

I’m trying to sync data between blocks per iteration of time (in a for loop) for many particles (> 10000) (MD calculations).
I have achieved thread-level parallelism, but inter-block communication is quite troublesome. After each loop, all locations of the molecules must be updated, and all processors must read the data back for the second loop.

Using locks is the best way to achieve this. I have a compute capability 1.1 GPU.

Is it technically feasible to perform atomicAdds on a semaphore variable to create a barrier? I tried a simple snippet of code but it deadlocks. I read in this thread that the variable has to be volatile?

code snippet:
atomicAdd(g_count, 1);
while ((int)g_count[0] < 8) { /* nop */ }
if (g_count[0] != 0) { g_count[0] = 0; }
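
For what it’s worth, a single-use barrier along those lines might look like the sketch below. It assumes every block is resident on the chip at once (which, as the next post explains, is the crux of the problem), and the volatile cast is what forces the re-read; names are illustrative:

__device__ int g_count = 0; // must start at 0; single-use, since resetting it would race

__device__ void block_barrier(void)
{
    __syncthreads();                // make sure the whole block has arrived
    if (threadIdx.x == 0) {
        atomicAdd(&g_count, 1);     // announce this block's arrival
        while (*((volatile int *)&g_count) < gridDim.x)
            ;                       // spin until every block has arrived
    }
    __syncthreads();                // release the rest of the block
}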

Syncing between blocks is quite troublesome, even with 1.1 hardware.

Let’s say you have a program with more than 16 blocks. As far as I know, no 1.0 (or 1.1) hardware can run more than 16 blocks simultaneously, so these blocks must be scheduled. The block scheduler then chooses arbitrary blocks to run on the multiprocessors. These blocks run in batches (i.e. they start running and keep running until they finish). This leads to horrible potential for deadlock.

Let me give an example.

Say we have a 2-multiprocessor Nvidia board and a program with 3 blocks, where block 1 has to finish first. If the scheduler picks blocks 2 and 3 to put on the multiprocessors, the program will deadlock even with a perfect semaphore implementation.

If you expand the example to larger numbers of blocks/dependencies, you can see the problem.

What we need in order to synchronize many blocks is more than just atomic functions; we need something similar to ‘yield’ in Unix: something that can take a block off a multiprocessor and save all its state to be re-run later. As we have no way to do that, we cannot write programs that run correctly with arbitrary block dependencies when there are more blocks than can run on the hardware at a time.

Well in any case, does Nvidia have a stance on this issue?

I wonder if quitting the kernel and then restarting it is good enough to synchronize global memory. If so, I can call the kernel multiple times to achieve what I want. The problem is: when the kernel quits, does memory on the card automatically become deallocated? If not, I don’t see why you cannot just keep calling kernels until the data is at the point you want.

Although there is most likely overhead from the constant re-execution, at least it will get the job done.

If not, then I am either stuck within the limit of one block or must copy all the data out and then back in again.

Global memory does not become deallocated unless you explicitly do it, or your entire program exits. Shared memory contents are not persistent, though.

Paulius from Nvidia has already weighed in on this issue.

If you need to force an execution order of blocks, then issuing multiple kernels will do the job. Just be aware that there is significant startup overhead (depending on your problem) in doing this. You need to make sure that each stage does enough work to make up for the overhead.
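
As a sketch of that pattern (step_kernel is a hypothetical per-iteration kernel; the kernel-launch boundary itself is the global synchronization point):

for (int step = 0; step < nSteps; step++) {
    // Each launch sees the global memory written by the previous launch,
    // so the launch boundary acts as an inter-block barrier.
    step_kernel<<<dimgrid, dimblock>>>(d_positions, d_forces);
}
cudaThreadSynchronize(); // wait for the last step before copying results back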