[ __syncblocks() Updated for N THREADS M BLOCKS (no limits?) ]
jpape,
The trick behind G[0] = 2 is to issue a global write.
However this memory is allocated by the host on the GPU.
Technically you could substitute a message passing structure here to create a global barrier.
Though most people would differ on this hack it works for the following scenarios
WAIT(S,G) is not used in loops (Hangs up my machine,still working on it though)
For Parallel computations and serial writes.
For parallel computations and serial reads after writes.
Note that i have changed a few definitions
#define CREATE_PID int PID=threadIdx.x + (blockIdx.x*blockDim.x); int PTOTAL = blockDim.x*gridDim.x
#define MASTER 0
#define MASTER_REGION if(PID==MASTER)
#define CREATE_SEMAPHORE(x) __device__ int x; MASTER_REGION{x=MASTER-1;}
//#define SIGNAL(S) if(threadIdx.x==0){ (*S)++; } __syncthreads();
#define CREATE_BARRIER(B) __device__ int B; MASTER_REGION{B=MASTER;}
#define WAIT_BARRIER(B) B++;do{}while(B!=PTOTAL);
//Critical Here
#define WAIT(S,G) while((*S)!=blockIdx.x-1){G[0] = 2;}
#define SIGNAL(S) if(threadIdx.x==0){if((*S)==gridDim.x-2){(*S) = -1; }else{(*S)++;} } //__syncthreads();
#define __syncblocks(S,B,G) WAIT(S,G);if(blockIdx.x == 15){if(threadIdx.x == 0){B[0] = 1;}}SIGNAL(S,G);WAIT(S,G);if(blockIdx.x==0){__syncthreads();while(B[0]==0){G[0] = 1;}if(threadIdx.x==0){B[0] = 1;}}SIGNAL(S,G);
Here is a trivial example for the implementation:
#define SERIES_THREADS 100
#define SERIES_BLOCKS 16
//Kernel for demonstrating adding of series
//using block syncronization
//Parameters=======================================
// S : Semaphore Variable initialized to -1
// fSeries : Series on which addition is to be computed
// fAns : Final Summed up answer
// iLen : Length of series
//==================================================
__global__ void cuAddSeries(int *S,char *G,float *fSeries,float *fAns,int iLen)
{
int iStart; //Start offset of thread payload
int iEnd; //end offset of thread payload
int iBlockPayload;//Number of Elements computed by a block (MP)
int iThreadPayload;//Number of elements computed by a thread (TP)
int iE;//Itterator
//Calculate Payloads and offsets
//===============================================
iBlockPayload = iLen / gridDim.x;
iThreadPayload =iBlockPayload / blockDim.x;
iStart = iBlockPayload * blockIdx.x;
iStart+= iThreadPayload * threadIdx.x;
iEnd = iStart + iThreadPayload;
//===============================================
__shared__ float fLocalSum[SERIES_THREADS]; //Reduced Sums computed by threads
__shared__ float x[SERIES_THREADS];
__shared__ float fBlockSum; //Reduced Sum calculated by block
fLocalSum[threadIdx.x] = 0.0f; //Initialize sum to 0
//For all SIMD units (threads)
for(iE=iStart;iE<iEnd;iE++)
{
fLocalSum[threadIdx.x]+=fSeries[iE];
}
__syncthreads(); //Local Barrier
//Sum up by master thread only
if(threadIdx.x==0)
{
fBlockSum = 0.0f;
for(iE=0;iE<SERIES_THREADS;iE++)
{
fBlockSum+=fLocalSum[iE]; //Block will sum up
}
}
WAIT(S,G);//Start critical section
//Master Threads of each block will write sums at a time
if(threadIdx.x==0)
{
(*fAns) = (*fAns) + fBlockSum;
}
SIGNAL(S);//End Critical region
}
Though usually one would use scan driven techniques here, i observed a scale up of over 30 times compared to E6600 over clocked to 2.5 Ghz.
I would treat this as a brute tool rather than a strict method to work around general purpose problems.
The hardware should have incorporated a state machine for implementing this but i assume this does not fit in the streaming programming model.
Cheers,
Neeraj