Inter-process communication problem. Stuck!

Hi people,
Ok here goes

I am trying to write a message-passing mechanism using CUDA.

I started out by writing a classic semaphore, where I wanted to serialize each thread's writes to global memory.

//Global member
__device__ int Semaphore;

__global__ void Kernel(void)
{
    //Assign a unique ID to each processing element
    int PID = threadIdx.x + (blockIdx.x * blockDim.x);

    //Master region, executed by only the master thread
    if (PID == 0)
        Semaphore = -1; //Initialize semaphore

    //Wait to enter the critical region
    while (Semaphore != PID - 1); //WAIT(S)

    //Critical region here (only 1 thread enters at a time)

    Semaphore++; //SIGNAL(S)
    //The next thread can now enter the critical region
}


The launch configuration was :

#define BLOCKS 16
#define THREADS 100

dim3 DimGrid,DimBlock;
DimGrid.x = BLOCKS;
DimBlock.x = THREADS;

//Launch the grid
Kernel<<<DimGrid, DimBlock>>>();

When I executed the code, the machine hung for precisely 4.2 seconds (I was running a background timer). Control returned, but with an exception fired:
“Microsoft C++ exception: cudaError_enum at memory location 0x0013fea4”

Guys, what do you think? There is also no hardware support for a TEST(var) instruction, and no control over the thread scheduler either. What do I do? The problem I am trying to solve requires critical-section handling.

Obviously I am doing something wrong… :blink:

P.S.: I am running this on a GTX version.

[EDIT] Problem solved! I will post the solution in a new thread. :)

Due to the SIMD nature of GPUs, you cannot assume that addition instructions are atomic.

while(Semaphore!=PID-1); //WAIT(S)

//Critical Region Here (only 1 thread enters here at a time )

Semaphore++; //SIGNAL(S)

//The next thread can now enter into the critical region

This code will work (most of the time) on a CPU, as the scheduler rarely schedules a different thread between the while() and the increment. In the cases where it does, it is rarer still that the other thread is one executing the same while(). In the few cases where that does happen, you get deadlock. Atomic operations (like an atomic test-and-set on the CPU) are required for a real semaphore implementation.

On the GPU, however, this does not work so well.

When you reach the while loop, at least one SIMD group (warp) will execute it at once: at least 8 processing elements examine the semaphore variable in the same clock cycle. All of them either pass or loop together, and all 8 then increment the semaphore at the same time, which for hardware reasons leaves an indeterminate value in that variable.
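To make the failure concrete, here is a minimal sketch of the kind of non-atomic update being described; the kernel and variable names are mine, not from the original post.

```cuda
// A plain read-modify-write on a global variable.  Threads in the
// same warp read the old value of Counter in the same cycle, so most
// of their increments overwrite each other and are lost.
__device__ int Counter;

__global__ void RacyIncrement(void)
{
    Counter = Counter + 1;   // NOT atomic on compute capability 1.0
}

// Launched as RacyIncrement<<<16, 100>>>(), Counter will almost never
// end up at 1600; the final value is indeterminate.
```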

In the simulator you will not see this problem, as it has a warp size of 1.

For the GPU, no serialization method that I am aware of, other than __syncthreads(), exists on compute capability 1.0 hardware. The 8600 is compute capability 1.1 and has atomic operations, so a semaphore could be implemented there, although the 8600 is quite a bit slower than the 8800.

My suggestion is to try re-mapping your problem into thread/block groups so that synchronization is only required within a block, then use __syncthreads(). Other than that, I would wait for 1.1 hardware.
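As a sketch of that remapping (the kernel name and the fixed block size are my own assumptions), each block can combine its threads' results in shared memory and let a single thread perform the one write that needed serializing:

```cuda
#define THREADS_PER_BLOCK 100   // assumed; must match the launch config

__global__ void BlockLocalCombine(int *blockResults)
{
    __shared__ int partial[THREADS_PER_BLOCK];

    // Each thread deposits its contribution in shared memory.
    partial[threadIdx.x] = threadIdx.x;   // stand-in for real work
    __syncthreads();                      // barrier: the whole block is done

    // One thread per block performs the formerly "critical" write,
    // so no inter-thread locking is needed at all.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        blockResults[blockIdx.x] = sum;   // exactly one write per block
    }
}
```

The per-block results can then be collated on the host, or by a second kernel.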

Thanks Jeff, I appreciate it.

Can you confirm that no atomic operations are possible for inter-block (global) synchronization? Is this a hardware limitation or a “no support by the compiler” issue?

If it were me, I would have provided support for at least one atomic operation; a “TEST_AND_SET(VAR)”-equivalent instruction would have solved all of these problems.

Now I have to use the CPU to collate all the data written by the blocks, which seems silly given the bandwidth limitations of host-to-device memory transfers.

Could I solve this problem if every block waits for a unique amount of time and then polls the global variable?


The G86 (the 8600 series, for example) currently supports atomic operations. Check the programming guide. I believe future generations will improve on this.
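On hardware with those atomics (compute capability 1.1 and up), a global lock can be sketched with atomicCAS/atomicExch from the programming guide. The names here are my own, and only one thread per block contends for the lock; an entire warp spinning on a lock held by one of its own threads can hang.

```cuda
__device__ int Lock = 0;          // 0 = free, 1 = held

__global__ void SerializedWrite(int *global_counter)
{
    if (threadIdx.x == 0) {       // one contender per block
        while (atomicCAS(&Lock, 0, 1) != 0)
            ;                     // spin until we swap 0 -> 1

        *global_counter += 1;     // critical region

        atomicExch(&Lock, 0);     // release the lock
    }
}
```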

Do note that, depending on your application, even atomic operations won’t save you. Imagine that you have blocks 2-100 waiting on block 1 to set a condition, and then block 1 waits until all of blocks 2-100 are done before continuing to the next phase of the computation. Let’s say that in this case only 16 blocks can run concurrently on the device. Your program then deadlocks, because no block after the 16th can run until one of the first 16 exits. The only sure way to communicate between blocks is to break the computation into phases and run each phase in a separate kernel.
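The phase-splitting approach looks like this in outline (the kernel names are made up); the launch boundary acts as the global barrier, since on a given stream the second kernel cannot start until every block of the first has retired:

```cuda
__global__ void Phase1(float *data) { /* each block writes its share */ }
__global__ void Phase2(float *data) { /* can safely read all of phase 1 */ }

void RunBothPhases(float *d_data, int blocks, int threads)
{
    Phase1<<<blocks, threads>>>(d_data);
    // No explicit synchronization is needed between the launches:
    // the device executes them in order.
    Phase2<<<blocks, threads>>>(d_data);
}
```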

This has been discussed in detail in a previous thread.

In short, I will add that if your application needs a large amount of synchronization between blocks while it runs, then it is not well adapted to the GPU. GPUs are all about data-parallel algorithms where you can run the same code on more than 10,000 data elements at the same time, independently. They get their massive performance boost by running these calculations in an interleaved manner to hide memory latencies. Adding synchronization between blocks will force your application to run practically serially on the device, and it will probably end up slower than a single-CPU implementation.

Thanks folks for the replies!

I did manage to design a semaphore which works very reliably for 16 blocks of 8 threads (16B, 8T), without using atomic operations!
It also seems to work at (80B, 16T), but I will not vouch for its reliability there. I will post this code in an entirely new thread.

It is possible to implement:
[A] __syncblocks() (a global barrier)
[B] Master regions (accessible by only one block, while the other blocks wait on spinlocks)
[C] Critical constructs (accessible by a single block at a time)

This will hopefully avoid round trips to the CPU for collation.
I will confirm the results of avoiding those round trips, along with time benchmarks! :)
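For reference, the usual atomic-based sketch of [A] looks like the following. This is not the poster's non-atomic scheme (which is not shown in the thread), it needs compute capability 1.1 atomics, and it is only safe when every block is resident on the device at once; otherwise it deadlocks exactly as described earlier in the thread.

```cuda
__device__ int Arrived = 0;       // count of blocks at the barrier

__device__ void syncblocks(int numBlocks)   // single-use barrier
{
    __syncthreads();              // settle this block first
    if (threadIdx.x == 0) {
        atomicAdd(&Arrived, 1);   // announce this block's arrival
        // volatile read so the spin actually re-reads global memory
        while (*(volatile int *)&Arrived < numBlocks)
            ;
    }
    __syncthreads();              // release the block's other threads
}
```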

Sorry to say, but this still isn’t useful…

  1. My test shows ~256T is required for block-wise sync to be worthwhile. Otherwise, it is either beaten by a trip back to the CPU (large data set), or beaten by two memcpys and an entirely-CPU implementation (small data set).
  2. I once wrote a blocksync that passed a simple test, but when I tried it in a real problem it failed, and I can’t find out why. According to the number of registers in the cubin (<10), the 8B 256T configuration SHOULD have been launched simultaneously. To make such a thing useful, you HAVE TO test it on a real problem (for example, scan).

If your solution works, I’d appreciate it a lot. I have a 2-pass, small-block-count scan handy (now 2x~3x faster than the one in the SDK for large data, and it uses O(1) extra space instead of O(n)). If you could make it one pass, it might become useful (that is, beat CPU+memcpy) for the smaller datasets I usually deal with.

If your algorithm permits, try increasing the number of blocks while keeping the number of threads to a maximum of 8; you might still get real gains in running time.

I am still working on another version.



Well… 8 threads is not a good idea. After all, 32B 8T + __syncblocks() is no match for 1B 256T + __syncthreads().
Have you tried it on any real problem yet? I’d like to see a benchmark of some actual algorithm implemented using 8 threads.
The GPU does seem to need more threads to hide latency.

Anything below 16 threads per block should make no difference, and I’m pretty sure that’s true for 32 as well. The way the GPU works is pretty much like ??? but on a much larger scale, so every clock all 16 threads do exactly the same thing (and then the next 16 do the same thing, since they run at twice the speed). For that reason __syncthreads() has no meaning when running with fewer than 32 threads; only with more than 32 threads does it mean anything.

I also need to do some collapsing in my implementation, and would love to have a block-level sync, or an option to leave data in shared memory between kernels.

Each SIMD group on the video card has 8 processing elements. We have seen some algorithms that show significant speedups by running exactly 8 threads per block. Other algorithms work best with as many threads per block as the algorithm can handle.

I would recommend playing around with the number of threads vs blocks in order to find the “sweet spot” for the algorithm you are working on.
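A host-side sweep along those lines might look like this, timing each configuration with CUDA events; MyKernel, N, and d_data are placeholders, not names from this thread.

```cuda
#include <cstdio>

// Placeholder for the kernel whose sweet spot is being measured.
__global__ void MyKernel(float *data) { /* real work goes here */ }

void FindSweetSpot(float *d_data, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int threads = 8; threads <= 256; threads *= 2) {
        int blocks = (N + threads - 1) / threads;  // cover all N elements

        cudaEventRecord(start, 0);
        MyKernel<<<blocks, threads>>>(d_data);     // configuration under test
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                // wait for the kernel

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```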

Well… exactly what are they?