Inter-process communication problem. Stuck!

Hi people,
Ok here goes

I am trying to write a message-passing mechanism using CUDA.

I started out by writing a classic semaphore, where I wanted to serialize each thread's writes to global memory.

//Global member
__device__ int Semaphore;

__global__ void Kernel(void)
{
    //Assign a unique ID to each processing element
    int PID = threadIdx.x + (blockIdx.x * blockDim.x);

    //Master region, executed by only the master thread
    if (PID == 0)
        Semaphore = -1; //Initialize semaphore

    //Wait to enter the critical region
    while (Semaphore != PID - 1); //WAIT(S)

    //Critical region here (only 1 thread enters at a time)

    Semaphore++; //SIGNAL(S)
    //The next thread can now enter the critical region
}


The launch configuration was :

#define BLOCKS 16
#define THREADS 100

dim3 DimGrid,DimBlock;
DimGrid.x = BLOCKS;
DimBlock.x = THREADS;

//Launch the grid
Kernel<<<DimGrid, DimBlock>>>();

When I executed the code, the machine hung for precisely 4.2 seconds (I was running a background timer). Control returned, but with an exception fired:
“Microsoft C++ exception: cudaError_enum at memory location 0x0013fea4”

Guys, what do you think? There is also no hardware support for a TEST(var) instruction, and no control over the thread scheduler either. What do I do? The problem I am trying to solve requires critical-section handling.

Obviously I am doing something wrong… :blink:

P.S.: I am running this on a GTX version.

[EDIT] Problem solved! I will post the solution in a new thread. :)

Due to the SIMD nature of GPUs, you cannot assume that addition instructions are atomic.

while(Semaphore!=PID-1); //WAIT(S)

//Critical Region Here (only 1 thread enters here at a time )

Semaphore++; //SIGNAL(S)

//The next thread can now enter into the critical region

This code will work (most of the time) on a CPU, as the scheduler rarely schedules a different thread between the while() and the increment. In the cases where it does, it is rarer still that the other thread is one executing the same while(). In the few cases where that does happen, you get deadlock. Atomic operations (like an atomic test-and-set on the CPU) are required for a real semaphore implementation.

On the GPU, however, this does not work so well.

When you reach the while loop, at least one SIMD group (warp) will execute it at once: at least 8 processing elements examine the semaphore variable in the same clock cycle. All of them either pass or loop together, and all 8 then increment the semaphore at the same time, which for hardware reasons leaves an indeterminate value in that variable.
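To make the failure concrete, here is a minimal sketch of the kind of non-atomic update being described; the kernel and variable names are mine, not from the original post.

```cuda
// A plain read-modify-write on a global variable.  Threads in the
// same warp read the old value of Counter in the same cycle, so most
// of their increments overwrite each other and are lost.
__device__ int Counter;

__global__ void RacyIncrement(void)
{
    Counter = Counter + 1;   // NOT atomic on compute capability 1.0
}

// Launched as RacyIncrement<<<16, 100>>>(), Counter will almost never
// end up at 1600; the final value is indeterminate.
```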

In the simulator you will not see this problem, as it has a warp size of 1.

For the GPU, no serialization method that I am aware of, other than __syncthreads(), exists on compute capability 1.0 hardware. The 8600 is compute capability 1.1 and has atomic operations, so a semaphore could be implemented there, although the 8600 is quite a bit slower than the 8800.

My suggestion is to try re-mapping your problem into thread/block groups so that synchronization is only required within a block, then use __syncthreads(). Other than that, I would wait for 1.1 hardware.
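As a sketch of that remapping (the kernel name and the fixed block size are my own assumptions), each block can combine its threads' results in shared memory and let a single thread perform the one write that needed serializing:

```cuda
#define THREADS_PER_BLOCK 100   // assumed; must match the launch config

__global__ void BlockLocalCombine(int *blockResults)
{
    __shared__ int partial[THREADS_PER_BLOCK];

    // Each thread deposits its contribution in shared memory.
    partial[threadIdx.x] = threadIdx.x;   // stand-in for real work
    __syncthreads();                      // barrier: the whole block is done

    // One thread per block performs the formerly "critical" write,
    // so no inter-thread locking is needed at all.
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < blockDim.x; ++i)
            sum += partial[i];
        blockResults[blockIdx.x] = sum;   // exactly one write per block
    }
}
```

The per-block results can then be collated on the host, or by a second kernel.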

Thanks Jeff, I appreciate it.

Can you confirm that no atomic operations are possible for inter-block (global) synchronization? Is this a hardware limitation or a “no support by the compiler” issue?

If it were me, I would have provided support for at least one atomic operation; a “TEST_AND_SET(VAR)”-equivalent instruction would have solved all of these problems.

Now I have to use the CPU to collate all the data written by the blocks, which seems silly given the bandwidth limitations of host-to-device memory transfers.

Could I solve this problem if every block waits for a unique amount of time and then polls the global variable?


The G86 (the 8600 series, for example) currently supports atomic operations. Check the programming guide. I believe future generations will improve on this.
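On hardware with those atomics (compute capability 1.1 and up), a global lock can be sketched with atomicCAS/atomicExch from the programming guide. The names here are my own, and only one thread per block contends for the lock; an entire warp spinning on a lock held by one of its own threads can hang.

```cuda
__device__ int Lock = 0;          // 0 = free, 1 = held

__global__ void SerializedWrite(int *global_counter)
{
    if (threadIdx.x == 0) {       // one contender per block
        while (atomicCAS(&Lock, 0, 1) != 0)
            ;                     // spin until we swap 0 -> 1

        *global_counter += 1;     // critical region

        atomicExch(&Lock, 0);     // release the lock
    }
}
```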

Do note that, depending on your application, even atomic operations won’t save you. Imagine that you have blocks 2-100 waiting on block 1 to set a condition, and then block 1 waits until all of blocks 2-100 are done before continuing to the next phase of the computation. Let’s say that in this case only 16 blocks can run concurrently on the device. Your program then deadlocks, because no block after the 16th can run until one of the first 16 exits. The only sure way to communicate between blocks is to break the computation into phases and run each phase in a separate kernel.
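The phase-splitting approach looks like this in outline (the kernel names are made up); the launch boundary acts as the global barrier, since on a given stream the second kernel cannot start until every block of the first has retired:

```cuda
__global__ void Phase1(float *data) { /* each block writes its share */ }
__global__ void Phase2(float *data) { /* can safely read all of phase 1 */ }

void RunBothPhases(float *d_data, int blocks, int threads)
{
    Phase1<<<blocks, threads>>>(d_data);
    // No explicit synchronization is needed between the launches:
    // the device executes them in order.
    Phase2<<<blocks, threads>>>(d_data);
}
```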

This has been discussed in detail in a previous thread.

In short, I will add that if your application needs a large amount of synchronization between blocks while it runs, then it is not well adapted to the GPU. GPUs are all about data-parallel algorithms where you can run the same code on more than 10,000 data elements at the same time, independently. They get their massive performance boost by running these calculations in an interleaved manner to hide memory latencies. Adding synchronization between blocks will force your application to run practically serially on the device, and it will probably end up slower than a single-CPU implementation.

Thanks folks for the replies!

I did manage to design a semaphore which works very reliably for 16 blocks of 8 threads (16B, 8T), without using atomic operations!
It also seems to work at (80B, 16T), but I will not vouch for its reliability there. I will post this code in an entirely new thread.

It is possible to implement:
[A] __syncblocks() (a global barrier)
[B] Master regions (accessible by only one block, while the other blocks wait on spinlocks)
[C] Critical constructs (accessible by a single block at a time)

This will hopefully avoid round trips to the CPU for collation.
I will confirm the results of avoiding those round trips, along with time benchmarks! :)
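For reference, the usual atomic-based sketch of [A] looks like the following. This is not the poster's non-atomic scheme (which is not shown in the thread), it needs compute capability 1.1 atomics, and it is only safe when every block is resident on the device at once; otherwise it deadlocks exactly as described earlier in the thread.

```cuda
__device__ int Arrived = 0;       // count of blocks at the barrier

__device__ void syncblocks(int numBlocks)   // single-use barrier
{
    __syncthreads();              // settle this block first
    if (threadIdx.x == 0) {
        atomicAdd(&Arrived, 1);   // announce this block's arrival
        // volatile read so the spin actually re-reads global memory
        while (*(volatile int *)&Arrived < numBlocks)
            ;
    }
    __syncthreads();              // release the block's other threads
}
```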

Sorry to say, but this still isn’t useful…

  1. My test shows ~256T is required for block-wise sync to be worthwhile. Otherwise, it is either beaten by a trip back to the CPU (large data set), or beaten by two memcpys and an entirely-CPU implementation (small data set).
  2. I once wrote a blocksync that passed a simple test, but when I tried it in a real problem it failed, and I can’t find out why. According to the number of registers in the cubin (<10), the 8B 256T configuration SHOULD have been launched simultaneously. To make such a thing useful, you HAVE TO test it on a real problem (for example, scan).

If your solution works, I’d appreciate it a lot. I have a 2-pass, small-block-count scan handy (now 2x~3x faster than the one in the SDK for large data, and it uses O(1) extra space instead of O(n)). If you could make it one pass, it might become useful (that is, beat CPU+memcpy) for the smaller datasets I usually deal with.

If your algorithm permits, try increasing the number of blocks while keeping the number of threads to a maximum of 8; you might still get real gains in running time.

I am still working on another version.



Well… 8 threads is not a good idea. After all, 32B 8T + __syncblocks() is no match for 1B 256T + __syncthreads().
Have you tried it on any real problem yet? I’d like to see a benchmark of some actual algorithm implemented using 8 threads.
The GPU does seem to need more threads to hide latency.

Anything below 16 threads per block should make no difference, and I’m pretty sure that’s true for 32 as well. The way the GPU works is pretty much like ??? but on a much larger scale, so every clock all 16 threads do exactly the same thing (and then the next 16 do the same thing, since they run at twice the speed). For that reason __syncthreads() has no meaning when running with fewer than 32 threads; only with more than 32 threads does it mean anything.

I also need to do some collapsing in my implementation, and would love to have a block-level sync, or an option to leave data in shared memory between kernels.

Each SIMD group on the video card has 8 processing elements. We have seen some algorithms that show significant speedups by running exactly 8 threads per block. Other algorithms work best with as many threads per block as the algorithm can handle.

I would recommend playing around with the number of threads vs blocks in order to find the “sweet spot” for the algorithm you are working on.
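A host-side sweep along those lines might look like this, timing each configuration with CUDA events; MyKernel, N, and d_data are placeholders, not names from this thread.

```cuda
#include <cstdio>

// Placeholder for the kernel whose sweet spot is being measured.
__global__ void MyKernel(float *data) { /* real work goes here */ }

void FindSweetSpot(float *d_data, int N)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int threads = 8; threads <= 256; threads *= 2) {
        int blocks = (N + threads - 1) / threads;  // cover all N elements

        cudaEventRecord(start, 0);
        MyKernel<<<blocks, threads>>>(d_data);     // configuration under test
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);                // wait for the kernel

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("%3d threads/block: %f ms\n", threads, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}
```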

Well… exactly what are they?