Question regarding CUDA streams

Hi,

I have code in which threads from different blocks can sometimes write to the same memory location. To be safe, I want to synchronize the threads across all blocks. When I do this by terminating the kernel and relaunching it, the output is correct, but there is a lot of overhead. I thought stream programming could be a solution, but given my application it would only help if kernels belonging to different streams could perform different operations. That does not appear to be possible (as far as I can see, different streams would operate on the same global device kernel). Is there a way I could do this? Some sort of switch or other provision?
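
Roughly, what I would like to write is something like the following (kernelA, kernelB, and the buffers are just placeholders for my two different operations):

   #include <cuda_runtime.h>

   // Placeholder kernels standing in for the two different operations.
   __global__ void kernelA(float *buf) { buf[threadIdx.x] += 1.0f; }
   __global__ void kernelB(float *buf) { buf[threadIdx.x] *= 2.0f; }

   void launchBoth(float *devBufA, float *devBufB)
   {
       cudaStream_t s0, s1;
       cudaStreamCreate(&s0);
       cudaStreamCreate(&s1);

       // Two different kernels, launched into two different streams.
       kernelA<<<1, 256, 0, s0>>>(devBufA);
       kernelB<<<1, 256, 0, s1>>>(devBufB);

       cudaStreamSynchronize(s0);
       cudaStreamSynchronize(s1);
       cudaStreamDestroy(s0);
       cudaStreamDestroy(s1);
   }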

Thanks & regards,

Aditi

I tried one more thing to synchronize/coordinate the threads from different blocks: using atomicAdd to have the zeroth thread of every block increment a counter by 1. A counter value equal to the number of blocks would mean that all blocks have crossed the "barrier" at that step. But this did not help. With the check if (counter == number_of_blocks), the steps inside the if statement never execute, although the counter does increase in order. Can someone suggest a technique I can use to synchronize the different blocks?
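
In code, the attempt looks roughly like this (the kernel name and launch details are illustrative, not from my actual code):

   __device__ unsigned int counter = 0;   // zeroed before the kernel launch

   __global__ void barrierAttempt(unsigned int numBlocks)
   {
       // ... step 1: each block does its part of the work ...

       if (threadIdx.x == 0)
           atomicAdd(&counter, 1);        // block announces it finished step 1

       // The check that never fires: most threads read the counter before
       // the last block has incremented it, and nothing makes them wait
       // and re-read it.
       if (counter == numBlocks) {
           // ... step 2: never executed for most (or all) blocks ...
       }
   }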

Thanks & regards,

Aditi

Please note that there is no shared memory involved right now. It's only global memory that I am dealing with for now, which the various blocks are trying to access.

Use atomicCAS() to implement spin-locking in your kernel.

Your atomicAdd() logic won't work because of the way the hardware schedules blocks. Blocks are scheduled in batches, one batch after another as the previous one completes. If one set of blocks stalls waiting for blocks in another set, you have a deadlock. Note that this block-scheduling behavior is UNDOCUMENTED; a few people here experimented and nailed it down, and it could change in future hardware releases. Don't build your design on this logic.

So, instead begin like this:

  1. Have this lock as a 32-bit or 64-bit value initialized to 0.

  2. Whenever a thread wants to take the lock, use this:

   while (atomicCAS(&gmem, 0, 0xFF) != 0)
       ;                     // spin until the CAS sees 0 and installs 0xFF

   -- critical section --

   __threadfence();          // make the critical-section writes visible to other blocks
   atomicExch(&gmem, 0);     // release the lock atomically, not with a plain store

Atomic operations are PERFORMANCE KILLERS. Be aware of it.
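
Putting the pieces together, a sketch of the whole pattern looks like this (the kernel and variable names are illustrative; one thread per block takes the lock so that a whole warp is not spinning against itself):

   __device__ unsigned int gmem = 0;      // 0 = free, 0xFF = taken

   __global__ void lockedUpdate(float *shared_location)
   {
       if (threadIdx.x == 0) {
           // Spin until the CAS sees 0 and installs 0xFF.
           while (atomicCAS(&gmem, 0, 0xFF) != 0)
               ;
       }
       __syncthreads();                   // whole block now holds the lock

       // -- critical section: touch the contended global location --
       if (threadIdx.x == 0)
           *shared_location += 1.0f;

       __syncthreads();
       if (threadIdx.x == 0) {
           __threadfence();               // flush the writes to other blocks
           atomicExch(&gmem, 0);          // release atomically
       }
   }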

Hi,

Thanks so much for your reply. It makes a lot of sense, and I have coded the mutex + global synchronization barrier based on your suggestion. But I am still facing a problem, most likely a deadlock. Please see the attached code. It hangs at the first __syncthreads() in the kernel (line 109, with the comment /* emulation mode hangs here */). Without the mutex + barrier, the code works normally in execution mode but gets stuck in emulation mode. With the mutex + barrier, the code runs to completion in execution mode (it does not hang) but leaves "filtimage_d" unchanged, while it still hangs in emulation mode. Putting __syncthreads() around the commonly read variables (like the row and column coefficients) does not help either. The indexes into filtimage_d are such that no two threads should ever be modifying the same location, so all threads should reach __syncthreads(). The problem is the same even with BIN_QTY == 1 and tilesize = 300, i.e. just one block of 300 threads in all. It also remains the same if I drop the global synchronization and distribute the two loops/chunks over two separate kernels (i.e. global synchronization by terminating the kernel).

How can the code work in execution mode but not in emulation mode? Is this a deadlock? Any pointers on preventing such a situation? I have been stuck on this for a long time, and any pointer would be highly appreciated.
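
The classic cause I have been trying to rule out is a __syncthreads() that not every thread reaches, i.e. the barrier sitting inside divergent control flow (purely illustrative, not my actual code):

   __global__ void divergentSync(float *out, int n)
   {
       // Hazard: __syncthreads() inside a divergent branch.
       if (threadIdx.x < n) {
           out[threadIdx.x] = 0.0f;
           __syncthreads();   // threads with threadIdx.x >= n never arrive,
                              // so the threads that do arrive wait forever
       }
   }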

Thanks & regards,

Aditi
dericheCSDEMD.txt (8.24 KB)