Question regarding CUDA streams

Hi,

I have code in which threads from different blocks can sometimes write to the same memory location. To be safe, I want to synchronize the threads across all blocks. When I do this by terminating the kernel and relaunching it, the output is correct, but there is a lot of overhead. I thought stream programming could be a solution, but given my application it would only help if kernels belonging to different streams could perform different operations. That does not appear to be possible (as far as I can see, different streams would operate on the same global device kernel). Is there a way I could do this? Some sort of switch or other provision?
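
Roughly, what I would like to write is something like the following (kernelA, kernelB, and the buffers are just placeholders for my two different operations):

   #include <cuda_runtime.h>

   // Placeholder kernels standing in for the two different operations.
   __global__ void kernelA(float *buf) { buf[threadIdx.x] += 1.0f; }
   __global__ void kernelB(float *buf) { buf[threadIdx.x] *= 2.0f; }

   void launchBoth(float *devBufA, float *devBufB)
   {
       cudaStream_t s0, s1;
       cudaStreamCreate(&s0);
       cudaStreamCreate(&s1);

       // Two different kernels, launched into two different streams.
       kernelA<<<1, 256, 0, s0>>>(devBufA);
       kernelB<<<1, 256, 0, s1>>>(devBufB);

       cudaStreamSynchronize(s0);
       cudaStreamSynchronize(s1);
       cudaStreamDestroy(s0);
       cudaStreamDestroy(s1);
   }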

Thanks & regards,

Aditi

I tried one more thing to synchronize/coordinate the threads from different blocks: using atomicAdd to have the zeroth thread of every block increment a counter by 1. A counter value equal to the number of blocks would mean that all blocks have crossed the "barrier" at that step. But this did not help. With the check if (counter == number_of_blocks), the steps inside the if statement never execute, although the counter does increase in order. Can someone suggest a technique I can use to synchronize the different blocks?
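
In code, the attempt looks roughly like this (the kernel name and launch details are illustrative, not from my actual code):

   __device__ unsigned int counter = 0;   // zeroed before the kernel launch

   __global__ void barrierAttempt(unsigned int numBlocks)
   {
       // ... step 1: each block does its part of the work ...

       if (threadIdx.x == 0)
           atomicAdd(&counter, 1);        // block announces it finished step 1

       // The check that never fires: most threads read the counter before
       // the last block has incremented it, and nothing makes them wait
       // and re-read it.
       if (counter == numBlocks) {
           // ... step 2: never executed for most (or all) blocks ...
       }
   }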

Thanks & regards,

Aditi

Please note that there is no shared memory involved right now. It's only global memory that I am dealing with for now, which the various blocks are trying to access.

Use atomicCAS() to implement spin-locking in your kernel.

Your atomicAdd() logic won't work because of the way the hardware schedules blocks. Blocks are scheduled in batches, one batch after another as the previous one completes. If one set of blocks stalls waiting for blocks in another set, you have a deadlock. Note that this block-scheduling behavior is UNDOCUMENTED; a few people here experimented and nailed it down, and it could change in future hardware releases. Don't build your design on this logic.

So, instead begin like this:

  1. Have this lock as a 32-bit or 64-bit value initialized to 0.

  2. Whenever a thread wants to take the lock, use this:

   while (atomicCAS(&gmem, 0, 0xFF) != 0)
       ;                     // spin until the CAS sees 0 and installs 0xFF

   -- critical section --

   __threadfence();          // make the critical-section writes visible to other blocks
   atomicExch(&gmem, 0);     // release the lock atomically, not with a plain store

Atomic operations are PERFORMANCE KILLERS. Be aware of it.
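
Putting the pieces together, a sketch of the whole pattern looks like this (the kernel and variable names are illustrative; one thread per block takes the lock so that a whole warp is not spinning against itself):

   __device__ unsigned int gmem = 0;      // 0 = free, 0xFF = taken

   __global__ void lockedUpdate(float *shared_location)
   {
       if (threadIdx.x == 0) {
           // Spin until the CAS sees 0 and installs 0xFF.
           while (atomicCAS(&gmem, 0, 0xFF) != 0)
               ;
       }
       __syncthreads();                   // whole block now holds the lock

       // -- critical section: touch the contended global location --
       if (threadIdx.x == 0)
           *shared_location += 1.0f;

       __syncthreads();
       if (threadIdx.x == 0) {
           __threadfence();               // flush the writes to other blocks
           atomicExch(&gmem, 0);          // release atomically
       }
   }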

Hi,

Thanks so much for your reply. It makes a lot of sense, and I have coded the mutex + global synchronization barrier based on your suggestion. But I am still facing a problem, most likely a deadlock. Please see the attached code. It hangs at the first __syncthreads() in the kernel (line 109, with the comment /* emulation mode hangs here */). Without the mutex + barrier, the code works normally in execution mode but gets stuck in emulation mode. With the mutex + barrier, the code runs to completion in execution mode (it does not hang) but leaves "filtimage_d" unchanged, while it still hangs in emulation mode. Putting __syncthreads() around the commonly read variables (like the row and column coefficients) does not help either. The indexes into filtimage_d are such that no two threads should ever be modifying the same location, so all threads should reach __syncthreads(). The problem is the same even with BIN_QTY == 1 and tilesize = 300, i.e. just one block of 300 threads in all. It also remains the same if I drop the global synchronization and distribute the two loops/chunks over two separate kernels (i.e. global synchronization by terminating the kernel).

How can the code work in execution mode but not in emulation mode? Is this a deadlock? Any pointers on preventing such a situation? I have been stuck on this for a long time, and any pointer would be highly appreciated.
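
The classic cause I have been trying to rule out is a __syncthreads() that not every thread reaches, i.e. the barrier sitting inside divergent control flow (purely illustrative, not my actual code):

   __global__ void divergentSync(float *out, int n)
   {
       // Hazard: __syncthreads() inside a divergent branch.
       if (threadIdx.x < n) {
           out[threadIdx.x] = 0.0f;
           __syncthreads();   // threads with threadIdx.x >= n never arrive,
                              // so the threads that do arrive wait forever
       }
   }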

Thanks & regards,

Aditi
dericheCSDEMD.txt (8.24 KB)