SYNCHRONIZATION OF THREADBLOCKS NEEDED

We have a need for threadblock synchronization. Here is a shortened piece of one of our kernels:

attributes(global) subroutine tridiag_compRUB_kernel &
(Xdev,Ldev,Udev,Ddev,Bdev,N,pausaliter)
use cudafor
implicit none
real,device :: Xdev(N)
real,device :: Ldev(N) ,Udev(N) ,Ddev(N) ,Bdev(N)
real,shared :: Xsub(512)

… other declarations …

integer :: tx,counter


tx = threadidx%x

if(threadidx%x.eq.1) counter = 0

do while (counter < pausaliter)

if(tx.eq.1) counter = counter + 1


if(tx.EQ.1)Xsub(1)=Xdev(ibsum)
if(tx.EQ.blockdim%x)Xsub(blockdim%x+2)=Xdev(ibsum+blockdim%x+1)


… some code …

call syncthreads()
! SOME CALCULATION:
Xsub(tx+1)=((Bdev(ibsum+tx)-Ldev(ibsum+tx)*Xsub &
(tx)-Udev(ibsum+tx)*Xsub(tx+2)) &
/Ddev(ibsum+tx))*relax + (1.-relax)*Xsub(tx+1)

call syncthreads()


! AFTER MATRIX DECOMPOSITION AMONG THREADBLOCKS, ONLY SOME BOUNDARY VALUES OF A THREAD BLOCK (I.E. VALUES CALCULATED BY FIRST AND 512TH THREAD) ARE TRANSFERED TO GLOBAL MEMORY. THESE REFRESHED VALUES ARE WRITTEN TO Xdev(ibsum+1) AND Xdev(ibsum+blockdim%x) ON THE GLOBAL MEMORY AND THEY ARE AVAILABLE FOR OTHER THREADBLOCKS IN THE NEXT DO WHILE PASS.

!THREAD 1 WRITES THESE VALUES ON DEVICE MEMORY:
if(tx.EQ.1) then
Xdev(ibsum+1)=Xsub(1)
Xdev(ibsum+blockdim%x) = Xsub(blockdim%x)
endif
enddo !end of DO WHILE

Question is how to make values written to Xdev(ibsum+1) AND Xdev(ibsum+blockdim%x) ON THE GLOBAL MEMORY visible to all other threadblocks when there is no synchronization amon them. I gues we would need something like _threadfence() as in C CUDA (CUDA Programming Guide Version 2.3.1 - Memory Fence Functions).

Sincerely,

Hi mcavrak,

Our engineers are working on adding atomics and threadfence to the 10.5 release, but I think you might be able to just add in a few more syncthreads.

attributes(global) subroutine tridiag_compRUB_kernel &
(Xdev,Ldev,Udev,Ddev,Bdev,N,pausaliter)
use cudafor
implicit none
real,device :: Xdev(N)
real,device :: Ldev(N) ,Udev(N) ,Ddev(N) ,Bdev(N)
real,shared :: Xsub(512)

.... other declarations ...........

integer :: tx,counter

tx = threadidx%x

if(threadidx%x.eq.1) counter = 0
call syncthreads()  ! make sure all the threads have the same value for counter.

do while (counter < pausaliter)

if(tx.eq.1) the
   counter = counter + 1  ! You could use and atomicAdd here
    Xsub(1)=Xdev(ibsum)
    Xsub(blockdim%x+2)=Xdev(ibsum+blockdim%x+1) 
 ! potential bug if blockdim%x > 510 since Xsub only has 512 elements.  
 ! Should Xsub have 514 elements? Should the +2 be removed? 
 ! Are there only 510 threads?
endif
call syncthreads()

............... some code .............

call syncthreads()
! SOME CALCULATION:
! Xsub(tx+1)=((Bdev(ibsum+tx)-Ldev(ibsum+tx)*Xsub &
! (tx)-Udev(ibsum+tx)*Xsub(tx+2)) &
! /Ddev(ibsum+tx))*relax + (1.-relax)*Xsub(tx+1)
! This code seems potentially dangerous since your updating values on the LHS
! that are needed by other threads on the RHS.  
! The values in Xsub will depend upon the order in which the threads are executed.
! If you need to keep this dependency, then this calcuation should be done serially 
! by a single thread.

! However, assuming that you only want to work with the old values of Xsub and
! are not relying on order, maybe this might work better?

tmpval = ((Bdev(ibsum+tx)-Ldev(ibsum+tx)*Xsub &
 (tx)-Udev(ibsum+tx)*Xsub(tx+2)) &
 /Ddev(ibsum+tx))*relax + (1.-relax)*Xsub(tx+1)
call syncthreads()
Xsub(tx+1) = tmpval

call syncthreads()


! AFTER MATRIX DECOMPOSITION AMONG THREADBLOCKS, ONLY SOME BOUNDARY VALUES OF A THREAD BLOCK (I.E. VALUES CALCULATED BY FIRST AND 512TH THREAD) ARE TRANSFERED TO GLOBAL MEMORY. THESE REFRESHED VALUES ARE WRITTEN TO Xdev(ibsum+1) AND Xdev(ibsum+blockdim%x) ON THE GLOBAL MEMORY AND THEY ARE AVAILABLE FOR OTHER THREADBLOCKS IN THE NEXT DO WHILE PASS.

!THREAD 1 WRITES THESE VALUES ON DEVICE MEMORY:
if(tx.EQ.1) then
Xdev(ibsum+1)=Xsub(1)
Xdev(ibsum+blockdim%x) = Xsub(blockdim%x)
endif

enddo !end of DO WHILE

With all the synchronization, I do worry about your performance. Though, getting it working correctly is the first step. Performance comes second.

  • Mat

First of all thanks Mat for quick reply,

\

  1. Why do I need all threads to have the same value of the counter that is locally defined by each thread and not in shared or device memory?

integer :: tx,counter

tx = threadidx%x

if(threadidx%x.eq.1) counter = 0
call syncthreads() ! make sure all the threads have the same value for counter.

do while (counter < pausaliter)

if(tx.eq.1) then counter = counter + 1
! You could use and atomicAdd here
!--------------------------------------------------
!== The same is here. Atomics are used for reductions on the device level, isnt’t that right?
!--------------------------------------------------

enddo


2. The main question was if there is a way that writes to device memory performed by a thread inside one block is seen for reading to shared memory by a thread in another block performed in the next while loop.
By this I mean to make a fence at the end of a previous loop for others to access new refreshed values at the begining of a next loop.

Example:

real,shared :: Xsub(512+2)
! Xsub is 514 :
idx = 1 for first interface data
(calculated by a last thread in the previous block blockidx%x-1)
idx = 514 for second interface data
(calculated by a first thread in the next block blockidx%x+1)


id = (blockidx%x-1)*blockdim%x

! Read corresponding segment of Xdev from a global memory to Xsub (2:513)
Xsub(threadidx%x+1) = Xdev(id) ! TRANSFER OF MAIN VECTOR

do while (counter < pausaliter)

! AN INTERFACE DATA READ FROM GLOBAL MEMORY
Xsub(1) = Xdev(id-1)
Xsub(blockdim%x+2)=Xdev(id+blockdim%x+1)


!.. SOME WORK

! AN INTERFACE DATA STORE TO GLOBAL MEMORY
Xdev(id) = Xsub(1)
Xdev(id+blockdim%x) = Xsub(blockdim%x)

!! POTENTIAL LOCATION FOR A threadfence() command
! OR SOME OTHER

enddo

! Write Xsub(2:513) to a corresponding segment of Xdev in a global memory
Xdev(id) = Xsub(threadidx%x+1)


So should we wait for 10.5 to address a synchronization of threadblocks or is there another way.

Sincerely,


Marko & Lado

Hi Marko & Lado,

Why do I need all threads to have the same value of the counter that is locally defined by each thread and not in shared or device memory?

Sorry, I should have looked how you had counter declared. Your use of the guard around counter lead me to believe it was shared. In this case, you should remove the guards else all the threads except one will enter an infinite loop. (unless counter is incremented for the other threads elsewhere in the code).


  1. The main question was if there is a way that writes to device memory performed by a thread inside one block is seen for reading to shared memory by a thread in another block performed in the next while loop.
    By this I mean to make a fence at the end of a previous loop for others to access new refreshed values at the begining of a next loop.

Where you want to put a threadfence or syncthreads call is before you need to access the shared data. I’d also only have one thread make the reads and writes to global memory.

! Double check this, since your id will be the same for all threads in this block
! id = (blockidx%x-1)*blockdim%x
! I think you want a unique index for each thread
id  = (blockIdx%x-1)*blockDim%x + threadIdx%x


! Read corresponding segment of Xdev from a global memory to Xsub (2:513)
Xsub(threadidx%x+1) = Xdev(id) ! TRANSFER OF MAIN VECTOR

do while (counter < pausaliter)

! AN INTERFACE DATA READ FROM GLOBAL MEMORY
if (threadidx%x .EQ. 1) then
Xsub(1) = Xdev(id-1)
Xsub(blockdim%x+2)=Xdev(id+blockdim%x+1)
endif
call syncthreads()   ! have the threads wait until the write to Xsub is complete

!... SOME WORK

! AN INTERFACE DATA STORE TO GLOBAL MEMORY
call syncthreads()   ! Make sure Xsub's writes are complete for all threads. 
                               ! Could use threadshare here.
if (threadidx%x .EQ. 1) then 
   Xdev(id) = Xsub(1)
   Xdev(id+blockdim%x) = Xsub(blockdim%x)
endif

! you could put a syncthreads here, but the one at the top of the loop should suffice.

enddo

! Write Xsub(2:513) to a corresponding segment of Xdev in a global memory
Xdev(id) = Xsub(threadidx%x+1)
  • Mat