thread-local variables

I have a kernel subroutine like this:

attributes(global) subroutine FrontSweep_cuda       

 integer :: d, s, w, n_ways
 integer :: i, j
 integer :: fmi, fmj, fmk, fm_id
 real :: in_bound(3), out_bound(3), lax3_factor(3)
 real :: bot, top, avg, out, IsZero

 do s=3, ncap_gpu

 n_ways=ini_stage(s+1)-ini_stage(s)

 do w=blockidx%x, n_ways, griddim%x
 do d=threadidx%x, num_dir_gpu, blockdim%x 
 
     if(o_pnt_gpu(1) .lt. 0) then
        fmi=cap_gpu(1)- x_draw( ini_stage(s)+w-1 )+1
     else
        fmi=x_draw( ini_stage(s)+w-1 )
     endif
   
     if(o_pnt_gpu(2) .lt. 0) then
        fmj=cap_gpu(2)- y_draw( ini_stage(s)+w-1 )+1
     else
        fmj=y_draw( ini_stage(s)+w-1 )
     endif
		  
     if(o_pnt_gpu(3) .lt. 0) then
        fmk=cap_gpu(3)- z_draw( ini_stage(s)+w-1 )+1
     else
        fmk=z_draw( ini_stage(s)+w-1 )
     endif
		  
     in_bound(1)=iflux(fmj,fmk,d)
     in_bound(2)=jflux(fmi,fmk,d)
     in_bound(3)=kflux(fmi,fmj,d)
 
     fm_id=m_matrix_gpu(fmi,fmj,fmk)
     IsZero=0
     lax3_factor=1
     bot=sigt_gpu(fm_id)
     top=asrcflx_gpu(fmi,fmj,fmk,d)
     do i=1,3
       bot=bot+2*cos_dager_gpu(i,d)
       top=top+2*cos_dager_gpu(i,d)*in_bound(i)
     enddo

  do while (IsZero .eq. 0)
   
    avg=top/bot
    IsZero=1
   
    do i=1,3
      if(lax3_factor(i) .eq. 0) cycle
    
      out=2*avg-in_bound(i)
    
      if (out .lt. 0)  then
        out_bound(i)=0.0
        lax3_factor(i)=0
        top=top-cos_dager_gpu(i,d)*in_bound(i)
        bot=bot-2*cos_dager_gpu(i,d)
        IsZero =0 
        exit
      else
        out_bound(i)=out 
      endif
   enddo
  enddo !do while

  iflux(fmj,fmk,d)=out_bound(1)
  jflux(fmi,fmk,d)=out_bound(2)
  kflux(fmi,fmj,d)=out_bound(3)

  asrcflx_gpu(fmi,fmj,fmk,d)=avg 

 enddo !dir
 
 enddo !ways
 
 call syncthreads()
 enddo !stage
 
 return
end subroutine

All the variables I declared at the beginning are intended to be local to a thread, meaning each thread will see its own values for them.
I wonder if the compiler can recognize them as thread-local, including in_bound(3), out_bound(3), and lax3_factor(3)?

Hi tty103,

I wonder if the compiler can recognize them as thread-local, including in_bound(3), out_bound(3), and lax3_factor(3)?

All local variables declared in device code are private to each thread, so these three arrays do not share storage across threads. For storage shared within a block of threads, you would need to explicitly declare the variables with the ‘shared’ attribute.
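
For illustration, here is a minimal sketch (the kernel and array names are made up) contrasting a thread-private local with a block-shared array:

attributes(global) subroutine sum_kernel(x, blocksums, n)
 real, device :: x(*), blocksums(*)
 integer, value :: n
 real :: myval                  ! local scalar: every thread has its own copy
 real, shared :: partial(256)   ! shared array: one copy per thread block (assumes blockdim%x <= 256)
 integer :: i, t, k
 real :: s

 t = threadidx%x
 i = (blockidx%x - 1)*blockdim%x + t

 myval = 0.0                    ! private to this thread
 if (i .le. n) myval = x(i)

 partial(t) = myval             ! visible to all threads in the same block
 call syncthreads()

 if (t .eq. 1) then             ! one thread reduces the block-shared values
   s = 0.0
   do k = 1, blockdim%x
     s = s + partial(k)
   enddo
   blocksums(blockidx%x) = s
 endif
end subroutine sum_kernel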

  • Mat

Thanks. Another question, sorry if this has been asked before.

module mCuda
real, allocatable, device :: MyArray(:,:,:)

contains

attributes(global) subroutine test_cuda

integer i,j,k

do i=threadidx%x , 100, blockdim%x
do j=threadidx%y , 100, blockdim%y
do k=threadidx%z,  100, blockdim%z

  MyArray(i,j,k)=1

enddo
enddo
enddo

end subroutine
end module

i, j, k are local to each thread; I wonder how the compiler knows MyArray is not, i.e. that I only need to allocate one copy of MyArray, not one copy per thread.

If I pull i, j, k out of the subroutine:

module mCuda
real, allocatable, device :: MyArray(:,:,:)
integer i, j, k

contains

attributes(global) subroutine test_cuda

do i=threadidx%x , 100, blockdim%x
do j=threadidx%y , 100, blockdim%y
do k=threadidx%z,  100, blockdim%z

  MyArray(i,j,k)=1

enddo
enddo
enddo

end subroutine
end module

does the compiler still know that i, j, k are local?

i, j, k are local to each thread; I wonder how the compiler knows MyArray is not, i.e. that I only need to allocate one copy of MyArray, not one copy per thread.

Because MyArray has module scope and hence is visible to all routines within the module. Since it also carries the ‘device’ attribute, there is a single copy of it in device global memory that all threads access.

does the compiler still know that i, j, k are local?

But they aren’t local any longer. By moving them to the module data section, they are given module scope and hence are shared rather than private to each thread.

Also, they are host variables (no ‘device’ attribute), so you’ll have problems accessing them on the device.
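
Assuming you want to keep i, j, k private to each thread, a minimal corrected sketch of your module could look like this (the host-side launch configuration is only illustrative):

module mCuda
 use cudafor
 real, allocatable, device :: MyArray(:,:,:)  ! module scope + device attribute: one copy in GPU global memory

contains

 attributes(global) subroutine test_cuda
  integer :: i, j, k                          ! kernel locals: private to each thread

  do i = threadidx%x, 100, blockdim%x
  do j = threadidx%y, 100, blockdim%y
  do k = threadidx%z, 100, blockdim%z
    MyArray(i,j,k) = 1.0
  enddo
  enddo
  enddo

 end subroutine
end module

! Host code (illustrative):
!  use mCuda
!  allocate(MyArray(100,100,100))             ! allocated once on the device
!  call test_cuda<<<1, dim3(8,8,8)>>>()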

  • Mat