Nested loops in CUDA

I want to port the following code with 6 nested loops from the CPU to the GPU:

do msc=1,Ny
        do mtg=1,Ny
          do jsc=1,Nz
            do jtg=1,Nz
              do isc=1,Nx
                do itg=1,Nx    
                    indx=abs(itg-isc)+1
                    indz=abs(jtg-jsc)+1
                    hx (itg,jtg,mtg)=hx(itg,jtg,mtg)+ prodMsV(isc,jsc,msc)/dv(msc) *
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg)
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg) 
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg))
...
                enddo            
              enddo
            enddo
          enddo
        enddo
      enddo

My kernels for GPU:



attributes(global) SUBROUTINE Comp_nested_gpu
  implicit none
  Integer N, M, L
  Integer i, j, k, bx, by, bz
  Integer isc,jsc,msc,indx,indz
  Real(8) MsVdv
   
  ! Get the thread indices
  bx = blockidx%x
  by = blockidx%y
  bz = blockidx%z
	
  j = (bx-1) * blockdim%x + threadidx%x
  k = (by-1) * blockdim%y + threadidx%y
  i = (bz-1) * blockdim%z + threadidx%z

  N = ubound(mxDev,1)-1
  M = ubound(mxDev,2)-1
  L = ubound(mxDev,3)

  if ((j<=N).AND. (k<=M) .AND. (i<=L)) then
    call DirSum_in(j,k,i)
  endif

 END SUBROUTINE Comp_nested_gpu

attributes(device) SUBROUTINE DirSum_in(itg,jtg,mtg) 
  implicit none
  Integer itg,jtg,mtg
  Integer N, M, L
  Integer bx, by, bz
  Integer isc,jsc,msc,indx,indz
  Real(8) MsVdv
   
  ! Get the thread indices
  bx = blockidx%x
  by = blockidx%y
  bz = blockidx%z
	
  isc = (bx-1) * blockdim%x + threadidx%x
  jsc = (by-1) * blockdim%y + threadidx%y
  msc = (bz-1) * blockdim%z + threadidx%z

  N = ubound(mxDev,1)-1
  M = ubound(mxDev,2)-1
  L = ubound(mxDev,3)

  if ((isc<=N).AND. (jsc<=M) .AND. (msc<=L)) then
		...
    MsVdv = prodMsVDev(isc,jsc,msc)/dvDev(msc)
    hxdemDev(itg,jtg,mtg)=hxdemDev(itg,jtg,mtg)+MsVdv* &
                                           (mxDev(isc,jsc,msc)*wwmxmxDev(indx,indz,msc,mtg)+ &
                                            myDev(isc,jsc,msc)*wwmxmyDev(indx,indz,msc,mtg)+ &		
                                            mzDev(isc,jsc,msc)*wwmxmzDev(indx,indz,msc,mtg))

		...

  endif

END SUBROUTINE DirSum_in

The GPU code runs very quickly, but returns a wrong result.
Should I use only one kernel for such calculations?

Hi esemgn,

The problem here is that each thread only adds a single (isc,jsc,msc) term to hxdemDev. Instead, you’ll need to move the msc, jsc, and isc loops into the kernel and execute them sequentially per thread.

Also, I’d suggest using a scalar to accumulate the sum so you’re not repeatedly reading and writing global memory:

Basically

These loops become your CUDA kernel’s parallel (thread) indices:
        do mtg=1,Ny             
           do jtg=1,Nz  
                do itg=1,Nx

These loops become the body of your kernel:    
   sum = hx (itg,jtg,mtg)
   do msc=1,Ny 
          do jsc=1,Nz 
              do isc=1,Nx 
                    indx=abs(itg-isc)+1 
                    indz=abs(jtg-jsc)+1 
                    sum=sum+ prodMsV(isc,jsc,msc)/dv(msc) * 
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg) 
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg) 
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg)) 
... 
                enddo            
              enddo 
            enddo 
      hx (itg,jtg,mtg)=sum
          enddo 
        enddo 
      enddo
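
Put together as a single kernel, it would look something like the sketch below. This is only an outline: the “Dev” array names follow your first post, I’m assuming they are module device arrays visible to the kernel, and you’d need to adjust the bounds to your actual Nx/Nz/Ny dimensions.

attributes(global) SUBROUTINE DirSum_kernel
  implicit none
  Integer itg, jtg, mtg, isc, jsc, msc, indx, indz
  Integer N, M, L
  Real(8) sumx

  ! One thread per output element hxdemDev(itg,jtg,mtg)
  itg = (blockidx%x-1) * blockdim%x + threadidx%x
  jtg = (blockidx%y-1) * blockdim%y + threadidx%y
  mtg = (blockidx%z-1) * blockdim%z + threadidx%z

  N = ubound(hxdemDev,1)
  M = ubound(hxdemDev,2)
  L = ubound(hxdemDev,3)

  if ((itg<=N) .AND. (jtg<=M) .AND. (mtg<=L)) then
    sumx = hxdemDev(itg,jtg,mtg)             ! read global memory once
    do msc = 1, L
      do jsc = 1, M
        do isc = 1, N
          indx = abs(itg-isc) + 1
          indz = abs(jtg-jsc) + 1
          sumx = sumx + prodMsVDev(isc,jsc,msc)/dvDev(msc) * &
                 (mxDev(isc,jsc,msc)*wwmxmxDev(indx,indz,msc,mtg) + &
                  myDev(isc,jsc,msc)*wwmxmyDev(indx,indz,msc,mtg) + &
                  mzDev(isc,jsc,msc)*wwmxmzDev(indx,indz,msc,mtg))
        enddo
      enddo
    enddo
    hxdemDev(itg,jtg,mtg) = sumx             ! write global memory once
  endif
END SUBROUTINE DirSum_kernel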

Hope this helps,
Mat

Hi Mat,

thank you for your answer!

Following your tips, I modified the kernel as follows:

! Get the thread indices
bx = blockidx%x
by = blockidx%y
bz = blockidx%z

j = (bx-1) * blockdim%x + threadidx%x
k = (by-1) * blockdim%y + threadidx%y
i = (bz-1) * blockdim%z + threadidx%z

N = ubound(mxDev,1)-1
M = ubound(mxDev,2)-1
L = ubound(mxDev,3)

if ((j<=N) .AND. (k<=M) .AND. (i<=L)) then
  sumx = hx(j,k,i)

  do msc=1,L
    do jsc=1,M
      do isc=1,N

        indx=abs(j-isc)+1
        indz=abs(k-jsc)+1

        sumx = sumx + prodMsVDev(isc,jsc,msc)/dvDev(msc)* &
                      (mx(isc,jsc,msc)*wwmx(indx,indz,msc,i)+ &
                       my(isc,jsc,msc)*wwmy(indx,indz,msc,i)+ &
                       mz(isc,jsc,msc)*wwmz(indx,indz,msc,i))

        ...
      enddo
    enddo
  enddo
  hx(j,k,i) = sumx
endif

This kernel works correctly, but it is only about 2 times faster (for a 40x40x1 system) than the CPU code.

  1. Is there some way to speed up such calculations on GPU?

  2. Why does the version with 2 nested kernels (see my previous question), which works ~20 times faster than the CPU version, return a wrong result?

Elena

This kernel works correctly, but it is only about 2 times faster (for a 40x40x1 system) than the CPU code.

  1. Is there some way to speed up such calculations on GPU?

Just to be clear, the problem size is NX=40, NY=40, and NZ=1? That seems very small, so I wouldn’t expect to see huge speed-ups at this size. If your problem scales to larger sizes, you may want to compare runtimes at a larger size to get a better idea of the performance.

If this is the normal size you use, then you’ll want to exploit more parallelism by making the inner loops a parallel reduction. Though this can be quite challenging in CUDA, so I’d suggest switching to OpenACC instead, where the compiler will do the heavy lifting.

For data management, you can either switch to using OpenACC data directives or keep the CUDA Fortran “device” arrays. Either will work since the PGI compiler will detect if you’re using “device” arrays in an OpenACC compute region.

You may need to try a few different schedules, but I’d start by collapsing the outer three loops across a gang (block), and the inner three across a vector (thread).

Something like:

!$acc kernels loop gang collapse(3)
        do mtg=1,Ny              
           do jtg=1,Nz  
                do itg=1,Nx 
  
   sumx = hx (itg,jtg,mtg) 

!$acc loop vector collapse(3) reduction(+:sumx)
   do isc=1,Nx 
          do jsc=1,Nz 
              do msc=1,Ny 
                    indx=abs(itg-isc)+1 
                    indz=abs(jtg-jsc)+1 
                    sumx=sumx+ prodMsV(isc,jsc,msc)/dv(msc) * 
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg) 
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg) 
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg)) 
... 
                enddo            
              enddo 
            enddo 
            hx (itg,jtg,mtg)=sumx 
          enddo 
        enddo 
      enddo



  2. Why does the version with 2 nested kernels (see my previous question), which works ~20 times faster than the CPU version, return a wrong result?

Most likely because you weren’t doing the same amount of work: each thread summed a single element rather than looping through all the elements.

-Mat

Thanks a lot, Mat,

I implemented your code with OpenACC directives, and it works faster,
but only about 3 times faster than the CPU. This seems to be really poor performance, since we have a system containing 40 x 40 = 1600 cells, and the field computation (evaluation of the elements of the arrays hx, hy, hz) for each cell is completely independent.

Now my main question is:
by collapsing the outer three loops across a gang (block), we obtain threads that work independently. Thus, in my understanding, the evaluation of the element hx(i,j,k) can be done fully independently for each i, j, and k, and thus in parallel with the evaluation of the other elements of this array. (All arrays are device arrays.)

Does this mean that I can expect a speed-up of ~ Nx * Ny * Nz = 40 x 40? If not, can you please explain:

  1. why this is not the case, and
  2. whether there is any method that would provide the expected speed-up?

This question is of primary importance for us, as we always deal with systems where we perform an independent summation for each particle (cell).

Best regards,
Elena

Hi Elena,

In the initial CUDA Fortran version you were using a total of 1600 threads. Given that each streaming multiprocessor (SM) on a Pascal can keep 2048 threads resident and that a Pascal has 56 SMs, 1600 threads use only a small fraction of the processing units. This is why I said that I wouldn’t expect much speed-up.

In the OpenACC case, we changed this to use 1600 gangs (blocks) with a vector length of 128 for a total of 204,800 threads. So here, I would expect better performance.
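
For reference, the rough arithmetic behind those numbers (using the Pascal limits above):

  CUDA Fortran version:  40 x 40 x 1 threads              =   1,600 threads
  OpenACC version:       1,600 gangs x 128 vector length  = 204,800 threads
  Pascal capacity:       56 SMs x 2,048 resident threads  = 114,688 threads in flight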

Have you profiled the code to see where the time is being spent?

If not, try running the code using the PGI profiler, pgprof, or by setting the environment variable “PGI_ACC_TIME=1”. If you can, please post the output so I can help determine where the time is being spent.

Typically, the first thing to look at is the data movement cost. If you are running the compute kernel multiple times and haven’t added a higher level data region, you could be unnecessarily copying data back and forth between the host and device.
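
As a minimal sketch of what I mean by a higher-level data region (the step loop, its count, and the routine name below are placeholders, not your actual code):

! Outline only: the data clauses sit outside the loop that repeatedly
! launches the kernel, so the arrays stay resident on the device for the
! whole run instead of being copied back and forth on every call.
!$acc data copyin(mx, my, mz, prodMsV, dv, wwmx, wwmy, wwmz) copy(hx)
   do step = 1, nsteps
      call comp_hdem_dirsum()   ! its compute region finds the arrays already present
   enddo
!$acc end data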

The next thing to look at is data layout. Data should be accessed across vectors in the stride-1 dimension, which in this case is the “isc” dimension. This is why I suggested putting “loop vector” on the inner loop. Though since Nx is 40 and the vector length is 128, it’s most likely best to collapse the loops. The “wwm” arrays will have poor memory access, but at least the “prodMsV”, “mx”, “my”, and “mz” arrays will be OK.

You can also try different schedules for the inner loops. For example something like:

!$acc loop worker(4) reduction(+:sumx) 
   do msc=1,Ny 
       dvtmp = dv(msc)
!$acc loop vector(32) reduction(+:sumx) 
       do isc=1,Nx 
          do jsc=1,Nz 
                    indx=abs(itg-isc)+1 
                    indz=abs(jtg-jsc)+1 
                    sumx=sumx+ prodMsV(isc,jsc,msc)/dvtmp * 
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg) 
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg) 
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg)) 
... 
                enddo            
              enddo 
            enddo

The problem here is that since Nx is 40 and the vector length is 32 (for GPUs the vector length will always be a multiple of 32), you’ll process the first 32 elements in parallel, but only 8 in the second pass, with 24 threads doing no work. If your problem could be resized to Nx=64, then it would take approximately the same amount of time.
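
As a small illustration of that arithmetic (the helper variables are hypothetical):

   integer :: passes, idle_last
   passes    = (Nx + 31) / 32    ! integer division: 2 vector passes for Nx = 40 and for Nx = 64
   idle_last = passes*32 - Nx    ! threads idle in the last pass: 24 for Nx = 40, 0 for Nx = 64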

-Mat

Hi Mat,

thank you very much for your hints!

If I take Nx = 32, it really works ~4x faster on the GPU than with Nx = 40.

  1. Concerning profiling:

I made a profiling of my code. Please find the output under:
https://www.dropbox.com/s/b52eoh9ndng59oc/Profiling_1.jpg?dl=0

and zoomed ones
https://www.dropbox.com/s/pjmhq5085c4itiv/Profiling_2.jpg?dl=0

https://www.dropbox.com/s/z8hfvqfm8600yx7/Profiling_3.jpg?dl=0

It seems that there is no data movement between host and device during the kernel calculation (subroutine comp_hdem_dirsum). Is this correct?
(My program calls this kernel many times.) It takes about 93% of the computation time.

  2. My next important question is the following:

You write that the “wwm” arrays will have poor memory access. So how can I change these arrays to get better memory access? If I make them one-dimensional, will it be better?
Will it help if I also rewrite the “mx”, “my”, and “mz” arrays as one-dimensional?

  3. And my last issue:
    The alternative schedule for the inner loop that you suggested returns a wrong result:

!$acc loop worker(4) reduction(+:sumx)
   do msc=1,Ny
       dvtmp = dv(msc)
!$acc loop vector(32) reduction(+:sumx)
       do isc=1,Nx
          do jsc=1,Nz
                    indx=abs(itg-isc)+1
                    indz=abs(jtg-jsc)+1
                    sumx=sumx+ prodMsV(isc,jsc,msc)/dvtmp *
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg)
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg)
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg))

          enddo
       enddo
   enddo


Why?

Best regards,
Elena

Hi Elena,

Apologies for missing your follow-up question.

It seems that there is no data movement between host and device during the kernel calculation (subroutine comp_hdem_dirsum). Is this correct?

That looks correct to me. There is some data movement, but it looks to occur after your compute regions with reductions.

You write that the “wwm” arrays will have poor memory access. So how can I change these arrays to get better memory access? If I make them one-dimensional, will it be better?
Will it help if I also rewrite the “mx”, “my”, and “mz” arrays as one-dimensional?

On second look, I might take that back. So long as the “indx” values are consecutive across the “isc” loop, access to the “wwm” arrays should be OK.

Using 1-D arrays probably won’t help here, and it may make it harder for the compiler to vectorize since you’d need to use computed indices.
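
Just to illustrate the point (the 1-D array “mx1d” and its assumed layout are hypothetical):

! 3-D form: the compiler sees the unit stride in isc directly
sumx = sumx + mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg)

! 1-D rewrite: the same element through a computed offset, which hides the
! unit stride and adds integer arithmetic in every iteration
sumx = sumx + mx1d(isc + Nx*((jsc-1) + Nz*(msc-1)))*wwmx(indx,indz,msc,mtg)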

The alternative schedule for the inner loop that you suggested returns a wrong result:

Unfortunately, I’m not sure. Though typically wrong answers are caused by a missing update where the device and host copies of a variable are out of sync. Can you share a reproducing example of the issue?
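
For example, if an array is changed on the host between compute regions, or its results are read on the host, an explicit update is what keeps the two copies in sync (“prodMsV” and “hx” below are just placeholders for whichever array is affected):

!$acc update device(prodMsV)   ! push a host-side change to the device copy
!$acc update host(hx)          ! pull device results back before using them on the host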

-Mat

Hi Mat,

thank you for your response!


Now I have a very important question.
In my code:

!$acc kernels loop gang collapse(3) 
        do mtg=1,Ny              
           do jtg=1,Nz  
                do itg=1,Nx 
  
   sumx = hx (itg,jtg,mtg) 

!$acc loop vector collapse(3) reduction(+:sumx) 
   do msc=1,Ny 
          do jsc=1,Nz 
              do isc=1,Nx 
                    indx=abs(itg-isc)+1 
                    indz=abs(jtg-jsc)+1 
                    sumx=sumx+ prodMsV(isc,jsc,msc)/dv(msc) * 
     &               (mx(isc,jsc,msc)*wwmx(indx,indz,msc,mtg) 
     &               +my(isc,jsc,msc)*wwmy(indx,indz,msc,mtg) 
     &               +mz(isc,jsc,msc)*wwmz(indx,indz,msc,mtg)) 
... 
                enddo            
              enddo 
            enddo 
            hx (itg,jtg,mtg)=sumx 
          enddo 
        enddo 
      enddo

the evaluation of each array element hx(itg,jtg,mtg) is performed independently of the other elements of this array.
For this reason I would expect a very large acceleration compared to the CPU, taking into account the large number of cores on my GPU (GTX 770) device (as far as I understand, the evaluation of all array elements should be performed simultaneously).

However, in reality the acceleration compared to a single CPU core is only 2.5 times for the 32 x 32 x 1 system (and ~3.5 times for 64 x 64 x 1).

What is the reason for such a small acceleration?
Is it possible to achieve a much higher acceleration?

Best regards,

Elena

What is the reason for such a small acceleration?
Is it possible to achieve a much higher acceleration?

Without a reproducing example, I can’t really say. What I’d suggest is using the PGI profiler, pgprof, to see where the time is being spent. Is it in the compute regions, or is data movement dominating the time?

Also, what are the actual run times? There is some initialization and overhead cost that could dominate the overall time if the runtime is very short.

-Mat

Hi Mat,

I made a reproducing example, but I have problems posting it here. Maybe it is too long? How can I do it?

Can I make it per email?

Best regards,
Elena

Thanks Elena. Please send the example source to PGI Customer Service (trs@pgroup.com) and ask them to forward it to me.

-Mat

Mat,

What was the eventual resolution to this ticket? Did Elena get her speedup?

Dan

Hi Dan,

Did Elena get her speedup?

I’m not sure. I looked through my old email from December 2017 and don’t see that Elena sent in any code. The only email I have from her is from 2015. It’s possible I didn’t save it, but I typically save all emails from users. Granted, the Customer Support folks could have helped her directly instead of forwarding her note to me.

Do you have a similar issue?

-Mat