Not able to launch kernels asynchronously with OpenACC

I am trying to run multiple loops in parallel. I used the async clause on the compute directives, but the kernels are not launching asynchronously. I have confirmed that by looking at the nsys profile generated. I am also attaching the .qdrep file for reference. I am not able to figure out what mistake I am making. Did I use the async clause in the correct location?

Those four segments should have all started at once.

[screenshot: Nsight Systems timeline showing the four kernels launching back-to-back]

Code:

PROGRAM Test

	use declare_variables
	implicit none
	
	call CPU_TIME(start_time)
	call ALLOCATE_VARIABLES()

	! ------------------------------
	! DO WHILE (time < t_end)
	DO WHILE (iter < 3)
	
		call Compute_Prims()		
		call Compute_Conservs()
		
		!$acc wait
		
		time = time + time_step
		iter = iter + 1
		
		call CPU_TIME(end_time)
		print*, time, end_time-start_time
	ENDDO
	! ------------------------------
	
	call CPU_TIME(end_time)
	print*, 'Total wall clock time taken = ', end_time-start_time, 'secs'

END


SUBROUTINE ALLOCATE_VARIABLES()


	use declare_variables
	implicit none	
	
	nblocks = 1
	ALLOCATE(NI(nblocks))
	ALLOCATE(NJ(nblocks))
	ALLOCATE(NK(nblocks))
	
	NI = 64
	NJ = 64
	NK = 64
	nprims = 5
	nconserv = 5
	time_Step = 0.001d0
	t_end = 0.5d0
	
	NImax = MAXVAL(NI)
	NJmax = MAXVAL(NJ)
	NKmax = MAXVAL(NK)
	
	ALLOCATE(Px(NImax,NJmax,NKmax,nblocks,nprims))
	ALLOCATE(Py(NImax,NJmax,NKmax,nblocks,nprims))
	ALLOCATE(Pz(NImax,NJmax,NKmax,nblocks,nprims))
	
	ALLOCATE(Cx(NImax,NJmax,NKmax,nblocks,nconserv))
	ALLOCATE(Cy(NImax,NJmax,NKmax,nblocks,nconserv))
	ALLOCATE(Cz(NImax,NJmax,NKmax,nblocks,nconserv))
	
!$acc enter data copyin(NI,NJ,NK), create(Px,Py,Pz,Cx,Cy,Cz)

END


SUBROUTINE Compute_Prims()


	use declare_variables
	implicit none
    
	integer :: queue
	
	!$acc parallel loop gang vector collapse(4) async(1)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,1) = i*j
			
			Px(i,j,k,nbl,2) = j*k
			
			Px(i,j,k,nbl,3) = k*i
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO
	
	!$acc parallel loop gang vector collapse(4) async(2)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			
			Px(i,j,k,nbl,4) = i*i
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO
	
	!$acc parallel loop gang vector collapse(4) async(3)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,5) = j*j
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO

END


SUBROUTINE Compute_Conservs()


	use declare_variables
	implicit none
	
	integer :: queue
	
	!$acc parallel loop gang vector collapse(4) async(4)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
		Cx(i,j,k,nbl,1) = i*j
		                  
		Cx(i,j,k,nbl,2) = j*k
		                  
		Cx(i,j,k,nbl,3) = k*i
		                  
		Cx(i,j,k,nbl,4) = i*i
		                  
		Cx(i,j,k,nbl,5) = j*j
		endif
	ENDDO
	ENDDO
	ENDDO
	ENDDO

END

I am also attaching the full code (both test_Async.f90 and Module.f90 should be run). Thanks.

Module.f90 (745 Bytes)
test_async.f90 (2.8 KB)
Makefile (514 Bytes)
test_async1.qdrep (382.2 KB)

They look OK to me. Though keep in mind that if a kernel is using all the resources on the device, the next kernel can’t start until the first one begins to free up resources. Also, these loops are doing almost no work, so it could be taking longer to launch the kernels than for the kernels to execute.

I normally would not recommend this, but just to illustrate, add “num_gangs(16)” to each of your compute regions so that each kernel uses only about one SM and takes longer to finish. In that case, you’ll see the kernels overlap in your profile.
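For instance, the first loop in Compute_Prims would become something like the following (the clause is there purely to throttle the kernel for the experiment, not as a recommendation):

```fortran
	! Experiment only: num_gangs(16) caps the kernel at 16 gangs
	! (~1 SM at 128 threads per gang) so it runs long enough to overlap.
	!$acc parallel loop gang vector collapse(4) num_gangs(16) async(1)
	DO nbl = 1,nblocks
	DO k = 1, NKmax
	DO j = 1, NJmax
	DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,1) = i*j
		endif
	ENDDO
	ENDDO
	ENDDO
	ENDDO
```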

In general, ‘async’ is best used:

  • to overlap data movement and compute, with multiple queues: one for data and one for compute,
  • to hide kernel launch latency (using just ‘async’ with no queue id), so the host can launch more kernels while another is running. Though this only helps if the kernel takes longer to execute than it takes to launch.
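The first pattern usually looks something like this sketch (the array A and the names n, nchunks, c, q are illustrative, not from the code above; A is assumed already created on the device). Each chunk’s transfer and compute go on the same queue, so they stay ordered relative to each other, while alternating between two queues lets chunk c+1’s copy overlap chunk c’s compute:

```fortran
do c = 1, nchunks
   q = mod(c, 2) + 1                 ! re-use two queues, alternating
   !$acc update device(A(:,c)) async(q)
   !$acc parallel loop async(q)      ! same queue: compute follows its copy
   do i = 1, n
      A(i,c) = 2.0d0 * A(i,c)        ! placeholder work
   end do
end do
!$acc wait                           ! join both queues before using results
```

This also illustrates the queue re-use point below: two queues are created once and cycled, rather than one per chunk.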

The only time running concurrent kernels on multiple async queues really works well is when the number of gangs is relatively small and each kernel under-utilizes the device; adding more concurrent kernels can then use the idle resources. But it’s typically better to have each kernel fully utilize the GPU on its own, and to only use this method when the algorithm doesn’t allow for it.

Also consider that there is overhead in creating async queues (CUDA streams), so you don’t want to use too many, and you want to re-use them.

I am really surprised to note that while a kernel can launch with NImax = NJmax = NKmax = 128 (or 256), two separate compute regions with NImax = NJmax = NKmax = 64 each can’t be launched asynchronously. My point is: if the GPU has enough resources to launch one kernel with NImax = NJmax = NKmax = 128, why can’t it launch two kernels of NImax = NJmax = NKmax = 64 in parallel? Also, how do I assess whether a kernel I have launched is fully utilizing the GPU resources? Is there a gang limit for full GPU occupancy? I would like to learn about that so I can fully utilize the GPU when optimizing my code.

There are several factors which impact GPU utilization, but let’s presume these kernels can achieve 100% occupancy.

On a V100, each SM can run a maximum of 2048 concurrent threads, or 16 blocks (gangs) at 128 threads per block. There are a total of 80 SMs, so a max of 163,840 concurrent threads.

At 64x64x64, you have 262,144 total iterations. Assuming that each iteration is being handled by a single thread (which isn’t quite the case here since the compiler is using 2048 blocks each with 128 threads per block), this is more than enough iterations to fully utilize the GPU. In other words, at 64, the code is fully utilizing the GPU.
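A quick back-of-envelope check of those numbers (plain Fortran, just the arithmetic above):

```fortran
program waves
   implicit none
   integer :: iters, active
   iters  = 64*64*64      ! 262,144 loop iterations at 64^3
   active = 80*2048       ! 163,840 concurrently active threads on a V100
   print *, 'iterations     =', iters
   print *, 'active threads =', active
   print *, 'waves ~', real(iters)/real(active)   ! roughly 1.6 waves
end program waves
```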

My point is if GPU is having enough resources to launch a kernel with high NImax = NJmax=NKmax = 128, why can’t it launch two kernels of NImax = NJmax=NKmax = 64 in parallel.

Perhaps the misunderstanding is that there can be far more threads than available hardware resources, and not all threads need to be active at the same time. In this case, there are waves of active threads, so at 128 the code just runs more waves.

Only when you go even smaller, like 32, and are no longer fully using the hardware, will there be SMs available to execute a different kernel. Or, in the 64 case, SMs will start freeing up during the second wave, so you might see a bit of overlap there. Though again, these kernels are so short that the second kernel may not even have finished launching before the first gets to that point.

Is there a Gang limit for full GPU occupancy?

Not sure what you mean. The limit on the number of blocks (gangs) is 2,147,483,647 (not counting the 64K x 64K available in the y and z dimensions, which the compiler doesn’t normally use), but the compiler usually limits the gangs to 64K and then has each gang do multiple iterations of a loop. 64K blocks is plenty.

Again, for a theoretical occupancy of 100% on a V100, you need 163,840 threads, or 2048 blocks (gangs) with 128 threads per block. Though you could also have 1024 blocks of 256 threads, 512 of 512, or 256 of 1024. But with a max of 16 blocks per SM, 4096 blocks with 64 threads each would only achieve 50% occupancy. On an A100, 32 blocks can run on an SM, but each SM is still limited to a max of 2048 threads, so 4096x64 would be back to 100%.
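Those per-SM limits can be turned into a small occupancy table (V100 numbers; register and shared-memory limits ignored, as below):

```fortran
program occupancy
   implicit none
   integer, parameter :: max_threads_sm = 2048   ! V100 threads per SM
   integer, parameter :: max_blocks_sm  = 16     ! V100 blocks per SM
   integer, parameter :: sizes(5) = (/ 64, 128, 256, 512, 1024 /)
   integer :: n, bsize, resident
   do n = 1, size(sizes)
      bsize = sizes(n)
      ! resident blocks per SM: capped by both the thread and block limits
      resident = min(max_threads_sm / bsize, max_blocks_sm)
      print '(a,i5,a,f6.1,a)', 'block size ', bsize, ': ', &
            100.0*real(resident*bsize)/real(max_threads_sm), '% occupancy'
   end do
end program occupancy
```

At a block size of 64, the 16-blocks-per-SM cap bites first (16 x 64 = 1024 of 2048 threads), giving the 50% figure above; 128 and up hit 100%.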

There are other factors that affect occupancy, such as register and shared memory usage, so I’m simplifying things.

A useful tool is the CUDA Occupancy Calculator (CUDA Occupancy Calculator :: CUDA Toolkit Documentation), which can help determine the theoretical occupancy.

To see what the achieved occupancy is, you’ll want to use Nsight Compute. If the achieved occupancy is lower than the theoretical, it typically means the warps are stalling for some reason, like waiting for memory or contention on the FP units.


Thank you, this is exactly the explanation that I needed.

Yeah, I thought there would be a huge number of hardware resources available on the GPU (far greater than 64^3).

Can I do the same with Nsight Systems?

In the NImax = NJmax = NKmax = 64 case, if I use 64*64 gangs (4096 blocks) and 64 vectors, I will not have full GPU occupancy, isn’t it? So I need to add one worker level, I suppose. Something like this?

!$acc parallel loop gang num_workers(64)
    DO k = 1, NKmax    !---> NKmax = 64
!$acc loop worker
    DO j = 1, NJmax    !---> NJmax = 64
!$acc loop vector      !---> No. of vectors = NImax = 64?
    DO i = 1, NImax    !---> NImax = 64

(i is the stride-1 dimension)

No, for metrics on individual kernels, you’ll need to use Nsight-Compute.

So I need to add one worker level I suppose. Something like this?

Worker maps to the ‘y’ dimension of a thread block and ‘vector’ maps to the ‘x’ dimension. So here you’re using a 64x64 thread block size (4096 threads), which far exceeds the maximum of 1024. So num_workers should be no more than 16.

The other issue is that now there are only 64 gangs, so assuming a block size of 1024 and a max of 2048 threads per SM, you’re only using 32 of the 80 SMs.

I’d probably do something like this instead:

!$acc parallel loop gang worker collapse(2) num_workers(2)
    DO k = 1, NKmax    !---> NKmax = 64
    DO j = 1, NJmax    !---> NJmax = 64
!$acc loop vector      !---> No. of vectors = NImax = 64?
    DO i = 1, NImax    !---> NImax = 64

So now the block size is 128, ‘j’ will be strip-mined by 2, with 2048 gangs (64 x 32). Now you’re back to 100% occupancy.
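Written out in full, with an explicit vector_length added here so the 2 (worker) x 64 (vector) = 128-thread block shape is pinned down rather than left to the compiler (the loop body is just the example assignment from earlier, with nbl fixed at 1):

```fortran
	! 2048 gangs, each a 2-worker x 64-vector block: 128 threads/block.
	!$acc parallel loop gang worker collapse(2) num_workers(2) vector_length(64)
	DO k = 1, NKmax          ! NKmax = 64
	DO j = 1, NJmax          ! NJmax = 64; strip-mined by 2 across workers
		!$acc loop vector
		DO i = 1, NImax      ! NImax = 64; i is the stride-1 dimension
			Px(i,j,k,1,1) = i*j
		ENDDO
	ENDDO
	ENDDO
```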

Again, this is a very simplistic view. There are other factors which will affect performance, so this may or may not be beneficial.
