Not able to launch kernels asynchronously with OpenACC

I am trying to run multiple loops in parallel. I used the async clause on the compute directives, but the kernels are not launching asynchronously. I have confirmed that by looking at the nsys profile generated. I am also attaching the .qdrep file for reference. I am not able to figure out what mistake I am making. Did I use the async clause in the correct location?

Those four segments should have all started at once.

[screenshot: Nsight Systems timeline showing the four kernels launching back-to-back]

Code:

PROGRAM Test

	use declare_variables
	implicit none
	
	call CPU_TIME(start_time)
	call ALLOCATE_VARIABLES()

	! ------------------------------
	! DO WHILE (time < t_end)
	DO WHILE (iter < 3)
	
		call Compute_Prims()		
		call Compute_Conservs()
		
		!$acc wait
		
		time = time + time_step
		iter = iter + 1
		
		call CPU_TIME(end_time)
		print*, time, end_time-start_time
	ENDDO
	! ------------------------------
	
	call CPU_TIME(end_time)
	print*, 'Total wall clock time taken = ', end_time-start_time, 'secs'

END


SUBROUTINE ALLOCATE_VARIABLES()


	use declare_variables
	implicit none	
	
	nblocks = 1
	ALLOCATE(NI(nblocks))
	ALLOCATE(NJ(nblocks))
	ALLOCATE(NK(nblocks))
	
	NI = 64
	NJ = 64
	NK = 64
	nprims = 5
	nconserv = 5
	time_Step = 0.001d0
	t_end = 0.5d0
	
	NImax = MAXVAL(NI)
	NJmax = MAXVAL(NJ)
	NKmax = MAXVAL(NK)
	
	ALLOCATE(Px(NImax,NJmax,NKmax,nblocks,nprims))
	ALLOCATE(Py(NImax,NJmax,NKmax,nblocks,nprims))
	ALLOCATE(Pz(NImax,NJmax,NKmax,nblocks,nprims))
	
	ALLOCATE(Cx(NImax,NJmax,NKmax,nblocks,nconserv))
	ALLOCATE(Cy(NImax,NJmax,NKmax,nblocks,nconserv))
	ALLOCATE(Cz(NImax,NJmax,NKmax,nblocks,nconserv))
	
!$acc enter data copyin(NI,NJ,NK), create(Px,Py,Pz,Cx,Cy,Cz)

END


SUBROUTINE Compute_Prims()


	use declare_variables
	implicit none
    
	integer :: queue
	
	!$acc parallel loop gang vector collapse(4) async(1)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,1) = i*j
			
			Px(i,j,k,nbl,2) = j*k
			
			Px(i,j,k,nbl,3) = k*i
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO
	
	!$acc parallel loop gang vector collapse(4) async(2)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			
			Px(i,j,k,nbl,4) = i*i
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO
	
	!$acc parallel loop gang vector collapse(4) async(3)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,5) = j*j
		
		endif
    ENDDO
    ENDDO
    ENDDO
    ENDDO

END


SUBROUTINE Compute_Conservs()


	use declare_variables
	implicit none
	
	integer :: queue
	
	!$acc parallel loop gang vector collapse(4) async(4)
	DO nbl = 1,nblocks
    DO k = 1, NKmax
    DO j = 1, NJmax
    DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
		Cx(i,j,k,nbl,1) = i*j
		                  
		Cx(i,j,k,nbl,2) = j*k
		                  
		Cx(i,j,k,nbl,3) = k*i
		                  
		Cx(i,j,k,nbl,4) = i*i
		                  
		Cx(i,j,k,nbl,5) = j*j
		endif
	ENDDO
	ENDDO
	ENDDO
	ENDDO

END

I am also attaching the full code (both test_Async.f90 and Module.f90 should be run). Thanks.

Module.f90 (745 Bytes)
test_async.f90 (2.8 KB)
Makefile (514 Bytes)
test_async1.qdrep (382.2 KB)

They look OK to me. Though keep in mind that if a kernel is using all the resources on the device, the next kernel can’t start until the first one begins to free up resources. Also, these loops are doing almost no work, so it could be taking longer to launch the kernels than for the kernels to execute.

I normally would not recommend this, but just to illustrate, add “num_gangs(16)” to each of your compute regions so that each kernel uses only about one SM and takes longer to finish. In that case, you’ll see the kernels overlap in your profile.
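For instance, the first loop in Compute_Prims would become something like the following (the clause is there purely to throttle the kernel for the experiment, not as a recommendation):

```fortran
	! Experiment only: num_gangs(16) caps the kernel at 16 gangs
	! (~1 SM at 128 threads per gang) so it runs long enough to overlap.
	!$acc parallel loop gang vector collapse(4) num_gangs(16) async(1)
	DO nbl = 1,nblocks
	DO k = 1, NKmax
	DO j = 1, NJmax
	DO i = 1, NImax
		if (k.le.NK(nbl).and.j.le.NJ(nbl).and.i.le.NI(nbl)) then
			Px(i,j,k,nbl,1) = i*j
		endif
	ENDDO
	ENDDO
	ENDDO
	ENDDO
```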

In general, ‘async’ is best used:

  • to overlap data movement and compute, with multiple queues: one for data and one for compute,
  • to hide kernel launch latency (using just ‘async’ with no queue id), so the host can launch more kernels while another is running. Though this only helps if the kernel takes longer to execute than it takes to launch.
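The first pattern usually looks something like this sketch (the array A and the names n, nchunks, c, q are illustrative, not from the code above; A is assumed already created on the device). Each chunk’s transfer and compute go on the same queue, so they stay ordered relative to each other, while alternating between two queues lets chunk c+1’s copy overlap chunk c’s compute:

```fortran
do c = 1, nchunks
   q = mod(c, 2) + 1                 ! re-use two queues, alternating
   !$acc update device(A(:,c)) async(q)
   !$acc parallel loop async(q)      ! same queue: compute follows its copy
   do i = 1, n
      A(i,c) = 2.0d0 * A(i,c)        ! placeholder work
   end do
end do
!$acc wait                           ! join both queues before using results
```

This also illustrates the queue re-use point below: two queues are created once and cycled, rather than one per chunk.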

The only time running concurrent kernels on multiple async queues really works well is when the number of gangs is relatively small and each kernel under-utilizes the device; adding more concurrent kernels can then use the idle resources. But it’s typically better to have each kernel fully utilize the GPU on its own, and to only use this method when the algorithm doesn’t allow for it.

Also consider that there is overhead in creating async queues (CUDA streams), so you don’t want to use too many, and you want to re-use them.

I am really surprised to note that while a kernel can launch with NImax = NJmax = NKmax = 128 (or 256), two separate compute regions with NImax = NJmax = NKmax = 64 each can’t be launched asynchronously. My point is: if the GPU has enough resources to launch one kernel with NImax = NJmax = NKmax = 128, why can’t it launch two kernels of NImax = NJmax = NKmax = 64 in parallel? Also, how do I assess whether a kernel I have launched is fully utilizing the GPU resources? Is there a gang limit for full GPU occupancy? I would like to learn about that so I can fully utilize the GPU when optimizing my code.

There are several factors which impact GPU utilization, but let’s presume these kernels can achieve 100% occupancy.

On a V100, each SM can run a maximum of 2048 concurrent threads, or 16 blocks (gangs) at 128 threads per block. There are a total of 80 SMs, so a max of 163,840 concurrent threads.

At 64x64x64, you have 262,144 total iterations. Assuming that each iteration is being handled by a single thread (which isn’t quite the case here since the compiler is using 2048 blocks each with 128 threads per block), this is more than enough iterations to fully utilize the GPU. In other words, at 64, the code is fully utilizing the GPU.
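A quick back-of-envelope check of those numbers (plain Fortran, just the arithmetic above):

```fortran
program waves
   implicit none
   integer :: iters, active
   iters  = 64*64*64      ! 262,144 loop iterations at 64^3
   active = 80*2048       ! 163,840 concurrently active threads on a V100
   print *, 'iterations     =', iters
   print *, 'active threads =', active
   print *, 'waves ~', real(iters)/real(active)   ! roughly 1.6 waves
end program waves
```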

My point is if GPU is having enough resources to launch a kernel with high NImax = NJmax=NKmax = 128, why can’t it launch two kernels of NImax = NJmax=NKmax = 64 in parallel.

Perhaps the misunderstanding is that there can be far more threads than available hardware resources, and not all threads need to be active at the same time. In this case, there are waves of active threads, so at 128 the code just runs more waves.

Only when you go even smaller, like 32, and are no longer fully using the hardware, will there be SMs available to execute a different kernel. Or, in the 64 case, SMs will start freeing up during the second wave, so you might see a bit of overlap there. Though again, these kernels are so short that the second kernel may not even have finished launching before the first gets to that point.

Is there a Gang limit for full GPU occupancy?

Not sure what you mean. The limit on the number of blocks (gangs) is 2,147,483,647 (not counting the 64K x 64K available in the y and z dimensions, which the compiler doesn’t normally use), but the compiler usually limits the gangs to 64K and then has each gang do multiple iterations of a loop. 64K blocks is plenty.

Again, for a theoretical occupancy of 100% on a V100, you need 163,840 threads, or 2048 blocks (gangs) with 128 threads per block. Though you could also have 1024 blocks of 256 threads, 512 of 512, or 256 of 1024. But with a max of 16 blocks per SM, 4096 blocks with 64 threads each would only achieve 50% occupancy. On an A100, 32 blocks can run on an SM, but each SM is still limited to a max of 2048 threads, so 4096x64 would be back to 100%.
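Those per-SM limits can be turned into a small occupancy table (V100 numbers; register and shared-memory limits ignored, as below):

```fortran
program occupancy
   implicit none
   integer, parameter :: max_threads_sm = 2048   ! V100 threads per SM
   integer, parameter :: max_blocks_sm  = 16     ! V100 blocks per SM
   integer, parameter :: sizes(5) = (/ 64, 128, 256, 512, 1024 /)
   integer :: n, bsize, resident
   do n = 1, size(sizes)
      bsize = sizes(n)
      ! resident blocks per SM: capped by both the thread and block limits
      resident = min(max_threads_sm / bsize, max_blocks_sm)
      print '(a,i5,a,f6.1,a)', 'block size ', bsize, ': ', &
            100.0*real(resident*bsize)/real(max_threads_sm), '% occupancy'
   end do
end program occupancy
```

At a block size of 64, the 16-blocks-per-SM cap bites first (16 x 64 = 1024 of 2048 threads), giving the 50% figure above; 128 and up hit 100%.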

There are other factors that affect occupancy, such as register and shared memory usage, so I’m simplifying things.

A useful tool is the CUDA Occupancy Calculator (CUDA Occupancy Calculator :: CUDA Toolkit Documentation), which can help determine the theoretical occupancy.

To see what the achieved occupancy is, you’ll want to use Nsight Compute. If the achieved occupancy is lower than the theoretical, it typically means the warps are stalling for some reason, like waiting for memory or contention on the FP units.


Thank you, this is exactly the explanation that I needed.

Yeah, I thought there would be a huge number of hardware resources available on the GPU (far greater than 64^3).

Can I do the same with Nsight Systems?

In the NImax = NJmax = NKmax = 64 case, if I use 64*64 gangs (4096 blocks) and 64 vectors, I will not have full GPU occupancy, isn’t it? So I need to add one worker level, I suppose. Something like this?

!$acc parallel loop gang num_workers(64)
    DO k = 1, NKmax    !---> NKmax = 64
!$acc loop worker
    DO j = 1, NJmax    !---> NJmax = 64
!$acc loop vector      !---> No. of vectors = NImax = 64?
    DO i = 1, NImax    !---> NImax = 64

(i is the stride-1 dimension)

No, for metrics on individual kernels, you’ll need to use Nsight-Compute.

So I need to add one worker level I suppose. Something like this?

Worker maps to the ‘y’ dimension of a thread block and ‘vector’ maps to the ‘x’ dimension. So here you’re using a 64x64 thread block size (4096 threads), which far exceeds the maximum of 1024. So num_workers should be no more than 16.

The other issue is that now there are only 64 gangs, so assuming a block size of 1024 and a max of 2048 threads per SM, you’re only using 32 of the 80 SMs.

I’d probably do something like this instead:

!$acc parallel loop gang worker collapse(2) num_workers(2)
    DO k = 1, NKmax    !---> NKmax = 64
    DO j = 1, NJmax    !---> NJmax = 64
!$acc loop vector      !---> No. of vectors = NImax = 64?
    DO i = 1, NImax    !---> NImax = 64

So now the block size is 128, ‘j’ will be strip-mined by 2, with 2048 gangs (64 x 32). Now you’re back to 100% occupancy.
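Written out in full, with an explicit vector_length added here so the 2 (worker) x 64 (vector) = 128-thread block shape is pinned down rather than left to the compiler (the loop body is just the example assignment from earlier, with nbl fixed at 1):

```fortran
	! 2048 gangs, each a 2-worker x 64-vector block: 128 threads/block.
	!$acc parallel loop gang worker collapse(2) num_workers(2) vector_length(64)
	DO k = 1, NKmax          ! NKmax = 64
	DO j = 1, NJmax          ! NJmax = 64; strip-mined by 2 across workers
		!$acc loop vector
		DO i = 1, NImax      ! NImax = 64; i is the stride-1 dimension
			Px(i,j,k,1,1) = i*j
		ENDDO
	ENDDO
	ENDDO
```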

Again, this is a very simplistic view. There are other factors which will affect performance, so this may or may not be beneficial.
