NVFORTRAN-F-0000-Internal compiler error. gen_llvm_expr(): no incoming ili 0

Hello,
I need to use multiple GPUs to calculate my simulation, so I am trying to do it with OpenMP and OpenACC. However, I get the error "NVFORTRAN-F-0000-Internal compiler error. gen_llvm_expr(): no incoming ili 0" when the "!$omp" directives are commented out, as in the code below. The code compiles fine when the directives are enabled.

Here is the relevant part of the original code:

PROGRAM MAIN
	USE CONSTANTPARAMETERS
	USE ELECTROMAGNETIC_VARIABLES
	USE RES_MODEL_PARAMETER
	USE TIME_PARAMETERS
	use openacc
	use omp_lib
	use cudafor
	IMPLICIT none
	
	INTEGER(KIND = 4) :: numblocks, tid, istart, iend, k, j, i, N, blocks
	INTEGER(KIND = 4) :: istat   ! needed for cudaGetDeviceCount under IMPLICIT NONE
	REAL(KIND=8) :: a(3), b, c, d, f(200,200,200)
	
	N = 0
	a = 0
	c = 0
	d = 1
	f=0
	istat = cudaGetDeviceCount( numblocks )
	call acc_init(acc_device_nvidia)
	write(*,*) numblocks

	! numblocks = acc_get_num_devices(acc_device_nvidia)
	! print*, "numblocs =", numblocks
	
	!!$omp parallel do num_threads(numblocks) private(b,c,N,tid)
	do blocks = 0, 1
		b = d + blocks

		! istat = cudaSetDevice(0)
		
		tid = omp_get_thread_num()
		call acc_set_device_num(1,acc_device_nvidia)
		!$acc data create(f) copyin(tid) copy(N)
		
		!$acc kernels loop
		do k = 1, 3
			N = N + tid
		enddo
		!$acc end kernels
		
		! !$acc kernels
		! 	c = a(1)+a(2)+a(3)
		! !$acc end kernels
		!!$acc update host(N)

		!$acc end data
		c = N + b

		write(*,*)"The device is :",tid,"The Result is :",blocks,b,N,c
	enddo
	!!$omp end parallel do

	stop
end program main

Hi SkyCool,

This is a generic compiler error that occurs when generating the intermediate code. Can you post the full code (i.e., the modules) so I can try to reproduce the error and report it?

Also, what compiler version are you using and what flags are you using to compile?

-Mat


I experimented a bit and found that the modules aren’t actually used, so I was able to compile the code after commenting them out.

I was able to recreate the error with our 24.3 release, but it seems to have been fixed in 24.5. I do see a report for a similar error that was fixed in 24.5, so your issue is likely related.

Please update your compiler version to 24.5 or later. Our current version is 24.11.

-Mat

Hi Mat,
The following code also generates the same error.
I have NVFORTRAN 23.3 and CUDA 12.6.
The compiler flags in the makefile are: -mp -fast -acc -gpu=cc70 -cudalib=cusolver -cudalib=cusparse -cuda -Minfo=all -c

PROGRAM MAIN
! USE CONSTANTPARAMETERS
! USE ELECTROMAGNETIC_VARIABLES
! USE RES_MODEL_PARAMETER
! USE TIME_PARAMETERS
use openacc
use omp_lib
use cudafor
IMPLICIT none

INTEGER(KIND = 4) :: numblocks, tid, istart, iend, k, j, i, N, blocks
REAL(KIND=8) :: a(3), b, c, d,f(200,200,200)
INTEGER(KIND = 4) :: istat
N = 0
a = 0
c = 0
d = 1
f=0
istat = cudaGetDeviceCount( numblocks )
! call acc_init(acc_device_nvidia)
write(*,*) numblocks

! numblocks = acc_get_num_devices(acc_device_nvidia)
! print*, "numblocs =", numblocks

!!$omp parallel do num_threads(numblocks) private(b,c,N,tid)
do blocks = 1, 10
	b = d + blocks

	istat = cudaSetDevice(0)
	
	! tid = omp_get_thread_num()
	! call acc_set_device_num(1,acc_device_nvidia)
	!$acc data create(f) copyin(tid) copy(N)
	
	!$acc kernels loop 
	do k = 1, 3
		N = N + tid
	enddo
	!$acc end kernels
	
	!!$acc end data
	c = N + b

	!$acc kernels 
		f(1,1,1) = 100
	!$acc end kernels
	!$acc end data	
	write(*,*) f(1,1,1)
	write(*,*)"The device is :",tid,"The Result is :",blocks,b,N,c


	b = d + blocks

	istat = cudaSetDevice(1)
	
	!$acc data create(f) copyin(tid) copy(N)
	
	!$acc kernels loop
	do k = 1, 3
		N = N + tid
	enddo
	!$acc end kernels

	!$acc end data
	c = N + b

	write(*,*)"The device is :",tid,"The Result is :",blocks,b,N,c

enddo
!!$omp end parallel do

stop

end program main


Yes, sorry, I figured out that the modules aren’t actually used after I made the first post. See my second post where I found that the error was fixed in our 24.5 release.

Hi Mat,

Thanks for the really prompt reply.
I will try to update my compiler. In the meantime, I am struggling with how to use multi-GPU computing. The program has roughly four stages:

  1. Build the variables on the host.
  2. Calculate on GPU_0, then update the results from GPU_0 to the host.
  3. Calculate on GPU_0, GPU_1, and GPU_2, then update the results from all GPUs to the host.
  4. Calculate the results on the host.

In the third stage, I need to use "data/end data" and "enter data/exit data". I don't know how I should copyin the variables I need. Do I need to use a "do/enddo" loop, as follows?

	!$omp parallel do num_threads(num_nvidia) 
		DO point=1,2
			tid = omp_get_thread_num()
			call acc_set_device_num(tid,acc_device_nvidia)
		
			!$acc enter data copyin(CDELX,CDELY,CDELZ,delt,SIGMA_Jac_Coord,SIGMA_Jac,CCSIG_Jac,Is_EX_In_Source,Is_EY_In_Source,source)&
			!$acc      copyin(EX_1,EY_1,EZ_1,EX_V_1,EY_V_1,EZ_V_1)&
			!$acc	   copyin(Forward_fields,Jacobian_Tran)&
			!$acc      copyin(Timegate_Rec,Field_Orig_Pos,Field_Coord,Points_Observer,Timegate_Para,Data_Rec,Comput_Logic)&
			!$acc      copyin(Field_Rec,Data_Calcuated,CCSIG,CA2,CB,Ctime_Cutoff)&
			!$acc      copyin(Data_Measured,Dataerr_Variance,Field_Idx_Rec,Ratio_Rec,Field_Idx1_Rec,Field_False_Rec,Comput_False_Logic)&	
			!$acc      create(Field_Storage,Field_Storage_Idxnum,Field_Storage_Num)&
			!$acc 	   copyin(A_EX,B_EX,C_EX,D_EX,A_EY,B_EY,C_EY,D_EY,A_EZ,B_EZ,C_EZ,D_EZ)&
			!$acc	   copyin(A_EX_1,B_EX_1,C_EX_1,D_EX_1,A_EY_1,B_EY_1,C_EY_1,D_EY_1,A_EZ_1,B_EZ_1,C_EZ_1,D_EZ_1)&
			!$acc      create(C_EX_1C,C_EY_1C,C_EZ_1C,C_EXC,C_EYC,C_EZC)&
			!$acc      create(buffer_EX_1,buffer_EY_1,buffer_EZ_1,buffer_EX,buffer_EY,buffer_EZ)
		ENDDO
		!$omp end parallel do

and

!$omp parallel do num_threads(num_nvidia) 
DO point=1,Point_Num
	tid = omp_get_thread_num()
	call acc_set_device_num(tid,acc_device_nvidia)
	!$acc exit data delete(CDELX,CDELY,CDELZ,delt(1:NSTOP),SIGMA_Jac_Coord,SIGMA_Jac,CCSIG_Jac,Is_EX_In_Source,Is_EY_In_Source,source)&
	!$acc      delete(EX,EY,EZ,HX,HY,HZ,EX_1,EY_1,EZ_1)&
	!$acc      delete(Timegate_Rec,Field_Orig_Pos,Field_Coord,Points_Observer,Timegate_Para,Data_Rec,Comput_Logic(1:NSTOP))&
	!$acc      delete(Field_Rec,Data_Calcuated,CCSIG,CA2,CB,Ctime_Cutoff)&
	!$acc      delete(EX_V,EY_V,EZ_V,HX_V,HY_V,HZ_V,EX_V_1,EY_V_1,EZ_V_1)&
	!$acc      delete(Data_Measured,Dataerr_Variance,Field_Idx_Rec,Ratio_Rec,Field_Idx1_Rec,Field_False_Rec,Comput_False_Logic(1:NSTOP))&
	!$acc      delete(Field_Storage,Field_Storage_idxnum,Field_Storage_Num)&
	!$acc      delete(A_EX,B_EX,C_EX,D_EX,A_EY,B_EY,C_EY,D_EY,A_EZ,B_EZ,C_EZ,D_EZ)&
	!$acc	   delete(A_EX_1,B_EX_1,C_EX_1,D_EX_1,A_EY_1,B_EY_1,C_EY_1,D_EY_1,A_EZ_1,B_EZ_1,C_EZ_1,D_EZ_1)
ENDDO
!$omp end parallel do

Do you have a manual or some examples that show how to use multiple GPUs with OpenMP and OpenACC or CUDA Fortran?

Personally, I always recommend MPI+OpenACC for multi-GPU programming. You then have a one-to-one relationship between a rank and a GPU. When using host threads, it becomes a many-to-many relationship, so getting the memory right between all the threads and GPUs is a challenge. You end up doing domain decomposition, which isn't very natural in OpenMP but is in MPI. I figure that if I'm going to do all the extra work for domain decomposition, I might as well just use MPI and get the advantage of being able to scale to an arbitrary number of nodes and GPUs.

Plus, with MPI you can use CUDA-aware MPI, so data is transferred directly between the GPUs and doesn't need to be staged through the host.
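To make that pattern concrete, here's a minimal sketch. This is not your code: the buffer names, sizes, the ring-neighbor logic, and the MPI_Comm_split_type trick for getting a node-local rank are all made up for illustration. Each rank binds itself to one GPU based on its node-local rank, and the host_data region hands the device addresses to MPI so a CUDA-aware MPI can move the data GPU-to-GPU:

program mpi_acc_sketch
   use mpi
   use openacc
   implicit none
   integer, parameter :: n = 1000
   integer :: ierr, rank, nprocs, local_comm, local_rank, ngpus
   integer :: left, right
   real(kind=8) :: sendbuf(n), recvbuf(n)

   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

   ! Node-local rank, so each rank on a node binds to a different GPU
   ! (assumes at least one GPU per node)
   call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                            MPI_INFO_NULL, local_comm, ierr)
   call MPI_Comm_rank(local_comm, local_rank, ierr)
   ngpus = acc_get_num_devices(acc_device_nvidia)
   call acc_set_device_num(mod(local_rank, ngpus), acc_device_nvidia)

   ! Simple ring exchange with the left/right neighbors
   left    = mod(rank - 1 + nprocs, nprocs)
   right   = mod(rank + 1, nprocs)
   sendbuf = real(rank, kind=8)
   recvbuf = 0.0d0

   !$acc data copyin(sendbuf) copyout(recvbuf)
   ! host_data passes the *device* addresses to MPI; with a CUDA-aware MPI
   ! the transfer goes GPU-to-GPU without being staged through the host
   !$acc host_data use_device(sendbuf, recvbuf)
   call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                     recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
   !$acc end host_data
   !$acc end data

   write(*,*) "rank", rank, "received", recvbuf(1), "from rank", left
   call MPI_Finalize(ierr)
end program mpi_acc_sketch

The MPI_Comm_split_type call is just one way to get a node-local rank; launcher environment variables work as well.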

I know of folks who’ve used host threads, and I’ve used them myself (though it’s been over 10 years), but in those cases they typically limit themselves to a few GPUs.

For MPI+OpenACC, there are lots of examples and tutorials out there. Just do a web search.

I wrote this one a while ago and it needs updating, but it gives the basic idea: Using OpenACC with MPI Tutorial

Ron Caplan’s POT3D code recently switched from OpenACC to pure Fortran standard language parallelism (DO CONCURRENT), and he has a nice set of test codes which are good examples.

I’m one of the lead developers of the SPEChpc 2021 benchmark suite, which uses various models: pure MPI, MPI+OpenACC, MPI+OpenMP (host), and MPI+OpenMP offload. It’s licensed, so it’s not directly downloadable, but academic institutions can apply for a free license; there is a fee for commercial use.

In the third stage, I need to use "data/end data" and "enter data/exit data". I don't know how I should copyin the variables I need. Do I need to use a "do/enddo" loop, as follows?

To actually answer your question: yes, you need to add these types of loops, where you index over the number of GPUs and reset which device you're using each time. Each device needs its own copy of all the shared data, plus its own individual copy of any decomposed data. You also need to take care when copying data back to the host, since you must make sure you aren't overwriting one GPU's results with another's. Hence the data should be decomposed so that each GPU only works on its own portion.
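As a rough sketch of that structure (this is not your code; the array name, the sizes, and the simple slab decomposition over the k dimension are made up for illustration), each thread binds to its own GPU, creates only the slab it owns on that device, and updates only that slab back to the host:

program host_thread_sketch
   use openacc
   use omp_lib
   implicit none
   integer, parameter :: nx = 200, ny = 200, nz = 200
   integer :: ngpus, dev, kstart, kend, nz_per_gpu, i, j, k
   real(kind=8) :: f(nx,ny,nz)

   f = 0.0d0
   ngpus = acc_get_num_devices(acc_device_nvidia)
   nz_per_gpu = nz / ngpus      ! assumes nz divides evenly, just for brevity

   !$omp parallel do num_threads(ngpus) private(kstart,kend,i,j,k)
   do dev = 0, ngpus - 1
      call acc_set_device_num(dev, acc_device_nvidia)
      kstart = dev*nz_per_gpu + 1
      kend   = kstart + nz_per_gpu - 1

      ! Each device gets only the slab it owns
      !$acc enter data copyin(f(:,:,kstart:kend))

      !$acc parallel loop collapse(3) present(f(:,:,kstart:kend))
      do k = kstart, kend
         do j = 1, ny
            do i = 1, nx
               f(i,j,k) = f(i,j,k) + 1.0d0
            enddo
         enddo
      enddo

      ! Copy back only this device's slab so the GPUs don't overwrite
      ! each other's results on the host
      !$acc update host(f(:,:,kstart:kend))
      !$acc exit data delete(f(:,:,kstart:kend))
   enddo
   !$omp end parallel do

   write(*,*) "f(1,1,1), f(1,1,nz) =", f(1,1,1), f(1,1,nz)
end program host_thread_sketch

Anything every device needs in full (like your long copyin lists) would be copied whole inside the same loop, so each device ends up with its own complete copy; only the decomposed arrays are sliced per device.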

Again, I encourage you to look at MPI+OpenACC, but if you do want to continue down the host-thread route, I'll do my best to help.

Thank you very much for your advice. I will try MPI + OpenACC.
Wish you a Merry Christmas!