Runtime error for MPI-OpenACC code


My code (an SPH numerical code) fails with a runtime error that I find quite strange. When I run it on a desktop (NVIDIA-SMI 545.23.06 / Driver Version: 545.23.06 / CUDA Version: 12.3 / RTX 4090), there is no runtime error and the results look good. (There is one exception, which I describe further down.)

But when I run it on a cluster (NVIDIA-SMI 450.51.05 / Driver Version: 450.51.05 / CUDA Version: 11.0 / V100), it fails with this runtime error:

Failing in Thread:1
call to cuMemcpyDtoHAsync returned error 1: Invalid value

I have narrowed the error down to the following spot, the part of the MPI particle-transfer subroutine that generates the send buffer:

1 if(idXM /= MPI_PROC_NULL) then
2   !$acc parallel loop independent gang 
3   do k=1,ngr
4   !!$acc parallel loop independent gang vector private(j,npnn) ! async(k)
5   !$acc loop independent vector private(j,npnn) 
6   do i=gnp(k-1)+1, gnp(k)
7       if(xt(1,i) < xcIDmin) then
8            !$acc atomic capture
9            npsd(k) = npsd(k) + 1 ! number of moved particles in every group
10            j = npsd(k)
11            !$acc end atomic
12            np(j,k) = i
14            !$acc atomic capture
15            npn = npn + 1
16            npnn = npn
17            !$acc end atomic 
19            sbuf(ix1:ix2,npnn) = xt(:,i)
20            sbuf(iv1:iv2,npnn) = vt(:,i)
21            sbuf(ivn1:ivn2,npnn) = vn(:,i)
22            sbuf(ip,npnn) = p(i)
23            sbuf(ir,npnn) = rho(i)
24            sbuf(im,npnn) = mass(i)
25            sbuf(imu,npnn) = mu(i)
26            sbuf(irn,npnn) = rhon(i)
27            sbuf(ite,npnn) = Te(i)
28            sbuf(ikp,npnn) = kapa(i)
29            sbuf(iten,npnn)= Ten(i)
30            sbuf(imtn,npnn) = mtn(i)
31            sbuf(iYv,npnn) = Yv(i)
32            sbuf(iYvn,npnn)= Yvn(i)
33            sbuf(iden,npnn) = drop_identify(i)
34            sbuf(itp,npnn) = pType(i)
35        end if
36    end do
37    end do
38    !$acc wait

If I parallelize only the inner loop ( line 4: !!$acc parallel loop independent gang vector private(j,npnn) ), I get a runtime error on the desktop as well:

Failing in Thread:1
Accelerator Fatal Error: call to cuStreamSynchronize returned error 700: Illegal address during kernel execution
File: /home/yongcho/TEST/og_res_MPI-GPU/C_link_list.f90
Function: yx_link_list:4
Line: 60

The error seems to appear only after a certain number of iterations.
Any insight would be appreciated!

Thank you,

I also tried “-ta=tesla:managed”, as other posts recommended. On the cluster it ran for a couple of iterations and then failed with a runtime error:

Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

It looks like the error occurs when the number of particles in the send buffer is larger than 2…

How big is “sbuf”? Is it greater than 2GB? If so, try adding “-Mlarge_arrays” so objects >2GB can be used.
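As a rough way to answer the size question, the byte count of the buffer can be computed and compared against 2 GB. This is only a sketch: the dimensions nvar and nmax and the real64 element type are assumptions for illustration, not values taken from the original code.

```fortran
program sbuf_size_check
  use iso_fortran_env, only: int64, real64
  implicit none
  ! Hypothetical dimensions: nvar fields per particle, nmax buffer slots.
  integer, parameter :: nvar = 24
  integer, parameter :: nmax = 1000
  integer(int64), parameter :: two_gb = 2_int64**31
  ! Assumed element type; use whatever type sbuf has in the real code.
  real(real64), allocatable :: sbuf(:,:)
  integer(int64) :: nbytes

  allocate(sbuf(nvar, nmax))

  ! Use 64-bit integers so the byte count itself cannot overflow.
  ! storage_size returns the element size in bits, hence the division by 8.
  nbytes = size(sbuf, kind=int64) * int(storage_size(sbuf) / 8, int64)

  print '(a,i0,a)', 'sbuf occupies ', nbytes, ' bytes'
  if (nbytes > two_gb) then
     print '(a)', 'larger than 2 GB: try building with -Mlarge_arrays'
  else
     print '(a)', 'within 2 GB: -Mlarge_arrays should not be required'
  end if
end program sbuf_size_check
```

Putting the same calculation right after the real allocation of sbuf reports the actual size on each MPI rank.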

Next, I’d check whether you have any out-of-bounds accesses. To do this, I’d recommend compiling for the host (i.e. no OpenACC) and running the program through Valgrind.
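Besides Valgrind or the compiler's runtime bounds checking (for example -Mbounds in nvfortran), an explicit guard in a host build can show whether the send counter runs past the end of the buffer. The sketch below is a simplified serial stand-in for the send-buffer loop. The array names, sizes, and threshold are hypothetical and kept small so that the overflow is easy to trigger.

```fortran
program sendbuf_bounds_check
  implicit none
  ! Hypothetical sizes: 10 particles, but only 4 slots in the send buffer.
  integer, parameter :: npart = 10, nmax = 4
  real :: x(npart), sbuf(nmax)
  real :: xmin
  integer :: i, npn, noverflow

  x = [(real(i), i = 1, npart)]   ! particle positions 1.0, 2.0, ..., 10.0
  xmin = 6.5                      ! particles with x < xmin must be sent
  npn = 0                         ! running count of particles to send
  noverflow = 0                   ! particles that did not fit in sbuf

  do i = 1, npart
     if (x(i) < xmin) then
        npn = npn + 1
        if (npn <= size(sbuf)) then
           sbuf(npn) = x(i)           ! safe: index is inside the buffer
        else
           noverflow = noverflow + 1  ! would have written out of bounds
        end if
     end if
  end do

  print '(a,i0,a,i0)', 'particles to send: ', npn, ', buffer slots: ', size(sbuf)
  if (noverflow > 0) then
     print '(a,i0,a)', 'send buffer too small by ', noverflow, ' particles'
  end if
end program sendbuf_bounds_check
```

In the original loop, the equivalent checks would compare j against the first extent of np and npnn against the second extent of sbuf before the writes.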

Other possible causes would be a stack or heap overflow, but I’m not seeing anything here that would cause either.

If you can create a reproducing example, I can take a look in more detail.

I think there are some out-of-bounds accesses. Thank you for the direction!
