From PGI 19.10 community edition to NVIDIA HPC 21.2: "call to cuStreamSynchronize returned error 700: Illegal address during kernel execution"

Let me describe the problem I ran into.

The Fortran code reads as follows:

!$acc parallel loop copyin(wav, lev, chunk, MKKL, zero, two) creat(denf) copyout(denf) private(MK,ip,MKj,ib, &
!$acc & ik1,ik2,itx, it1,nu,i0,vv, i)
do MK = 1, MKT*2

!--- Index
    ip  = (MK-1)/MKT;                       MKj = MK - ip*MKT
    ib  = MKKL(MKj,ip)%ib
    ik1 = MKKL(MKj,ip)%k1;                  ik2 = MKKL(MKj,ip)%k2

    !--- Initialize the non-local densities
    do itx = 1, IBX
        
        denf(itx,MK)%sgg    = zero;             denf(itx,MK)%sgf    = zero
        denf(itx,MK)%sff    = zero;             denf(itx,MK)%sfg    = zero

        do nu = 1, chunk(itx)%md(ib)

        !--- index of the single-particle states in structures "lev" and "wav"
            i0  = chunk(itx)%ma(ib) + nu
            vv  = lev(itx)%vv(i0)*lev(itx)%mu(i0)                !--- occupation number
            
        !--- contributions to the non-local densities from k1 & k2 blocks of the DWS base
            do i = 1, MSD
                denf(itx,MK)%sgg(:,i) = denf(itx,MK)%sgg(:,i) + wav(i0,itx)%XG(:,ik1)*wav(i0,itx)%XG(i,ik2)*vv
                denf(itx,MK)%sgf(:,i) = denf(itx,MK)%sgf(:,i) + wav(i0,itx)%XG(:,ik1)*wav(i0,itx)%XF(i,ik2)*vv
                denf(itx,MK)%sfg(:,i) = denf(itx,MK)%sfg(:,i) + wav(i0,itx)%XF(:,ik1)*wav(i0,itx)%XG(i,ik2)*vv
                denf(itx,MK)%sff(:,i) = denf(itx,MK)%sff(:,i) + wav(i0,itx)%XF(:,ik1)*wav(i0,itx)%XF(i,ik2)*vv
            end do
        end do
    end do
    
!--- Non-local density for isovector channels
    do itx = 1, IBX
        it1 = 3 - itx
        denf(itx,MK)%vgg = denf(itx,MK)%sgg + two*denf(it1,MK)%sgg
        denf(itx,MK)%vgf = denf(itx,MK)%sgf + two*denf(it1,MK)%sgf
        denf(itx,MK)%vfg = denf(itx,MK)%sfg + two*denf(it1,MK)%sfg
        denf(itx,MK)%vff = denf(itx,MK)%sff + two*denf(it1,MK)%sff
    end do
end do          !--- End loop over 2*MKT

!$acc end parallel loop

When I compile the code with HPC 21.2, the compile command line reads as:
nvfortran -O4 -mcmodel=medium -mp=allcores -mp=bind -acc=gpu -gpu=cc70 -Minfo=accel -o

“denf” is a derived-type variable larger than 2 GB; the following is its definition:
TYPE DENSITF
    DOUBLE PRECISION, DIMENSION(MSD, MSD) :: sgg, sff, sgf, sfg
    DOUBLE PRECISION, DIMENSION(MSD, MSD) :: vgg, vff, vgf, vfg
END TYPE DENSITF
TYPE (DENSITF), DIMENSION(IBX, MKT*2) :: denf
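
For reference, a rough estimate of its size (just a sketch; the parameter values below are placeholders, not the real ones from my module):

program denf_size
    implicit none
    ! Placeholder values; the real MSD, IBX, and MKT come from my module.
    integer, parameter :: MSD = 200, IBX = 2, MKT = 400
    integer(kind=8)    :: nbytes
    ! Each DENSITF element holds 8 arrays of MSD*MSD double-precision
    ! (8-byte) values, and denf has IBX*(2*MKT) elements in total.
    nbytes = int(IBX,8) * int(2*MKT,8) * 8_8 * int(MSD,8)**2 * 8_8
    print '(A,F6.2,A)', ' denf is about ', real(nbytes)/2.0**30, ' GB'
end program denf_size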

Previously, I compiled the code with PGI Community Edition 19.10, and it worked well.

When I compile with NVIDIA HPC 21.2, it gives these warning messages:
invalid tag
!51 = !DIBasicType(tag: DW_TAG_string_type, name: "character", size: 64, align: 8, encoding: DW_ATE_signed)
invalid tag
!85 = !DIBasicType(tag: DW_TAG_string_type, name: "character", size: 80, align: 8, encoding: DW_ATE_signed)

Then, when I run the code, it produces the error message “call to cuStreamSynchronize returned error 700: Illegal address during kernel execution”.

I have no idea what these warning messages mean. The variable denf is defined in a module, before “Contains”.

Hi longwh,

Would you be able to provide a reproducing example?

Since the code worked with 19.10, it’s likely a compiler issue with 21.2. But without being able to reproduce the issue, it’s difficult to tell.

The warning is suspicious, but I haven’t seen this one before, so I don’t know its meaning.

My one thought is that it’s an integer overflow issue. For your index variables, are these declared as just “integer” or are you making them “integer(kind=8)”? If they’re plain integer, try adding the flag “-i8” to promote the default integer kind to 8 bytes.
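
For example (just a sketch, using your index names), either declare the index variables explicitly as 64-bit, or promote the default kind at compile time:

! Explicit 64-bit integers, so offsets computed from them can index
! past the 2 GB boundary of 32-bit signed integers.
integer, parameter :: i8 = selected_int_kind(18)   ! a 64-bit integer kind
integer(kind=i8)   :: MK, ip, MKj, ib, ik1, ik2, itx, it1, nu, i0, i

! Alternatively, keep "integer" in the source and compile with
!   nvfortran -i8 -mcmodel=medium -acc=gpu -gpu=cc70 ...
! so the default integer kind is promoted to 8 bytes everywhere.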

-Mat

Mat, thanks a lot.

I did some more tests by printing some numbers at, for example, the beginning, middle, and end of the do loop in the accelerated region. I can see the printed results, but the error messages still appear.

I guess it might be because the variable denf is too big (> 2 GB). Therefore, I did another test by changing the code as follows:

!$OMP PARALLEL DO private(id, MK, MKb, MKe)
do id = 0, NGP

!--- Get the OpenMP thread and set the GPU card
    Call ACC_SET_DEVICE_NUM(IDV(id),acc_device_nvidia)
    MKb = AMP%MKb(id)
    MKe = AMP%MKe(id)
!$acc parallel loop copyin(wav, lev, chunk, MKKL, zero, two, MKb, MKe) copyout( denf(1:IBX,MKb:MKe) )  &
!$acc                  & private(MK,ip,MKj,ib,ik1,ik2,itx, it1,nu,i0,vv, i)
    do MK = MKb, MKe


end do          !--- End loop over MK = MKb, MKe
!$acc end parallel loop
end do
!$OMP END PARALLEL Do

It shows the following errors:
Failing in Thread:7
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

Failing in Thread:5
call to cuStreamSynchronize returned error 4: Deinitialized

When I compile the same code with PGI 19.10 Community Edition, it works perfectly.

One more thing to notice:

For PGI 19.10 Community Edition, the CUDA version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130

For NVIDIA HPC 21.2, the CUDA version is:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Nov_30_19:08:53_PST_2020
Cuda compilation tools, release 11.2, V11.2.67
Build cuda_11.2.r11.2/compiler.29373293_0

When I installed PGI 19.10, it seems that CUDA was installed automatically alongside it.

Besides the difference in the Fortran compiler, could the difference in CUDA be the origin of the problem?

Is there any additional requirement for using “derived type” variables?

I also tried changing the integers to integer(kind=8); the part I showed above works with NVIDIA HPC 20.7, but it does not work with the newest version, NVIDIA HPC 21.2.

Thanks, Longwh. Are you able to put together a reproducing example that you can send us?

I guess it might be because the variable denf is too big (> 2 GB).

Possibly, but you do have the “-mcmodel=medium” flag enabled, which has the compiler use 64-bit offsets for addresses. This is why I wanted you to try “-i8”, since sometimes the indices need to be promoted as well.

An illegal address error is very generic. It’s similar to a seg fault on the host, where the runtime encounters a bad address. It could be an out-of-bounds access, a bad pointer being passed to the device, accessing a large array while using the default 32-bit offsets, a heap overflow, a stack overflow, an alignment error, etc.

I’m not as concerned about the DW_TAG warning since it’s about some DWARF debugging information. I’d still like our engineers to take a look once you can get us an example, but I’m doubtful it’s causing the illegal address error.

When I installed PGI 19.10, it seems that CUDA was installed automatically alongside it.

There’s no change here other than shipping newer versions of CUDA and that we now have two download packages: one includes just the latest version of CUDA, and a second includes the last three. Since the CUDA versions are large, we wanted to do something to help reduce the size of the packages. If you need an older CUDA, you’ll want to download the Multi-CUDA package.

Besides the difference in the Fortran compiler, could the difference in CUDA be the origin of the problem?

Possible, but unlikely. More likely it’s a problem with the Fortran compiler.

Is there any additional requirement for using “derived type” variables?

Yes. By default, variables in data clauses are shallow copied. For aggregate types with dynamic data members, you need to perform a deep copy, either manually or by having the Fortran compiler do it for you via the “-gpu=deepcopy” flag. When building with 19.10, were you using the “-ta=tesla:deepcopy” flag?
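
As a rough sketch of the manual approach (the type, member names, and NB here are invented for illustration, not taken from your code):

! Hypothetical aggregate type with dynamic (allocatable) members.
type :: wav_t
    double precision, allocatable :: XG(:,:), XF(:,:)
end type wav_t
type(wav_t) :: wav(NB)
integer     :: ib

! A plain copyin(wav) is a shallow copy: only the array of descriptors
! reaches the device; the XG/XF payloads are not attached, and
! dereferencing them in a kernel gives an illegal address.

! Manual deep copy: copy the parent first, then each dynamic member,
! which attaches device copies of the data to the device descriptors.
!$acc enter data copyin(wav)
do ib = 1, NB
!$acc enter data copyin(wav(ib)%XG, wav(ib)%XF)
end do

With “-gpu=deepcopy” (or “-ta=tesla:deepcopy” in 19.10), the compiler generates this attach logic for you, so a plain copyin(wav) behaves as a full deep copy.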

I just noticed this:
!$acc parallel loop copyin(wav, lev, chunk, MKKL, zero, two) creat(denf) copyout(denf)

You are missing the “e” on “create”. Is this just a typo in your post? It should give a syntax error. Also, a variable should only be put into one data clause; here you have “denf” in both a create and a copyout clause.
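
In other words, something like this (a sketch of the corrected directive) should be sufficient, since copyout already allocates denf on the device:

!$acc parallel loop copyin(wav, lev, chunk, MKKL, zero, two) copyout(denf) &
!$acc & private(MK, ip, MKj, ib, ik1, ik2, itx, it1, nu, i0, vv, i)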

-Mat

Thanks a lot, Mat.

Eventually, I solved the problem. The following is the summary.

  1. Integer ==> Integer(kind=8): this really solves the problem. In my understanding, NVIDIA HPC 21.2 requires this specification for the integer variables, at least for OpenACC, which was not required by PGI Community Edition 19.10.

  2. Invalid tag: I found that it comes from derived-type variables that contain character components (see the small sketch after this list). After I removed the relevant character variables, the warning no longer appears.

  3. For the derived type, it is not necessary to use ‘-gpu=deepcopy’ with HPC 21.2.

  4. Concerning “!$acc parallel loop copyin(wav, lev, chunk, MKKL, zero, two) creat(denf) copyout(denf)”: it is just a typo in the post. In fact, I only use “copyout(denf)”, and it works well.
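
For point 2, the pattern that triggered the warning looked roughly like this (a made-up illustration, not the actual declaration):

! Hypothetical derived type mixing character and numeric components;
! the character component is what produced the "invalid tag" DWARF
! warning, and removing it made the warning go away.
TYPE LEVELS
    CHARACTER(LEN=10) :: label
    DOUBLE PRECISION, ALLOCATABLE :: vv(:), mu(:)
END TYPE LEVELS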

Now there is another problem. My code works on one server, which has 8 GPU cards (Titan V) and NVIDIA HPC 21.2, while the same code does not work on another server, which has 8 GPU cards (Tesla V100) and NVHPC 20.7. It seems to end up in an infinite loop.

!--- Loop over OpenMP threads
NGPS = NGP + 1
Call OMP_SET_NUM_THREADS(NGPS)
!$OMP PARALLEL DO private(id, MK1, i)
do id = 0, NGP

!--- Get the OpenMP thread and set the GPU card
    Call ACC_SET_DEVICE_NUM(IDV(id),acc_device_nvidia)

!$acc data copyin (denf,cctm,XBIS,XBIT,XBIV,zero,two,id, xdel,xrtd,xrsd, fpio, grtn) copy( TMR(:,id) )
    do MK1  = AMP%MKb(id), AMP%MKe(id)
        
    !$acc parallel loop copyin( MK1, HSC(:,MK1), HVE(:,MK1), HPV(:,MK1), HTS(:,MK1), HTT(:,MK1), HVT(:,MK1), &
    !$acc             &              HTV(:,MK1), VTH(:,MK1), TVH(:,MK1), HPD(:,MK1), TSD(:,MK1), TTD(:,MK1) ) &
    !$acc             & copyout( eself(:, MK1) ) private(EXC, i, it,LP1, MK2,L,LP2,LY,idw, oPM,oMP,oMM,oPP,  &
    !$acc             & rPM,rMM,rPP,rMP, tPP,tMM,tMP,tPM, sPP,sMM,sMP,sPM, ePP,eMM,eMP,ePM, dPP,dMM,dMP,dPM, &
    !$acc             & XPP,XMM,XPM,XMP, YPP,YPM,YMP,YMM, tSS,tOV,tRV, SST,OVT,RVT )
    
    !--- Initialize
        do i = 1, MSD

        end do      !--- end loop over i
    !$acc end parallel loop
                                
    end do      !--- end loop over MK1
!$acc end data   
end do

!$OMP END PARALLEL DO

I omitted the details inside the acc parallel loop.

Can you describe the issue you’re seeing and, if possible, provide a reproducing example? Also, while it’s probably not causing any issues, I’d like to have our engineers look at the warning about the DWARF tag.

Note that I personally don’t like using OpenMP to manage multiple GPUs. It works, but it adds a lot of complexity and extra data movement. I prefer MPI+OpenACC, since it allows a one-to-one relationship between rank and GPU, lets you better manage data movement, scales beyond a single node, and lets you use CUDA-aware MPI for direct transfers of data between GPUs.
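
For example, a minimal sketch of the MPI+OpenACC approach (assuming an MPI library and the openacc module are available; names are illustrative):

! One MPI rank per GPU: each rank selects its own device, so no OpenMP
! threading or per-thread device/data management is needed.
program mpi_acc_bind
    use mpi
    use openacc
    implicit none
    integer :: ierr, rank, ndev

    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

    ! Map ranks to devices round-robin (assumes at least one GPU is present).
    ndev = acc_get_num_devices(acc_device_nvidia)
    call acc_set_device_num(mod(rank, ndev), acc_device_nvidia)

    ! ... each rank runs its OpenACC regions on its share of the MK range,
    !     and CUDA-aware MPI can exchange device data directly ...

    call MPI_Finalize(ierr)
end program mpi_acc_bind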