Random segmentation fault

Dear NVIDIA experts:
I am using cuda fortran. The code is multi-GPU with omp. I have a random segmentation fault and I have struggled with it several months. Hope I can get help here. Thanks so much!

For small test cases, there is no problem on many (super)computers.
For large applications on Casper (a NCAR supercomputer), I didn’t get problem at the beginning two months. Then, there is random segmentation fault (core dumped). Sometimes, it runs tens of minutes, sometimes several hours, sometimes no errors in 24 hours which is the upper limitation on Casper.
Then I add a few new features into the code. The segmentation fault (core dumped) randomly appears even in test cases on Casper. But when I run the new code on other (super)computers, no errors appear.
I then compile the code use -g and submit the test case using CUDA-MEMCHECK . I captured double free or corruption (fasttop) . I don’t know why since I know where I allocate/deallocate very well and I also don’t know if this is actually the same error with segmentation fault (core dumped) under release mode.

!$omp parallel &
!$omp shared(P,dz2,Vx,Vy,Vz,Saturation,Porosity,EvapTrans, &
!$omp np_ps,block_size,kk,np_active,nx,ny,nz,pfnt,&
!$omp pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin,&
!$omp xmax,ymax,zmax,pp,nind,Ind), &
!$omp private(tnum,istat,P_de,C_de,dz_de,Vx_de,Vy_de,Vz_de, &
!$omp EvapTrans_de,Saturation_de,Porosity_de,Ind_de, &
!$omp out_age_de,out_mass_de,out_comp_de,out_np_de, &
!$omp et_age_de,et_mass_de,et_comp_de,et_np_de), &
!$omp reduction(+:out_age_cpu,out_mass_cpu,out_comp_cpu,out_np_cpu, &
!$omp et_age_cpu,et_mass_cpu,et_comp_cpu,et_np_cpu,C)
pp = omp_get_num_threads()
tnum = omp_get_thread_num()
istat = cudaSetDevice(tnum)
np_ps=(ppblock_size-mod(np_active,ppblock_size)+np_active)/pp
allocate(P_de(np_ps,12+2nind))
P_de = P(1+tnum
np_ps:(tnum+1)np_ps,1:12+2nind)
C_de = C
dz_de = dz2
Vx_de = Vx
Vy_de = Vy
Vz_de = Vz
Saturation_de = Saturation
Porosity_de = Porosity
EvapTrans_de = EvapTrans
Ind_de = Ind
out_age_de = out_age_cpu
out_mass_de = out_mass_cpu
out_comp_de = out_comp_cpu
out_np_de = out_np_cpu
et_age_de = et_age_cpu
et_mass_de = et_mass_cpu
et_comp_de = et_comp_cpu
et_np_de = et_np_cpu
call particles_independent <<< np_ps/block_size, block_size, &
block_size*(12+2nind)8 >>> (&
P_de,C_de,dz_de,EvapTrans_de,Vx_de,Vy_de,Vz_de,Saturation_de,&
Porosity_de,out_age_de,out_mass_de,out_comp_de,et_age_de,&
et_mass_de,et_comp_de,out_np_de,et_np_de,Ind_de,&
kk,np_ps,nx,ny,nz,pfnt,nind,&
pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin,&
xmax,ymax,zmax,tnum)
P(1+tnum
np_ps:(tnum+1)np_ps,1:12+2nind) = P_de(1:np_ps,1:12+2
nind)
deallocate(P_de)
C = C_de
out_age_cpu = out_age_de
out_mass_cpu = out_mass_de
out_comp_cpu = out_comp_de
out_np_cpu = out_np_de
et_age_cpu = et_age_de
et_mass_cpu = et_mass_de
et_comp_cpu = et_comp_de
et_np_cpu = et_np_de
!$omp end parallel

I guess the segfault error appears here. it is also the only place I call the kernel function. Do you see any obvious problem in this parallel region?

Hi Chen_yeng,

Without a reproducing example, it’s very difficult to tell what’s going on. Though given the somewhat random nature of the error, I’d start by looking for a stack-overflow or race condition. Though, since you’re using OpenMP to manage multiple devices, it’s possible there’s an issue with how the private variables are being allocated and deallocated. Personally, I recommend using MPI to manage multiple devices.

Does the data set being processed seem to effect the error?
How many OpenMP threads are being run?
Does the error occur when using only 1 OpenMP thread?
Does the number of devices being used effect the error?

-Mat

Hi Mat,

Thanks so much for your response and sorry for my late reply.

For your questions:

Does the data set being processed seem to effect the error?

I think so, since it looks more frequent in large applications. But I don’t have direct evidence for it.

The most related array to large or small computation load is the P array as shown in the code posted on the Forum.

As you can see, it is allocate and deallocated explicitly in the parallel region as that in your example.

How many OpenMP threads are being run?

I used 4 or 8 threads on Casper (a NCAR supercomputer) where there is the error.

I used one thread on my desktop, no errors.

I used 1/2/3/4 threads on a workstation, no errors.

I used 1/2/3/4 threads on Tianhe Supercomputer (in China), no errors.

I used 1/2 threads on a supercomputer at Sustech (a Chinese University), no error.

Does the error occur when using only 1 OpenMP thread?

No, As mentioned above, 1 thread on my desktop/Tianhe/sustech, no errors.

Does the number of devices being used effect the error?

Also as mentioned above, looks NO.

On Casper, no matter how many threads were used, there are errors.

On others, no matter how many threads were used, there are no errors.

You might be confused that why I use Casper, since it has 8 GPUs per node which can meet my requirement while others don’t have.

I checked the author of the reference Example I used for my OMP+multi-GPU programming, and found it is you. I am so excited!

In fact, for the private device variables used in the OMP parallel region (I posted on the Forum), I didn’t allocate them explicitly.

I think they are allocated by assigning host arrays to them (FORTRAN syntax) and deallocated when code exits the OMP parallel region (I tested).

Hence, based on my understanding, there should be no wrong use of the OMP or FORTRAN.

But I indeed tried explicit allocate and deallocate them in the OMP parallel region, but it didn’t help.

Thanks again,

Happy holidays!

Chen

Hi Mat,

In addition to last email, for stack size, it is unlimited, and for race condition, I think there is no such an situation in my code.

But when I read the user manual of HPC SDK 20.11 this afternoon, I found a paragraph about the OMP_STACKSIZE.

I never set this environment variable before, either on Casper or on other computers. I am not sure if this matters? I will try later.

One more question is do you know where I can find more official examples about multi-GPU with OMP.

I agree with you that it is better to use MPI and I indeed plan to use MPI in the future development, but for current code, I have to use it to finish some applications.

Thanks a lot for your suggestion!

Best,

Chen

And also, I am not sure if CUDA Fortran, OMP, and Fortran have any conflicts to use the stack (especially CUDA FORTRAN and OMP), so there is double free which might also be the segfault under release mode.

Just for your information. I don’t know much about this since it is too professional.

Thanks,

Chen

Hi Chen,

Ok, so then the problem is specific to running on Casper (https://www2.cisl.ucar.edu/resources/computational-systems/casper).

Just a guess, but could it be a srun config issue such as not running on the correct nodes or not allocating enough GPUs to the job? Is there anyone locally at University of Wyoming that might help?

Your code just assigns the device number to the same as the OMP thread number which doesn’t protect against having more threads than there are GPUs. Not that it’s the issue here, but you may want to keep track of the number of devices (by calling cudaGetDeviceCount) and then do a mod operation to set the device (deviceNum = mod(tnum,nDevices))

-Mat

Thank you very much, Mat! Let me try what you said.

In fact, I set the OMP threads the same as the GPU number. Can this help?

Several more questions in my additional try:

Ok, that’s fine then. Though I still encourage you to not use OpenMP to manage multiple devices and instead use MPI. Granted, you have it working on the other systems so it’s probably ok, but using MPI is more straight forward and using CUDA Aware MPI allows for direct device to device data transfers.

Thanks Mat.

Though it is OK on other devices, I have to do large applications on Casper.

I always think I might not use OMP correctly, so I tried these new codes, but they failed with new problems which I think should not happen since I did that based on PGI manual.

What do you think about them? Could you give some ideas?

I think I don’t need device to device transfer at current time. I thought your recommendation of MPI last time is due to the thread-safety.

Do you think the OMP with CUDA Fortran sometimes is not that safe?

Thank you very much!

Chen

Hi Dick,

Together with your last tests and Mat’s following questions. I think there is some other useful information (though probably they are still not the reason).

I said there was no error when I used the code on Casper at the beginning several months.

In fact, Casper was idle at that time. I think the devices were all set successfully.

However, at the time the error began, Casper has been much busier. So I think it might be due to the failure of setup of device sometimes.

Especially considering the double free under debug mode, I think, for example, if one thread (for example thread3) failed to set a device, it will use device 0 as default.

Then when code exited the parallel region, the device array would be deallocated twice on device 0 due to the operations from threads 0 and 3.

This might also answer my question that the code becomes slow, since, in fact, the code used less GPUs than I wanted.

I asked Shiquan before, if I submitted a job using 4 GPUs, can anyone else submitting a job later than me use the same GPUs with me.

Shiquan answered if I use gres to submit, others cannot share the same GPUs, if I use type, others will share GPUs with me.

Hence, at that time, I thought if I submit a 8-GPU job (a node) using gres, it should be OK. But error still appeared when I did so.

In fact, there is still something unclear. If one has submitted a job using type, then I submit my 8-gpu job using gres, will we share GPUs?

Can you help me check this with Brian when you have the chance?

Thanks much!

Chen

Sorry Mat,
My other questions were not appears here but I indeed included them in the email. let me post them here, could you help to answer?
Several more questions in my additional try:
It looks sometimes the cudasetdevice can fail. But why? How to avoid this? Is the API case sensitive?
When the array is too large, the pinned memory allocation will fail. If there is enough pinned memory, is there other reason for this?
There is also another weird problem. In order to avoid ‘private’ the device arrays at the entrance of the OMP parallel region. I rewrite the code using different names of device arrays on different GPUs. There is indeed such an official example in PGI manual. But is doesn’t work as follows (tens of arrays). If I only allocate one array for one device, it indeed works. But two arrays will fail.
!$omp parallel private(tnum,istat)
tnum = omp_get_thread_num()
istat = cudaSetDevice(tnum)
if(tnum .eq. 0)then
allocate(dz_de1(nz))
allocate(Vx_de1(nnx,ny,nz), Vy_de1(nx,nny,nz), Vz_de1(nx,ny,nnz), Ind_de1(nx,ny,nz))
allocate(Saturation_de1(nx,ny,nz), Porosity_de1(nx,ny,nz),EvapTrans_de1(nx,ny,nz))
allocate(C_de1(n_constituents,nx,ny,nz))
allocate(out_np_de1(1),ET_np_de1(1))
allocate(out_age_de1(1),out_mass_de1(1),out_comp_de1(3))
allocate(ET_age_de1(1),ET_mass_de1(1),ET_comp_de1(3))
dz_de1 = dz2
Ind_de1 = Ind
Porosity_de1 = Porosity
elseif(tnum .eq. 1)then
allocate(dz_de2(nz))
allocate(Vx_de2(nnx,ny,nz), Vy_de2(nx,nny,nz), Vz_de2(nx,ny,nnz), Ind_de2(nx,ny,nz))
allocate(Saturation_de2(nx,ny,nz), Porosity_de2(nx,ny,nz),EvapTrans_de2(nx,ny,nz))
allocate(C_de2(n_constituents,nx,ny,nz))
allocate(out_np_de2(1),ET_np_de2(1))
allocate(out_age_de2(1),out_mass_de2(1),out_comp_de2(3))
allocate(ET_age_de2(1),ET_mass_de2(1),ET_comp_de2(3))
dz_de2 = dz2
Ind_de2 = Ind
Porosity_de2 = Porosity
elseif(tnum .eq. 2)then
allocate(dz_de3(nz))
allocate(Vx_de3(nnx,ny,nz), Vy_de3(nx,nny,nz), Vz_de3(nx,ny,nnz), Ind_de3(nx,ny,nz))
allocate(Saturation_de3(nx,ny,nz), Porosity_de3(nx,ny,nz),EvapTrans_de3(nx,ny,nz))
allocate(C_de3(n_constituents,nx,ny,nz))
allocate(out_np_de3(1),ET_np_de3(1))
allocate(out_age_de3(1),out_mass_de3(1),out_comp_de3(3))
allocate(ET_age_de3(1),ET_mass_de3(1),ET_comp_de3(3))
dz_de3 = dz2
Ind_de3 = Ind
Porosity_de3 = Porosity
else
allocate(dz_de4(nz))
allocate(Vx_de4(nnx,ny,nz), Vy_de4(nx,nny,nz), Vz_de4(nx,ny,nnz), Ind_de4(nx,ny,nz))
allocate(Saturation_de4(nx,ny,nz), Porosity_de4(nx,ny,nz),EvapTrans_de4(nx,ny,nz))
allocate(C_de4(n_constituents,nx,ny,nz))
allocate(out_np_de4(1),ET_np_de4(1))
allocate(out_age_de4(1),out_mass_de4(1),out_comp_de4(3))
allocate(ET_age_de4(1),ET_mass_de4(1),ET_comp_de4(3))
dz_de4 = dz2
Ind_de4 = Ind
Porosity_de4 = Porosity
endif
!$omp end parallel

But if I don’t use omp, just do as follows, it works well

istat = cudaSetDevice(0)
    allocate(dz_de1(nz))
    allocate(Vx_de1(nnx,ny,nz), Vy_de1(nx,nny,nz), Vz_de1(nx,ny,nnz), Ind_de1(nx,ny,nz))
    allocate(Saturation_de1(nx,ny,nz), Porosity_de1(nx,ny,nz),EvapTrans_de1(nx,ny,nz))
    allocate(C_de1(n_constituents,nx,ny,nz))
    allocate(out_np_de1(1),ET_np_de1(1))
    allocate(out_age_de1(1),out_mass_de1(1),out_comp_de1(3))
    allocate(ET_age_de1(1),ET_mass_de1(1),ET_comp_de1(3))
    dz_de1 = dz2
    Ind_de1 = Ind
    Porosity_de1 = Porosity

istat = cudaSetDevice(1)
    allocate(dz_de2(nz))
    allocate(Vx_de2(nnx,ny,nz), Vy_de2(nx,nny,nz), Vz_de2(nx,ny,nnz), Ind_de2(nx,ny,nz))
    allocate(Saturation_de2(nx,ny,nz), Porosity_de2(nx,ny,nz),EvapTrans_de2(nx,ny,nz))
    allocate(C_de2(n_constituents,nx,ny,nz))
    allocate(out_np_de2(1),ET_np_de2(1))
    allocate(out_age_de2(1),out_mass_de2(1),out_comp_de2(3))
    allocate(ET_age_de2(1),ET_mass_de2(1),ET_comp_de2(3))
    dz_de2 = dz2
    Ind_de2 = Ind
    Porosity_de2 = Porosity

istat = cudaSetDevice(2)
    allocate(dz_de3(nz))
    allocate(Vx_de3(nnx,ny,nz), Vy_de3(nx,nny,nz), Vz_de3(nx,ny,nnz), Ind_de3(nx,ny,nz))
    allocate(Saturation_de3(nx,ny,nz), Porosity_de3(nx,ny,nz),EvapTrans_de3(nx,ny,nz))
    allocate(C_de3(n_constituents,nx,ny,nz))
    allocate(out_np_de3(1),ET_np_de3(1))
    allocate(out_age_de3(1),out_mass_de3(1),out_comp_de3(3))
    allocate(ET_age_de3(1),ET_mass_de3(1),ET_comp_de3(3))
    dz_de3 = dz2
    Ind_de3 = Ind
    Porosity_de3 = Porosity

istat = cudaSetDevice(3)
    allocate(dz_de4(nz))
    allocate(Vx_de4(nnx,ny,nz), Vy_de4(nx,nny,nz), Vz_de4(nx,ny,nnz), Ind_de4(nx,ny,nz))
    allocate(Saturation_de4(nx,ny,nz), Porosity_de4(nx,ny,nz),EvapTrans_de4(nx,ny,nz))
    allocate(C_de4(n_constituents,nx,ny,nz))
    allocate(out_np_de4(1),ET_np_de4(1))
    allocate(out_age_de4(1),out_mass_de4(1),out_comp_de4(3))
    allocate(ET_age_de4(1),ET_mass_de4(1),ET_comp_de4(3))
    dz_de4 = dz2
    Ind_de4 = Ind
    Porosity_de4 = Porosity

So why?

I might give up the OMP. I have struggled several months on this. I am trying MPI. But I still want to know the answer for the above questions.
Thanks so much!

Fortran is not case-sensitive. As for the failure, do you have an error code?

The error codes are listed here and might help in understand why it’s failing: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__TYPES.html#group__CUDA__TYPES_1gc6c391505e117393cc2558fff6bfc2e9

When the array is too large, the pinned memory allocation will fail. If there is enough pinned memory, is there other reason for this?

Pinned memory is allocated in the host’s physical memory so can run out. Though, it looks like the Casper nodes have 768GB and 1152GB, which is a lot. So it may be a an srun or shell limit on the physical memory.

Though how big is “too large”? If allocating arrays larger than 2GB, you need to add the flag “-Mlarge_array” or “-mcmodel=medium”.

Also, how does it fail? During allocation? If so, what’s the status that get’s returned? Or when accessing the array, in which case, it’s most like the 2GB limit as mentioned above.

So why?

It’s very difficult for me to tell without a reproducer or more information. How does it fail? Allocation? While executing on the device?

I might give up the OMP. I have struggled several months on this. I am trying MPI.

Granted, I don’t know your code or algorithm, but personally, that’s what I’d do. Not only will it make your code much cleaner, it will allow the code to scale not only to more GPUs but also across multiple nodes.

Thank you so much, Mat!