Dear NVIDIA experts:
I am using CUDA Fortran. The code is multi-GPU, using OpenMP with one GPU per thread. I have been struggling with a random segmentation fault for several months. I hope I can get help here. Thanks so much!
For small test cases, there is no problem on many (super)computers.
For large applications on Casper (an NCAR supercomputer), I had no problems during the first two months. Then random segmentation faults (core dumped) began: sometimes the code runs for tens of minutes, sometimes for several hours, and sometimes it completes the full 24 hours (the wall-clock limit on Casper) with no error.
I then added a few new features to the code. Now the segmentation fault (core dumped) appears randomly even in test cases on Casper, yet when I run the new code on other (super)computers, no errors appear.
I then compiled the code with -g and ran the test case under cuda-memcheck. It caught double free or corruption (fasttop). I don't know why, since I know exactly where I allocate and deallocate, and I also don't know whether this is the same error as the segmentation fault (core dumped) that appears in release mode.
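For reference, this is roughly how I build and run the debug case (a sketch, assuming the NVIDIA HPC SDK nvfortran driver; the program and file names here are placeholders, and my actual makefile flags differ slightly):

# build with debug symbols, CUDA Fortran, and OpenMP enabled
nvfortran -g -O0 -cuda -mp -o myapp main.f90
# run the test case under the memory checker
cuda-memcheck ./myapp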
!$omp parallel &
!$omp shared(P,dz2,Vx,Vy,Vz,Saturation,Porosity,EvapTrans, &
!$omp np_ps,block_size,kk,np_active,nx,ny,nz,pfnt,&
!$omp pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin,&
!$omp xmax,ymax,zmax,pp,nind,Ind), &
!$omp private(tnum,istat,P_de,C_de,dz_de,Vx_de,Vy_de,Vz_de, &
!$omp EvapTrans_de,Saturation_de,Porosity_de,Ind_de, &
!$omp out_age_de,out_mass_de,out_comp_de,out_np_de, &
!$omp et_age_de,et_mass_de,et_comp_de,et_np_de), &
!$omp reduction(+:out_age_cpu,out_mass_cpu,out_comp_cpu,out_np_cpu, &
!$omp et_age_cpu,et_mass_cpu,et_comp_cpu,et_np_cpu,C)
pp = omp_get_num_threads()
tnum = omp_get_thread_num()
istat = cudaSetDevice(tnum)   ! one GPU per OpenMP thread
! particles per GPU, padded so np_ps is a multiple of block_size
np_ps = (pp*block_size - mod(np_active,pp*block_size) + np_active)/pp
allocate(P_de(np_ps,12+2*nind))
! copy this thread's slice of the particle array to its device
P_de = P(1+tnum*np_ps:(tnum+1)*np_ps,1:12+2*nind)
! copy the remaining inputs and the per-thread accumulators to the device
C_de = C
dz_de = dz2
Vx_de = Vx
Vy_de = Vy
Vz_de = Vz
Saturation_de = Saturation
Porosity_de = Porosity
EvapTrans_de = EvapTrans
Ind_de = Ind
out_age_de = out_age_cpu
out_mass_de = out_mass_cpu
out_comp_de = out_comp_cpu
out_np_de = out_np_cpu
et_age_de = et_age_cpu
et_mass_de = et_mass_cpu
et_comp_de = et_comp_cpu
et_np_de = et_np_cpu
! dynamic shared memory: block_size*(12+2*nind) 8-byte words per block
call particles_independent <<< np_ps/block_size, block_size, &
block_size*(12+2*nind)*8 >>> (&
P_de,C_de,dz_de,EvapTrans_de,Vx_de,Vy_de,Vz_de,Saturation_de,&
Porosity_de,out_age_de,out_mass_de,out_comp_de,et_age_de,&
et_mass_de,et_comp_de,out_np_de,et_np_de,Ind_de,&
kk,np_ps,nx,ny,nz,pfnt,nind,&
pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin,&
xmax,ymax,zmax,tnum)
! copy this thread's particles back to the host
P(1+tnum*np_ps:(tnum+1)*np_ps,1:12+2*nind) = P_de(1:np_ps,1:12+2*nind)
deallocate(P_de)
! copy the reduction accumulators back to the host
C = C_de
out_age_cpu = out_age_de
out_mass_cpu = out_mass_de
out_comp_cpu = out_comp_de
out_np_cpu = out_np_de
et_age_cpu = et_age_de
et_mass_cpu = et_mass_de
et_comp_cpu = et_comp_de
et_np_cpu = et_np_de
!$omp end parallel
I guess the segmentation fault occurs here; this is also the only place where the kernel is called. Do you see any obvious problem in this parallel region?
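In case it helps narrow this down, here is how I could add explicit error checks right after the launch (a minimal sketch using the cudafor module; istat is the same integer I already use above):

istat = cudaGetLastError()        ! launch-time errors (bad grid/block config, etc.)
if (istat /= cudaSuccess) print *, 'launch: ', cudaGetErrorString(istat)
istat = cudaDeviceSynchronize()   ! asynchronous errors during kernel execution
if (istat /= cudaSuccess) print *, 'kernel: ', cudaGetErrorString(istat)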