Dear NVIDIA experts:

I am using CUDA Fortran. The code is multi-GPU, using OpenMP with one GPU per thread. I am hitting a random segmentation fault that I have struggled with for several months. I hope I can get help here. Thanks so much!

For small test cases, there is no problem on many (super)computers.

For large applications on Casper (an NCAR supercomputer), I had no problems for the first two months. Then a random segmentation fault (core dumped) started to appear. Sometimes the job runs for tens of minutes, sometimes for several hours, and sometimes it completes 24 hours (the wall-clock limit on Casper) without any error.

Then I added a few new features to the code. Now the segmentation fault (core dumped) appears randomly even in the test cases on Casper, yet when I run the new code on other (super)computers, no errors appear.

I then compiled the code with **-g** and ran the test case under **CUDA-MEMCHECK**. It reported **double free or corruption (fasttop)**. I don't understand why, since I know exactly where I allocate and deallocate, and I also don't know whether this is actually the same error as the **segmentation fault (core dumped)** seen in release mode.

```fortran
!$omp parallel &
!$omp shared(P,dz2,Vx,Vy,Vz,Saturation,Porosity,EvapTrans, &
!$omp np_ps,block_size,kk,np_active,nx,ny,nz,pfnt, &
!$omp pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin, &
!$omp xmax,ymax,zmax,pp,nind,Ind), &
!$omp private(tnum,istat,P_de,C_de,dz_de,Vx_de,Vy_de,Vz_de, &
!$omp EvapTrans_de,Saturation_de,Porosity_de,Ind_de, &
!$omp out_age_de,out_mass_de,out_comp_de,out_np_de, &
!$omp et_age_de,et_mass_de,et_comp_de,et_np_de), &
!$omp reduction(+:out_age_cpu,out_mass_cpu,out_comp_cpu,out_np_cpu, &
!$omp et_age_cpu,et_mass_cpu,et_comp_cpu,et_np_cpu,C)

pp = omp_get_num_threads()
tnum = omp_get_thread_num()
istat = cudaSetDevice(tnum)

np_ps = (pp*block_size - mod(np_active,pp*block_size) + np_active)/pp

allocate(P_de(np_ps,12+2*nind))
P_de = P(1+tnum*np_ps:(tnum+1)*np_ps, 1:12+2*nind)

C_de = C
dz_de = dz2
Vx_de = Vx
Vy_de = Vy
Vz_de = Vz
Saturation_de = Saturation
Porosity_de = Porosity
EvapTrans_de = EvapTrans
Ind_de = Ind

out_age_de  = out_age_cpu
out_mass_de = out_mass_cpu
out_comp_de = out_comp_cpu
out_np_de   = out_np_cpu
et_age_de   = et_age_cpu
et_mass_de  = et_mass_cpu
et_comp_de  = et_comp_cpu
et_np_de    = et_np_cpu

call particles_independent <<< np_ps/block_size, block_size, &
     block_size*(12+2*nind)*8 >>> ( &
     P_de,C_de,dz_de,EvapTrans_de,Vx_de,Vy_de,Vz_de,Saturation_de, &
     Porosity_de,out_age_de,out_mass_de,out_comp_de,et_age_de, &
     et_mass_de,et_comp_de,out_np_de,et_np_de,Ind_de, &
     kk,np_ps,nx,ny,nz,pfnt,nind, &
     pfdt,moldiff,dx,dy,denh2o,dtfrac,xmin,ymin,zmin, &
     xmax,ymax,zmax,tnum)

P(1+tnum*np_ps:(tnum+1)*np_ps, 1:12+2*nind) = P_de(1:np_ps, 1:12+2*nind)
deallocate(P_de)

C = C_de
out_age_cpu  = out_age_de
out_mass_cpu = out_mass_de
out_comp_cpu = out_comp_de
out_np_cpu   = out_np_de
et_age_cpu   = et_age_de
et_mass_cpu  = et_mass_de
et_comp_cpu  = et_comp_de
et_np_cpu    = et_np_de

!$omp end parallel
```
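A note on the decomposition: each OpenMP thread gets `np_ps` particles, padded so the per-GPU count is a multiple of `block_size`. A quick sanity check of that formula in Python, with made-up numbers (the values of `pp`, `block_size`, and `np_active` here are illustrative assumptions, not my production values):

```python
# Sanity check of the Fortran expression:
#   np_ps = (pp*block_size - mod(np_active, pp*block_size) + np_active)/pp
# All numbers below are made up for illustration.
pp = 4            # number of OpenMP threads / GPUs
block_size = 256  # CUDA thread-block size
np_active = 100000

chunk = pp * block_size
np_ps = (chunk - np_active % chunk + np_active) // pp

print(np_ps)               # per-GPU particle count
print(np_ps % block_size)  # 0: divides evenly into thread blocks
# Note: when np_active is already a multiple of pp*block_size, this formula
# still adds one full extra chunk of padding.
```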

I suspect the segfault originates here; it is also the only place where I launch the kernel. Do you see any obvious problem in this parallel region?
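For completeness, I also sanity-checked the dynamic shared-memory request in the launch configuration, `block_size*(12+2*nind)*8` bytes, against the default 48 KB per-block limit (the values of `block_size` and `nind` below are illustrative assumptions):

```python
# Dynamic shared memory requested per block in the launch configuration:
#   block_size * (12 + 2*nind) * 8 bytes  (8 = bytes per double precision value)
# Numbers below are illustrative, not my production values.
block_size = 256
nind = 5

shmem_bytes = block_size * (12 + 2 * nind) * 8
limit = 48 * 1024  # default per-block dynamic shared-memory limit on most GPUs

print(shmem_bytes, limit, shmem_bytes <= limit)
```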