Simple parallel region but... core dumped

Dear support,

I added a simple ACC region on top of a single DO loop. On the face of it, everything should just work.

The code is:

!$acc region copyin(aux, eigts1, eigts2, eigts3, mill, g) copyout(aux1)
 do ig = 1, ngm
    cfac = aux (ig, is) * &
           CONJG( eigts1 (mill (1,ig), na) * &
                  eigts2 (mill (2,ig), na) * &
                  eigts3 (mill (3,ig), na) )
    aux1 (ig) = cfac * g (jpol, ig)
 enddo
!$acc end region

"is" and "jpol" are indices that come from outer loops. "aux1" is used just after the ACC region, so I put it in the copyout clause; it does not require any initialization.
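For context, the surrounding loop nest looks roughly like this (a simplified sketch; the actual bounds and loop variables come from the rest of the code):

do na = 1, nat
   do is = 1, nspin
      do ipol = 1, 3
         do jpol = 1, ipol
            ! ... the ACC region shown above ...
         enddo
      enddo
   enddo
enddo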

The -Minfo output reports:

108, Generating copyout(aux1(:))
Generating copyin(g(jpol,1:ngm))
Generating copyin(mill(1:3,1:ngm))
Generating copyin(eigts3(:,:))
Generating copyin(eigts2(:,:))
Generating copyin(eigts1(:,:))
Generating copyin(aux(:,:))
Generating compute capability 2.0 binary
109, Loop is parallelizable
Accelerator kernel generated
109, !$acc do parallel, vector(32) ! blockidx%x threadidx%x
Non-stride-1 accesses for array ‘g’
Non-stride-1 accesses for array ‘mill’
CC 2.0 : 21 registers; 4 shared, 208 constant, 0 local memory bytes; 16% occupancy

(occupancy is low, but well… I am more interested in getting OpenACC working at that specific point right now :-P)

And, after the core file is generated, this is the point where I get the error:

(gdb) bt
#0 0x0000003513487fc6 in __memcpy_sse2 () from /lib64/libc.so.6
#1 0x00007f313104b691 in ?? () from /usr/lib64/libcuda.so.1
#2 0x00007f31310557b3 in ?? () from /usr/lib64/libcuda.so.1
#3 0x00007f3131055d8c in ?? () from /usr/lib64/libcuda.so.1
#4 0x00007f313104d54e in ?? () from /usr/lib64/libcuda.so.1
#5 0x00007f313102d6b7 in ?? () from /usr/lib64/libcuda.so.1
#6 0x00007f31310300ad in ?? () from /usr/lib64/libcuda.so.1
#7 0x00007f3131020923 in ?? () from /usr/lib64/libcuda.so.1
#8 0x0000000000877ba3 in __pgi_cu_upload2 (devptr=13865189376, hostptr=0xcb4d98, devx=0, devy=0, hostx=0, hosty=0, size1=1, size2=82835, devstride2=1,
hoststride1=1, hoststride2=3, elementsize=8, lineno=108, name=0xba335c "g$p") at …/src-nv/nvupload2.c:82
#9 0x0000000000873d2d in __pgi_cu_uploadx_seq (devptr=13865189376, hostptr=0xcb4d98, dims=2, desc=0x7fff50988f20, elementsize=8, lineno=108,
name=0xba335c "g$p") at …/src-nv/nvuploadx.c:236
#10 0x0000000000875661 in __pgi_cu_uploadxx_p (devptr=13865189376, hostptr=0xcb4d98, dims=2, desc=0x7fff50988f20, elementsize=8, lineno=108,
name=0xba335c "g$p", eventinfo=0xbf1190) at …/src-nv/nvuploadx.c:649
#11 0x0000000000875924 in __pgi_cu_uploadx_a_p (devptr=13865189376, hostptr=0xcb4d98, dims=2, desc=0x7fff50988f20, elementsize=8, lineno=108,
name=0xba335c "g$p", flags=0, async=0) at …/src-nv/nvuploadx.c:705
#12 0x00000000005e0cb7 in addusstres.pgi.uni.gpu
(sigmanlc=…) at ./addusstress.F90:108
#13 0x00000000005caae1 in stres_knl.pgi.uni.istanbul
(sigmanlc=…, sigmakin=…) at ./stres_knl.F90:90
#14 0x00000000004ae69e in stress.pgi.uni.istanbul
(sigma=…) at ./stress.F90:116
#15 0x000000000041c32c in pwscf.pgi.uni.istanbul
() at ./pwscf.F90:119

I think I put the ACC region directive in the right place with the right clauses. I do not see any obstacle inside the loop, and CONJG should be supported (I am using PGI 12.2). Is it possible that the program crashes at that point because there is not enough memory available? If so, how can I detect that and apply a recovery strategy in the code?
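To make the question concrete, I was imagining a pre-flight check along these lines (assuming the accel_lib routine acc_get_free_memory behaves as the PGI documentation describes; the byte count is only a rough, hypothetical estimate):

use accel_lib
integer(8) :: free_bytes, needed_bytes

! rough upper bound for this region: complex(DP) aux1 is 16 bytes per
! element; add the copyin arrays in the same way for a real estimate
needed_bytes = int(ngm, 8) * 16_8

free_bytes = acc_get_free_memory()   ! free bytes on the attached device
if (needed_bytes > free_bytes) then
   ! fall back to the plain host version of the loop here
endif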

Many thanks in advance!
F.

Hi fspiga,

It looks like there is a problem copying g. Exactly what's wrong I can't tell, but I assume it has to do with copying only a single row: the backtrace shows a strided upload of "g$p" (hoststride2=3), so the row g(jpol,1:ngm) is not contiguous on the host.

Can you put a data region around the outer "jpol" loop (I'm assuming there is one) and copy all of g? Something like:

!$acc data region copyin(g)
do jpol = 1, N
!$acc region copyin(aux, eigts1, eigts2, eigts3, mill) copyout(aux1)
 do ig = 1, ngm
    cfac = aux (ig, is) * &
           CONJG( eigts1 (mill (1,ig), na) * &
                  eigts2 (mill (2,ig), na) * &
                  eigts3 (mill (3,ig), na) )
    aux1 (ig) = cfac * g (jpol, ig)
 enddo
!$acc end region 
enddo
!$acc end data region
  • Mat

Still core dumped.

But I've realized that "g" was declared this way:

REAL(DP), ALLOCATABLE, TARGET :: g(:,:)

So I made this change, just as a test:

do jpol = 1, ipol
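   ! g_acc: REAL(DP) work array of length ngm, allocated beforehand
   ! (my assumption); the row copy makes the host data contiguous,
   ! so the upload to the device is stride-1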
   g_acc(:) = g(jpol, :)
!$acc region copyin(aux, eigts1, eigts2, eigts3, mill,g_acc) copyout(aux1)
   do ig = 1, ngm
      cfac = aux (ig, is) * &
             CONJG( eigts1 (mill (1,ig), na) * &
                    eigts2 (mill (2,ig), na) * &
                    eigts3 (mill (3,ig), na) )
      aux1 (ig) = cfac * g_acc(ig)
   enddo
!$acc end region
   ...
   ...
enddo

Now the message is:

call to cuMemcpyDtoH returned error 700: Launch failed
CUDA driver version: 4020

and it makes more sense: the kernel itself is failing, and the error only surfaces at the next device-to-host copy. I am going to investigate it. Many thanks!

I just realized that my sysadmin updated the CUDA driver to a new release this morning.

In order to use OpenACC, do I have to use the CUDA 4.0 driver? Is CUDA 4.1 fine? Is it possible that CUDA 4.2 causes the problem I reported above?

Many thanks again!!!

The problem is the same, even after reverting the driver.

In that piece of code there is something else that might be incompatible with OpenACC.

This operation:

aux1(ig) = cfac * g_acc(ig)

involves "cfac" (COMPLEX), "g_acc(ig)" (REAL), and "aux1" (COMPLEX). Both the real and imaginary parts of "cfac" are scaled by the value of "g_acc(ig)" and stored in aux1(ig).
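On the host this is just the standard Fortran complex-by-real product, i.e. equivalent to:

aux1(ig) = CMPLX( REAL(cfac) * g_acc(ig), &
                  AIMAG(cfac) * g_acc(ig), KIND=DP )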

This time I did a simple test: I removed that line, and it works.

Is this "mix of types" allowed in an OpenACC region? If not, what kind of limitations are there in this regard?

Is this "mix of types" allowed in an OpenACC region? If not, what kind of limitations are there in this regard?

I would consider this something we should support. If you can, please send a reproducing example to PGI Customer Support (trs@pgroup.com) and they will file a problem report.
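Something as small as this untested sketch should be enough to demonstrate it:

program repro
   implicit none
   integer, parameter :: DP = kind(1.0d0)
   integer, parameter :: n = 1024
   integer :: i
   complex(DP) :: cfac, b(n)
   real(DP)    :: r(n)

   r = 0.5_DP
!$acc region copyin(r) copyout(b)
   do i = 1, n
      cfac = CMPLX(1.0_DP, 2.0_DP, KIND=DP)
      b(i) = cfac * r(i)   ! the COMPLEX * REAL product in question
   enddo
!$acc end region
   print *, b(1)
end program repro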

Thanks,
Mat