Unified binary segfaults when running on host

I’m using PGI Fortran 11.9.
The following test program:

Program test
  parameter (NPOINT = 8)
  real, device, dimension(NPOINT*2) :: FSPC
!$acc region
  DO I=3,NPOINT,2
    FSPC(I)   = AAR*FSPC(I-2) - AAI*FSPC(I-1)
    FSPC(I+1) = AAR*FSPC(I-1) + AAI*FSPC(I-2)
  End Do
!$acc end region
End

gets a segfault when it is run on the host with the environment variable ACC_DEVICE=HOST set.

Here is how the program is compiled and run, along with a disassembly produced with gdb.

/tmp>
pgf95 -Mcuda=cc11 -ta=nvidia,cc11,host -tp=amd64 -Minfo=accel test.f95
test:
      5, Loop carried dependence of 'fspc' prevents parallelization
         Loop carried backward dependence of 'fspc' prevents vectorization
         Accelerator kernel generated
          5, !$acc do seq
             Non-stride-1 accesses for array 'fspc'
/tmp> ./a.out
/tmp> set -x ACC_DEVICE HOST
/tmp> ./a.out
fish: Job 1, “./a.out” terminated by signal SIGSEGV (Address boundary error)
/tmp> gdb ./a.out
GNU gdb (Ubuntu/Linaro 7.3-0ubuntu2) 7.3-2011.08
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type “show copying”
and “show warranty” for details.
This GDB was configured as “x86_64-linux-gnu”.
For bug reporting instructions, please see:

Reading symbols from /tmp/a.out…done.
(gdb) run
Starting program: /tmp/a.out
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x0000000000402c23 in test.pgi.uni.k8_ () at ./test.f95:5
5 DO I=3,NPOINT,2
(gdb) disas
Dump of assembler code for function test.pgi.uni.k8_:
0x0000000000402bd0 <0>: push %rbp
0x0000000000402bd1 <1>: mov %rsp,%rbp
0x0000000000402bd4 <4>: sub $0x10,%rsp
0x0000000000402bd8 <8>: xor %eax,%eax
0x0000000000402bda <10>: mov $0x6545e0,%edi
0x0000000000402bdf <15>: movss %xmm0,-0x4(%rbp)
0x0000000000402be4 <20>: movss %xmm1,-0x8(%rbp)
0x0000000000402be9 <25>: callq 0x4145e0 <pghpf_init>
0x0000000000402bee <30>: xor %eax,%eax
0x0000000000402bf0 <32>: mov $0x6545e4,%edi
0x0000000000402bf5 <37>: mov $0x6545e8,%esi
0x0000000000402bfa <42>: callq 0x404658 <pgf90_dev_auto_alloc>
0x0000000000402bff <47>: movss -0x8(%rbp),%xmm1
0x0000000000402c04 <52>: movss -0x4(%rbp),%xmm0
0x0000000000402c09 <57>: mov %rax,%rdi
0x0000000000402c0c <60>: mov $0x3,%ecx
0x0000000000402c11 <65>: lea 0xc(%rdi),%rax
0x0000000000402c15 <69>: data32 nopw %cs:0x0(%rax,%rax,1)
0x0000000000402c20 <80>: movaps %xmm1,%xmm2
=> 0x0000000000402c23 <83>: movss -0xc(%rax),%xmm3
0x0000000000402c28 <88>: mulss %xmm2,%xmm3
0x0000000000402c2c <92>: movaps %xmm0,%xmm4
0x0000000000402c2f <95>: movss -0x8(%rax),%xmm5
0x0000000000402c34 <100>: mulss %xmm4,%xmm5
0x0000000000402c38 <104>: subss %xmm5,%xmm3
0x0000000000402c3c <108>: movss -0x8(%rax),%xmm6
0x0000000000402c41 <113>: movss -0xc(%rax),%xmm5
0x0000000000402c46 <118>: movss %xmm3,-0x4(%rax)
0x0000000000402c4b <123>: mulss %xmm5,%xmm4
0x0000000000402c4f <127>: mulss %xmm6,%xmm2
0x0000000000402c53 <131>: addss %xmm2,%xmm4
0x0000000000402c57 <135>: movss %xmm4,(%rax)
0x0000000000402c5b <139>: add $0x8,%rax
0x0000000000402c5f <143>: dec %ecx
0x0000000000402c61 <145>: test %ecx,%ecx
0x0000000000402c63 <147>: jg 0x402c20
0x0000000000402c65 <149>: xor %eax,%eax
0x0000000000402c67 <151>: callq 0x4046f5 <pgf90_dev_auto_dealloc>
0x0000000000402c6c <156>: leaveq
0x0000000000402c6d <157>: retq
End of assembler dump.
(gdb)

Hi Senya,

The “device” attribute is a CUDA Fortran construct and CUDA Fortran doesn’t have the concept of a Unified Binary. Hence, when you try to run the Accelerator region on the host, the data is still over on the device. If you wish to use the Unified Binary concept, it must be a pure PGI Accelerator Model program. In other words, remove the “device” attribute.
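
For example, a pure Accelerator Model version of the reproducer would look something like this (just a sketch: only the "device" attribute is dropped, and everything else, including the implicitly typed AAR and AAI, is left exactly as in your original):

Program test
  parameter (NPOINT = 8)
  real, dimension(NPOINT*2) :: FSPC   ! "device" attribute removed; FSPC is now ordinary host memory
!$acc region
  DO I=3,NPOINT,2
    FSPC(I)   = AAR*FSPC(I-2) - AAI*FSPC(I-1)
    FSPC(I+1) = AAR*FSPC(I-1) + AAI*FSPC(I-2)
  End Do
!$acc end region
End

With the attribute removed, the accelerator runtime manages the host/device copies of FSPC for the !$acc region itself, so the unified binary can fall back to the host when ACC_DEVICE=HOST is set.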

Note that the loop is not parallelizable due to the backward dependence on FSPC, so it may be a poor choice for acceleration.
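
To make the dependence concrete, unrolling the first two trips of the loop shows that each iteration reads the two elements written by the previous one, which is why the compiler schedules it as !$acc do seq:

  ! I=3:  FSPC(3) = AAR*FSPC(1) - AAI*FSPC(2)
  !       FSPC(4) = AAR*FSPC(2) + AAI*FSPC(1)
  ! I=5:  FSPC(5) = AAR*FSPC(3) - AAI*FSPC(4)   ! reads FSPC(3) and FSPC(4) written at I=3
  !       FSPC(6) = AAR*FSPC(4) + AAI*FSPC(3)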

  • Mat

Thanks for the help.