CPU parallel and accelerator regions in the same program

Hi, I’ve been trying to use the PGI Accelerator compiler, version 11.10, to parallelize a code across both the CPU and the GPU. In the best case, I was hoping to have something like the following.

#pragma omp parallel 
{
   int tid = omp_get_thread_num();
   printf("id:%d\n", tid);
   if(tid == 0){
      acc_set_device_num(0, acc_device_nvidia);
      #pragma acc region for
      ...
   }else if(tid == 1){
      acc_set_device_num(1, acc_device_nvidia);
      #pragma acc region for
      ...
   }else{
      ...
   }
   
}

where … is some code to accelerate. I tried that first and ran into segfaults. Now I’m down to trying anything I can think of, but every time I try to have CPU-parallel and GPU-parallel regions in the same code, I get a segfault in the CPU region. GDB gives me something like the following.

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x2aaab14d6940 (LWP 6785)]
0x00000000004106df in _mp_penter64 ()
(gdb) bt
#0  0x00000000004106df in _mp_penter64 ()

I’ve tried everything I can think of to get these to work together, up to and including using pthreads to create the outer threads and running the CPU and GPU regions in separate pthreads, and I always get the same result. Once in a very long while, the code will run without a segfault, but when that happens it hangs. The most basic version of what I’ve been trying to do is this.

#include <stdio.h>
#include <omp.h>

#define SIZE 100
int main(int argc, char * argv[])
{
    int stuff[SIZE];
    int limit = omp_get_thread_limit();
    printf("limit:%d\n", limit);
#pragma omp parallel shared(stuff)
    {
        int tid = omp_get_thread_num();
        int i;
        if(tid == 0){
#pragma acc region for copy(stuff)
            for(i = 0; i<SIZE; i++)
            {
                stuff[i] = 1;
            }
        }
        printf("thread_id:%d\n", tid);
    }
    return 0;
}

That said, even this fails.

#include <stdio.h>
#include <omp.h>

#define SIZE 100
int main(int argc, char * argv[])
{
    int stuff[SIZE];
    int limit = omp_get_thread_limit();
    printf("limit:%d\n", limit);
#pragma omp parallel shared(stuff)
    {
        int tid = omp_get_thread_num();
        printf("thread_id:%d\n", tid);
    }
    int i;
#pragma acc region for copy(stuff)
    for(i = 0; i<SIZE; i++)
    {
        stuff[i] = 1;
    }
    return 0;
}

This seems to be the same bug as in the thread “PGI ACC release 11.0: Multiple GPUs using openmp”, since the code works fine if I remove either region but not at all if both stay; unfortunately, there is no resolution there. Any ideas what might be going wrong?

Platform details:
2 C2050 GPUs
2 six-core Intel CPUs
Chaos Linux 2.6.18-107
PGI Accelerator 11.10

Hi njustn,

This is an odd one; as with the other forum post, both programs work fine for me. Hence, I suspect it’s something specific to your system.

The “_mp_penter64” symbol is the entry point for a parallel region, and one of the things it does is set up the threads’ stack space. It could be that you’re getting a stack overflow. Try raising your shell’s stack size limit, or set the environment variable “OMP_STACKSIZE” to a large value (or unlimited).

If that doesn’t work, please compile with “-Manno -Mkeepasm” to save the generated assembly code to a “.s” file. Open the file and look for the call to “_mp_penter”. Just before this call there is an instruction that moves a data-initialization symbol, named something like “.prvt0001”, into a register to be passed to “_mp_penter”. Find this symbol towards the bottom of the file and please let me know what the data initialization values are.

For example:

##  lineno: 8
..LN3:

        xorl    %esi, %esi
        movq    .prvt0001(%rip), %rdi
        call    _mp_penter
....
.prvt0001:
        .align  8
        .long   336
        .long   0

Also, please rerun the program in gdb and get the disassembly so we can determine the exact instruction where the segv occurs. The values of the registers would also be helpful.

Thanks,
Mat

Hi mkcolg,

I tested the OMP_STACKSIZE option with some absurdly large values, and it still blew up. The values you requested are below, all for the code listed last in my original post.

.prvt0001:
        .align  8
        .long   256
        .long   0
        .globl  __pgi_cu_alloc
        .globl  __pgi_cu_close
        .globl  __pgi_cu_free
        .globl  __pgi_cu_downloadx
        .globl  __pgi_cu_launch2
        .globl  __pgi_cu_paramset
        .globl  __pgi_cu_datadone
        .globl  __pgi_cu_uploadx
        .globl  __pgi_cu_module_function3
        .globl  __pgi_cu_module3
        .globl  __pgi_cu_init
        .globl  _mp_pexit
        .globl  omp_get_thread_num
        .globl  _mp_penter
        .globl  printf
        .globl  omp_get_thread_limit
        .data
        .align  8
        .globl  __pgdbg_stub
        .quad   __pgdbg_stub
        .text

Also, the disassembled asm is below; it segfaulted at 0x00000000004105df.

(gdb) disassemble _mp_penter64
Dump of assembler code for function _mp_penter64:
0x00000000004105c8 <_mp_penter64>:    push   %rbp
0x00000000004105c9 <_mp_penter64>:    mov    %rsp,%rbp
0x00000000004105cc <_mp_penter64>:    push   %rdi
0x00000000004105cd <_mp_penter64>:    push   %rsi
0x00000000004105ce <_mp_penter64>:    callq  0x40ed6a <_mp_init>
0x00000000004105d3 <_mp_penter64>:   pop    %rsi
0x00000000004105d4 <_mp_penter64>:   pop    %rdi
0x00000000004105d5 <_mp_penter64>:   sub    %rdi,%rsp
0x00000000004105d8 <_mp_penter64>:   sub    $0x100,%rsp

(dies here)
0x00000000004105df <_mp_penter64>:   mov    %rdi,0xb8(%rsp)

0x00000000004105e7 <_mp_penter64>:   mov    0x8(%rbp),%rdi
0x00000000004105eb <_mp_penter64>:   mov    %rdi,0xb0(%rsp)
0x00000000004105f3 <_mp_penter64>:   mov    0x0(%rbp),%rdi
0x00000000004105f7 <_mp_penter64>:   mov    %rdi,0x98(%rsp)
0x00000000004105ff <_mp_penter64>:   movq   $0x0,0xa8(%rsp)
0x000000000041060b <_mp_penter64>:   movq   $0x1,0x10(%rsp)
0x0000000000410614 <_mp_penter64>:   mov    %rsi,0x48(%rsp)
0x0000000000410619 <_mp_penter64>:   callq  0x40db7d <_mp_get_par>
0x000000000041061e <_mp_penter64>:   cmp    $0x0,%rax
0x0000000000410622 <_mp_penter64>:   jne    0x4106cd <_mp_penter64>
0x0000000000410628 <_mp_penter64>:   cmpq   $0x0,0x48(%rsp)
0x000000000041062e <_mp_penter64>:  jne    0x4106f4 <_mp_penter64>
0x0000000000410634 <_mp_penter64>:  callq  0x40db4e <_mp_get_tcpus>
0x0000000000410639 <_mp_penter64>:  cmp    $0x1,%rax
0x000000000041063d <_mp_penter64>:  je     0x4106dd <_mp_penter64>
0x0000000000410643 <_mp_penter64>:  mov    %rax,0x10(%rsp)
0x0000000000410648 <_mp_penter64>:  mov    %rbx,0x254461(%rip)        # 0x664ab0 <x_rbx>
0x000000000041064f <_mp_penter64>:  mov    %r12,0x254462(%rip)        # 0x664ab8 <x_r12>
0x0000000000410656 <_mp_penter64>:  mov    %r13,0x254463(%rip)        # 0x664ac0 <x_r13>
0x000000000041065d <_mp_penter64>:  mov    %r14,0x254464(%rip)        # 0x664ac8 <x_r14>
0x0000000000410664 <_mp_penter64>:  mov    %r15,0x254465(%rip)        # 0x664ad0 <x_r15>
0x000000000041066b <_mp_penter64>:  mov    0x98(%rsp),%rdi
0x0000000000410673 <_mp_penter64>:  mov    %rdi,0x25442e(%rip)        # 0x664aa8 <x_orbp>
0x000000000041067a <_mp_penter64>:  mov    0xb0(%rsp),%rdi
0x0000000000410682 <_mp_penter64>:  mov    %rdi,0x254417(%rip)        # 0x664aa0 <x_oret>
0x0000000000410689 <_mp_penter64>:  mov    0x10(%rsp),%rdi
0x000000000041068e <_mp_penter64>:  mov    %rdi,0x2543fb(%rip)        # 0x664a90 <x_ncpu>
0x0000000000410695 <_mp_penter64>:  mov    0xb8(%rsp),%rdi
0x000000000041069d <_mp_penter64>:  mov    %rdi,0x2543f4(%rip)        # 0x664a98 <x_priv>
0x00000000004106a4 <_mp_penter64>:  fnstcw 0x2543de(%rip)        # 0x664a88 <x_fpuc>
0x00000000004106aa <_mp_penter64>:  mov    $0x2,%rdi
0x00000000004106b1 <_mp_penter64>:  callq  0x40db5a <_mp_set_par>
0x00000000004106b6 <_mp_penter64>:  mov    $0x0,%rdi
0x00000000004106bd <_mp_penter64>:  callq  0x40e3ef <_mp_barrierr>
0x00000000004106c2 <_mp_penter64>:  movq   $0x0,0x8(%rsp)
0x00000000004106cb <_mp_penter64>:  jmp    0x4106fd <_mp_penter64>
0x00000000004106cd <_mp_penter64>:  callq  0x40dbc0 <_mp_penter_d>
0x00000000004106d2 <_mp_penter64>:  movq   $0x2,0x8(%rsp)
0x00000000004106db <_mp_penter64>:  jmp    0x4106fd <_mp_penter64>
0x00000000004106dd <_mp_penter64>:  mov    $0x1,%rdi
0x00000000004106e4 <_mp_penter64>:  callq  0x40db5a <_mp_set_par>
0x00000000004106e9 <_mp_penter64>:  movq   $0x1,0x8(%rsp)
0x00000000004106f2 <_mp_penter64>:  jmp    0x4106fd <_mp_penter64>
0x00000000004106f4 <_mp_penter64>:  movq   $0x3,0x8(%rsp)
0x00000000004106fd <_mp_penter64>:  mov    0x98(%rsp),%rbp
0x0000000000410705 <_mp_penter64>:  mov    0xb0(%rsp),%r11
0x000000000041070d <_mp_penter64>:  push   %r11
0x000000000041070f <_mp_penter64>:  retq   
End of assembler dump.

I forgot to add the register values; they are below.

(gdb) info registers
rax            0x1      1
rbx            0x2aaaaacc7bc0   46912498334656
rcx            0x0      0
rdx            0xf4240  1000000
rsi            0x0      0
rdi            0x10000000000    1099511627776
rbp            0x7fffffffded0   0x7fffffffded0
rsp            0x7effffffddd0   0x7effffffddd0
r8             0x2aaaab8c42f0   46912510903024
r9             0x7fffffffc18a   140737488339338
r10            0x0      0
r11            0x2aaaab5a46b0   46912507627184
r12            0x0      0
r13            0x7fffffffe1c0   140737488347584
r14            0x0      0
r15            0x0      0
rip            0x4105df 0x4105df <_mp_penter64>
eflags         0x10202  [ IF RF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0
fctrl          0x37f    895
fstat          0x0      0
ftag           0xffff   65535
fiseg          0x0      0
fioff          0x0      0
foseg          0x0      0
fooff          0x0      0
fop            0x0      0
mxcsr          0x1fc0   [ DAZ IM DM ZM OM UM PM ]
(dies here)
0x00000000004105df <_mp_penter64>:   mov    %rdi,0xb8(%rsp)

This is definitely a stack overflow, since the segv occurs on an instruction that references the stack pointer (rsp). Can you check your shell’s stack size limit?
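
If it helps, you can also check what limit the process itself sees at run time, independent of the shell. Here is a quick standalone sketch using the standard POSIX getrlimit call (this is not part of the PGI runtime or of your test code, just a separate check):

#include <stdio.h>
#include <sys/resource.h>

/* Print the stack-size limits the running process actually sees. */
int main(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_STACK, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    if (rl.rlim_cur == RLIM_INFINITY)
        printf("soft stack limit: unlimited\n");
    else
        printf("soft stack limit: %llu bytes\n", (unsigned long long)rl.rlim_cur);
    if (rl.rlim_max == RLIM_INFINITY)
        printf("hard stack limit: unlimited\n");
    else
        printf("hard stack limit: %llu bytes\n", (unsigned long long)rl.rlim_max);
    return 0;
}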

  • Mat

It is set to unlimited. Note that the “limit” print from the program is the limit on the number of threads and has no relation to the stack. I’m using the zsh shell.

$ ulimit -s
unlimited
$ ./test
limit:64
[1]    15017 segmentation fault (core dumped)  ./test

This is probably a wild idea and won’t work, but it’s simple to test. You might try adding “limit stacksize unlimited” and/or “ulimit -s unlimited” to the dotfiles of every shell you can think of (using the correct command for that shell’s syntax). In the past, I’ve thought I was running bash when my script was actually running a csh script unknown to me, or vice versa.

Or, perhaps, try a large but finite number for the stack size. I’ve run into that too, where “unlimited” freaks out a code but it’s fine with a 3 GB stack size.

It was worth a try. I made sure to run the code directly in my shell rather than submitting it through the job submission system. Admittedly, the system on which I can do that has no GPUs, but right now the results are the same whether or not the GPUs exist.

$ ulimit -s 3000000
$ ulimit -s
3000000
$ ./test
limit:64
[1]    31848 segmentation fault (core dumped)  ./test
$ ulimit -s 3000000000
$ ulimit -s
3000000000
$ ./test
limit:64
[1]    31856 segmentation fault (core dumped)  ./test
$ ulimit -s unlimited
$ ./test                                                                                
limit:64
[1]    31863 segmentation fault (core dumped)  ./test

Is there anything else I can provide that might help? Perhaps the binary itself, so you could check whether there’s a library conflict somewhere or something similar?

Hi njustn,

We’re at a loss. The next step would be for us to try to install Chaos Linux here and see if we can recreate the problem, or, better yet, could we get access to your system? If so, let me know and I’ll contact you via email. Note that I’ll be at the Supercomputing Conference next week, so I won’t be able to work on this until the week after.

  • Mat

Unfortunately, I’m pretty sure there’s no way to get you access to this particular system. That said, I will also be at Supercomputing; if you have time one day next week, perhaps we could meet and try to work it out, as I will have access from there.

I’m going to assume that the protracted silence means nobody is interested in taking a look while we’re all in Seattle. Is there anything I can provide that might help looking into this? A statically linked binary, for example, or anything like that?

Hi njustn,

As we talked about at SC11, the next step will be for us to get Chaos Linux installed here to see if we can recreate the problem.

  • Mat

Hi,

On a whim, I decided to try this test again on a system I’ve been using for some months now. I should say I’ve been using it with applications that have both accelerator regions and regular parallel regions, and I was actually trying to track down an issue that causes a segfault whenever you nest an “acc region” inside a parallel region without a data region as a buffer in between (I’ll post another thread about that if I can reduce the problem down; a rough sketch of what I mean by that nesting appears after the test program below). To my utter bewilderment, this test fails in exactly the same way on this system, running Ubuntu Lucid and the PGI 12.5 compiler, as on the original Chaos-based system. I thought it would be a good place to start reducing the other problem, but somehow the tiny example program copied below still gives me exactly the same error, despite the fact that I have other programs that work just fine using both.

compiled with:

pgcc -mp=allcores  -O3 -fast -Minfo=accel,mp  -DPGI -I/opt/pgi/linux86-64/2012/cuda/4.1/include -I/opt/pgi/linux86-64/2012/include_acc -ta=nvidia,keepgpu,keepptx,nofma -c99 -L/opt/pgi/linux86-64/2012/cuda/4.1/lib64 -lcuda -lcudart -lm -ldl -lcolamd /usr/lib/liblpsolve55.a   -o test test.c

code:

#include <stdio.h>
#include <omp.h>

#define SIZE 100
int main(int argc, char * argv[])
{ 
    int stuff[SIZE]; 
    int limit = omp_get_thread_limit(); 
    printf("limit:%d\n", limit); 
#pragma omp parallel shared(stuff) 
    { 
        int tid = omp_get_thread_num(); 
        printf("thread_id:%d\n", tid); 
    } 
    int i; 
#pragma acc region for copy(stuff) 
    for(i = 0; i<SIZE; i++) 
    { 
        stuff[i] = 1; 
    } 
    return 0; 
}
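
As an aside, here is roughly what I mean by the nesting mentioned above. This is just a sketch with made-up names (nested_example, the array a), not code from my actual application; the data region acts as the “buffer” between the OpenMP parallel region and the accelerator region.

#include <stdio.h>
#include <omp.h>

#define N 1000

/* Illustrative only: an acc region nested inside an OpenMP parallel region,
   with an acc data region wrapped around it as the "buffer". */
void nested_example(float a[N])
{
#pragma omp parallel
    {
        /* only one thread drives the GPU in this sketch */
        if (omp_get_thread_num() == 0) {
            int i;
#pragma acc data region copy(a)
            {
#pragma acc region for
                for (i = 0; i < N; i++)
                {
                    a[i] = a[i] * 2.0f;
                }
            }
        }
    }
}

int main(void)
{
    float a[N] = {0};
    nested_example(a);
    printf("a[0] = %f\n", a[0]);
    return 0;
}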

After running into that same failure again, I tried a few things to see what made it work in my other applications. It appears that the function using an acc region has to be in a different C file… I have no clue whatsoever why, but while the above fails, the version below works.

compile with:

pgcc -mp=allcores  -O3 -fast -Minfo=accel,mp  -DPGI -I/opt/pgi/linux86-64/2012/cuda/4.1/include -I/opt/pgi/linux86-64/2012/include_acc -ta=nvidia,keepgpu,keepptx,nofma -c99 -L/opt/pgi/linux86-64/2012/cuda/4.1/lib64 -lcuda -lcudart -lm -ldl -lcolamd /usr/lib/liblpsolve55.a   -o test test.c test2.c

test.c:

#include <stdio.h>
#include <omp.h>

#define SIZE 100
void arbitraryFunc(int stuff[SIZE]);
int main(int argc, char * argv[])
{
    int stuff[SIZE] = {0};
    int limit = omp_get_thread_limit();
    printf("limit:%d\n", limit);
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        printf("thread_id:%d\n", tid);
    }
    arbitraryFunc(stuff);
    return stuff[0];
}

test2.c:

#define SIZE 100
void arbitraryFunc(int stuff[SIZE]){
    int i;
#pragma acc data region copy(stuff)
    {
#pragma acc region for
        for(i = 0; i<SIZE; i++)
        {
            stuff[i] = 1;
        }
    }
}

Having said that, I have another program, which I can send along through another channel if you like, that works with everything in the same file and both nested together. Has anyone else run into this?

Has anyone else run into this?

Yes. We got a similar report on July 6th where the user’s code was getting a seg fault when using an OpenMP region preceded by an OpenACC region. I filed this as TPR#18802, and it no longer occurs in our 12.6 pre-release compilers.

Though, for some reason, I’m not able to recreate the issue using the code you posted. It seems to work for me no matter what compiler version or system I use. For now, let’s assume your issue is the same as TPR#18802 and that it will be fixed in 12.6. If it still fails for you in 12.6, let me know and I’ll pursue it further.

  • Mat