PGI ACC release 11.0: Multiple GPUs using OpenMP

I have a very simple test program for my workstation, which has two GPU cards:

#include <stdio.h>
#include <stdlib.h>
#include <accel.h>
#include <omp.h>
#define N  128

int main(int argc, char **argv) {

	int i; 
	
	double *restrict a;
	a=(double *restrict)malloc(N*sizeof(double));
	omp_set_num_threads(2);
	
#pragma omp parallel
{ 
	int th_id;
	th_id = omp_get_thread_num();
	printf("Hello World from thread %d\n", th_id);
}
#pragma acc region
{ 

	for (i = 0; i < N; i++) {
		a[i] *= 2.0;
	}
} 
return 0; 
}

It was then compiled with the command:

pgcc test.c -o test -mp -fast -Minfo -ta=nvidia,time

The compiler (PGI ACC release 11.0) produced the following output, with no errors or warnings:

main:
     16, Parallel region activated
     21, Parallel region terminated
         Generating copy(a[0:127])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
         Accelerator kernel generated
         24, #pragma acc for parallel, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 4 registers; 28 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 40 constant, 0 local memory bytes; 66% occupancy

When I run the binary, I get a segmentation fault.

Could anyone tell me what the problem might be?

Hi pengc,

I tried your code and it worked fine for me. (See below). Does it still fail if you remove the OpenMP code? Does it fail if you remove the accelerator region? What is the output of the command ‘pgaccelinfo’?

- Mat
% pgcc test.c -o test1.out -mp -fast -Minfo -ta=nvidia,time -V11.0 ; test1.out
main:
     16, Parallel region activated
     21, Parallel region terminated
         Generating copy(a[0:127])
         Generating compute capability 1.3 binary
         Generating compute capability 2.0 binary
     24, Loop is parallelizable
         Accelerator kernel generated
         24, #pragma acc for parallel, vector(128) /* blockIdx.x threadIdx.x */
             CC 1.3 : 4 registers; 28 shared, 4 constant, 0 local memory bytes; 100% occupancy
             CC 2.0 : 8 registers; 4 shared, 40 constant, 0 local memory bytes; 66% occupancy
Hello World from thread 0
Hello World from thread 1

Accelerator Kernel Timing data
/tmp/qa/test.c
  main
    21: region entered 1 time
        time(us): total=139074 init=138626 region=448
                  kernels=22 data=49
        w/o init: total=448 max=448 min=448 avg=448
        24: kernel launched 1 times
            grid: [1]  block: [128]
            time(us): total=22 max=22 min=22 avg=22
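The isolation steps Mat suggests can be made easier to read off with a slightly more defensive version of the test. The following is only a sketch, not a confirmed fix: it checks the allocation, gives the array known values (the original test leaves it uninitialized), prints progress to stderr so the last message before a crash shows which region was reached, and verifies the result on the host. The helper name run_scale_test is hypothetical. The omp.h include is guarded by _OPENMP so the file also builds serially with a plain C compiler, where the acc pragma is simply ignored.

```c
#include <stdio.h>
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: runs the test on n elements and returns the number
   of wrong results, or -1 if the allocation fails. */
int run_scale_test(int n) {
    int i, bad = 0;
    double *restrict a = (double *restrict)malloc(n * sizeof(double));
    if (a == NULL) return -1;

    /* Known input values, so the result can be checked afterwards. */
    for (i = 0; i < n; i++) a[i] = (double)i;

    /* stderr is unbuffered, so these messages survive a crash. */
    fprintf(stderr, "before OpenMP region\n");
#pragma omp parallel
    {
        int th_id = 0;
#ifdef _OPENMP
        th_id = omp_get_thread_num();
#endif
        fprintf(stderr, "hello from thread %d\n", th_id);
    }

    fprintf(stderr, "before accelerator region\n");
#pragma acc region
    {
        for (i = 0; i < n; i++) {
            a[i] *= 2.0;
        }
    }
    fprintf(stderr, "after accelerator region\n");

    /* Host-side check: every element should have been doubled. */
    for (i = 0; i < n; i++) {
        if (a[i] != 2.0 * i) bad++;
    }
    free(a);
    return bad;
}
```

Calling run_scale_test(128) from main and printing the return value should give 0 when both regions complete. If the program dies, the last line on stderr indicates how far it got.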

Thanks, Mat! But it still fails.
Output of pgaccelinfo:

-bash-3.2$ pgaccelinfo
CUDA Driver Version:           3020

Device Number:                 0
Device Name:                   GeForce GTX 480
Device Revision Number:        2.0
Global Memory Size:            1610285056
Number of Multiprocessors:     15
Number of Cores:               480
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1401 MHz
Initialization time:           1171282 microseconds
Current free memory:           1501691904
Upload time (4MB):             1030 microseconds ( 722 ms pinned)
Download time:                 1418 microseconds (1176 ms pinned)
Upload bandwidth:              4072 MB/sec (5809 MB/sec pinned)
Download bandwidth:            2957 MB/sec (3566 MB/sec pinned)

Device Number:                 1
Device Name:                   GeForce GTX 285
Device Revision Number:        1.3
Global Memory Size:            1073545216
Number of Multiprocessors:     30
Number of Cores:               240
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 16384
Registers per Block:           16384
Warp Size:                     32
Maximum Threads per Block:     512
Maximum Block Dimensions:      512, 512, 64
Maximum Grid Dimensions:       65535 x 65535 x 1
Maximum Memory Pitch:          2147483647B
Texture Alignment:             256B
Clock Rate:                    1476 MHz
Initialization time:           1171282 microseconds
Current free memory:           1032916736
Upload time (4MB):             1147 microseconds ( 799 ms pinned)
Download time:                 2230 microseconds (2028 ms pinned)
Upload bandwidth:              3656 MB/sec (5249 MB/sec pinned)
Download bandwidth:            1880 MB/sec (2068 MB/sec pinned)
CAL version:                   1.4-553
No attached CAL devices

If I remove the OpenMP region:

main
    16: region entered 1 time
        time(us): total=2082729 init=2082413 region=316
                  kernels=42 data=26
        w/o init: total=316 max=316 min=316 avg=316
        19: kernel launched 1 times
            grid: [1]  block: [128]
            time(us): total=42 max=42 min=42 avg=42

If I remove the acc region:

pgcc test.c -o test1.out -mp -fast -Minfo -ta=nvidia,time -V11.0 ; ./test1.out
main:
     16, Parallel region activated
     22, Parallel region terminated
Hello World from thread 1
Hello World from thread 0

It still fails if both regions are present. Could there be some conflict between them?
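Since the thread title is about using multiple GPUs from OpenMP, one thing that may be worth trying is binding each OpenMP thread to its own device before it enters an accelerator region. The sketch below is not a confirmed fix for the segfault. It assumes the PGI runtime routines acc_get_num_devices and acc_set_device_num from accel.h, and it assumes device numbers match the 0-based numbers that pgaccelinfo reports. The helper names device_for_thread, scale_chunk, and scale_on_all_devices are made up for illustration. PGI-specific calls are guarded by _ACCEL and OpenMP calls by _OPENMP, so the file also builds serially with a plain C compiler.

```c
#ifdef _OPENMP
#include <omp.h>
#endif
#ifdef _ACCEL
#include <accel.h>
#endif

/* Hypothetical helper: round-robin mapping of a thread id onto the
   available devices. Returns -1 when there is no device to use. */
int device_for_thread(int thread_id, int num_devices) {
    if (num_devices <= 0 || thread_id < 0) return -1;
    return thread_id % num_devices;
}

/* Hypothetical helper: double n elements of a. With pgcc -ta=nvidia the
   loop is a candidate for offload; elsewhere the pragma is ignored. */
void scale_chunk(double *restrict a, int n) {
    int i;
#pragma acc region
    {
        for (i = 0; i < n; i++) {
            a[i] *= 2.0;
        }
    }
}

/* Split a[0..n) across the threads; each thread selects a device first
   (assumption: 0-based device numbers, as listed by pgaccelinfo). */
void scale_on_all_devices(double *a, int n) {
    int num_devices = 0;
#ifdef _ACCEL
    num_devices = acc_get_num_devices(acc_device_nvidia);
#endif
#pragma omp parallel
    {
        int tid = 0, nthreads = 1;
        int dev, lo, hi;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nthreads = omp_get_num_threads();
#endif
        dev = device_for_thread(tid, num_devices);
#ifdef _ACCEL
        if (dev >= 0) acc_set_device_num(dev, acc_device_nvidia);
#endif
        (void)dev; /* unused in builds without _ACCEL */

        /* Contiguous, non-overlapping chunk for this thread. */
        lo = (int)((long)n * tid / nthreads);
        hi = (int)((long)n * (tid + 1) / nthreads);
        scale_chunk(a + lo, hi - lo);
    }
}
```

With pgcc this would be built with the same flags as above (-mp -ta=nvidia) and called as scale_on_all_devices(a, N) from main after a has been initialized. With two threads and the two cards listed by pgaccelinfo, thread 0 would select device 0 and thread 1 device 1.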