I tried your code and it worked fine for me (see below). Does it still fail if you remove the OpenMP code? Does it fail if you remove the accelerator region? What is the output of the command ‘pgaccelinfo’?
Mat
% pgcc test.c -o test1.out -mp -fast -Minfo -ta=nvidia,time -V11.0 ; test1.out
main:
16, Parallel region activated
21, Parallel region terminated
Generating copy(a[0:127])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
24, Loop is parallelizable
Accelerator kernel generated
24, #pragma acc for parallel, vector(128) /* blockIdx.x threadIdx.x */
CC 1.3 : 4 registers; 28 shared, 4 constant, 0 local memory bytes; 100% occupancy
CC 2.0 : 8 registers; 4 shared, 40 constant, 0 local memory bytes; 66% occupancy
Hello World from thread 0
Hello World from thread 1
Accelerator Kernel Timing data
/tmp/qa/test.c
main
21: region entered 1 time
time(us): total=139074 init=138626 region=448
kernels=22 data=49
w/o init: total=448 max=448 min=448 avg=448
24: kernel launched 1 times
grid: [1] block: [128]
time(us): total=22 max=22 min=22 avg=22
Thanks, Mat! But it still fails.
Output of pgaccelinfo:
-bash-3.2$ pgaccelinfo
CUDA Driver Version: 3020
Device Number: 0
Device Name: GeForce GTX 480
Device Revision Number: 2.0
Global Memory Size: 1610285056
Number of Multiprocessors: 15
Number of Cores: 480
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1401 MHz
Initialization time: 1171282 microseconds
Current free memory: 1501691904
Upload time (4MB): 1030 microseconds ( 722 ms pinned)
Download time: 1418 microseconds (1176 ms pinned)
Upload bandwidth: 4072 MB/sec (5809 MB/sec pinned)
Download bandwidth: 2957 MB/sec (3566 MB/sec pinned)
Device Number: 1
Device Name: GeForce GTX 285
Device Revision Number: 1.3
Global Memory Size: 1073545216
Number of Multiprocessors: 30
Number of Cores: 240
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 16384
Registers per Block: 16384
Warp Size: 32
Maximum Threads per Block: 512
Maximum Block Dimensions: 512, 512, 64
Maximum Grid Dimensions: 65535 x 65535 x 1
Maximum Memory Pitch: 2147483647B
Texture Alignment: 256B
Clock Rate: 1476 MHz
Initialization time: 1171282 microseconds
Current free memory: 1032916736
Upload time (4MB): 1147 microseconds ( 799 ms pinned)
Download time: 2230 microseconds (2028 ms pinned)
Upload bandwidth: 3656 MB/sec (5249 MB/sec pinned)
Download bandwidth: 1880 MB/sec (2068 MB/sec pinned)
CAL version: 1.4-553
No attached CAL devices
If I remove the OpenMP region:
main
16: region entered 1 time
time(us): total=2082729 init=2082413 region=316
kernels=42 data=26
w/o init: total=316 max=316 min=316 avg=316
19: kernel launched 1 times
grid: [1] block: [128]
time(us): total=42 max=42 min=42 avg=42
If I remove the acc region:
pgcc test.c -o test1.out -mp -fast -Minfo -ta=nvidia,time -V11.0 ; ./test1.out
main:
16, Parallel region activated
22, Parallel region terminated
Hello World from thread 1
Hello World from thread 0
It still fails if both regions are present. Could there be some conflict between them?