I would be grateful for any clues on this, particularly whether it is an install problem or a fault in my OpenACC code.
Here’s the OpenACC bit:
#pragma acc data copy(arrC)
#pragma acc kernels
for (j = 0; j < sz; j++) {
    for (i = 0; i < sz; i++) {
        arrC[j][i] = arrA[j][i]*alpha + arrB[j][i];
    }
}
arrA, arrB and arrC are all sz x sz arrays, with sz=10, so nothing huge.
The compiler generates what I’d expect:
pgcc -o basic basic.c -Minfo=accel,time -acc -ta=nvidia
main:
35, Generating copy(arrC[:][:])
36, Generating copyin(arrA[:10][:10])
Generating copyin(arrB[:10][:10])
Generating compute capability 1.3 binary
Generating compute capability 2.0 binary
37, Loop is parallelizable
38, Loop is parallelizable
Accelerator kernel generated
37, #pragma acc loop gang, vector(3) /* blockIdx.y threadIdx.y */
38, #pragma acc loop gang, vector(10) /* blockIdx.x threadIdx.x */
CC 1.3 : 13 registers; 116 shared, 12 constant, 0 local memory bytes; 25% occupancy
CC 2.0 : 15 registers; 8 shared, 124 constant, 0 local memory bytes; 16% occupancy
Timing stats:
init 50 millisecs 74%
expand 17 millisecs 25%
Total time 67 millisecs
The vector sizes are quite short but this is a toy example so no problems there.
However, when I run the code I get:
./basic
call to EventSynchronize returned error 700: Launch failed
CUDA driver version: 4010
After a number of attempts it does sometimes run, but the output array (arrC) is unchanged (I set it to 0 before the accelerator region).
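One way to confirm that arrC really is untouched, rather than just slightly wrong, is to compare it against the same expression computed on the host. A minimal sketch follows; count_mismatches is a hypothetical helper that is not part of basic.c, and tol is an arbitrary tolerance chosen by the caller.

```c
#include <stddef.h>

/* Hypothetical helper (not from basic.c): returns how many of the n
 * elements of c differ from a*alpha + b, recomputed on the host, by
 * more than tol. The arrays are treated as flat, row-major buffers. */
size_t count_mismatches(const float *c, const float *a, const float *b,
                        float alpha, size_t n, float tol)
{
    size_t k, bad = 0;

    for (k = 0; k < n; k++) {
        float want = a[k] * alpha + b[k];   /* host reference value */
        float diff = c[k] - want;
        if (diff < 0.0f)
            diff = -diff;                   /* absolute difference */
        if (diff > tol)
            bad++;
    }
    return bad;
}
```

For contiguous 10 x 10 arrays this could be called as count_mismatches(&arrC[0][0], &arrA[0][0], &arrB[0][0], alpha, 100, 1e-5f). A result of 100 on non-trivial inputs would indicate that nothing was written back at all.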
Interestingly, unsetting PGI_ACC_TIME changes this error to:
call to cuMemFree returned error 700: Launch failed
CUDA driver version: 4010
A few more attempts do eventually get it to run, but still with bad output.
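For anyone wanting to reproduce this, here is a cut-down sketch of the same loop with explicit extents on all three arrays. In the compiler feedback above, arrC is reported as copy(arrC[:][:]) with no extents, while arrA and arrB get [:10][:10]; whether that is related to the launch failure is only a guess. The static array declarations, the SZ macro and the axpy2d name are assumptions for illustration, since the declarations in basic.c are not shown above.

```c
#define SZ 10   /* matches sz=10 in the post */

/* Assumption: fixed-size global arrays, so the extents are known at
 * compile time. The real declarations in basic.c are not shown. */
float arrA[SZ][SZ], arrB[SZ][SZ], arrC[SZ][SZ];

/* Computes arrC = arrA*alpha + arrB, element by element. */
void axpy2d(float alpha)
{
    int i, j;

    /* Every array gets an explicit start:length pair per dimension,
     * so no extent is left for the compiler to infer. */
    #pragma acc data copyin(arrA[0:SZ][0:SZ], arrB[0:SZ][0:SZ]) copy(arrC[0:SZ][0:SZ])
    {
        #pragma acc kernels
        for (j = 0; j < SZ; j++) {
            for (i = 0; i < SZ; i++) {
                arrC[j][i] = arrA[j][i] * alpha + arrB[j][i];
            }
        }
    }
}
```

Built without -acc, the pragmas are ignored and the loop runs on the host, which gives a baseline to compare the accelerator output against.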
-Nick.