Hi all,
While experimenting with the accelerator directives, we were unable to get a simple "hello world"-style program to compile with a properly accelerated loop. The problem seems to lie with our dynamically allocated arrays, although I assumed the copyin() and copyout() clauses would handle this. Here is the code:
#include <stdlib.h>
// compile with: pgcc test01.c -ta=nvidia -Minfo
int main (int argc, char** argv)
{
    int i;
    float *a, *b;
    float c[1000];
    a = (float*) malloc (1000 * sizeof (float));
    b = (float*) malloc (1000 * sizeof (float));
    for (i = 0; i < 1000; i++)
        b[i] = 1;
    // this region does not parallelize!?
    #pragma acc region copyin(b[0:999]) copyout(a[0:999])
    {
        for (i = 0; i < 1000; i++)
            a[i] = b[i] + 1;
    }
    // this region does parallelize??
    #pragma acc region
    {
        for (i = 0; i < 1000; i++)
            c[i] = a[i] + 1;
    }
    return EXIT_SUCCESS;
}
When compiling with:
pgcc test01.c -ta=nvidia -Minfo
We get the following -Minfo output:
main:
14, Memory set idiom, loop replaced by call to __c_mset4
17, No parallel kernels found, accelerator region ignored
19, Complex loop carried dependence of 'b' prevents parallelization
Loop carried dependence of 'a' prevents parallelization
Loop carried backward dependence of 'a' prevents vectorization
23, Generating copyin(a[0:999])
Generating copyout(c[0:999])
Generating compute capability 1.0 binary
Generating compute capability 1.3 binary
25, Loop is parallelizable
Accelerator kernel generated
25, #pragma acc for parallel, vector(256)
CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy
CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy
Am I crazy or just blind? It seems like it should work assuming you give it the proper copy() directive to tell the compiler how big the variables are.
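One possible reading of the "loop carried dependence" messages, offered as a guess rather than a confirmed diagnosis: a and b are plain pointers, so the compiler may have to assume they can overlap, whereas c is a real array with known storage. The copyin()/copyout() clauses describe sizes, not aliasing. Below is a minimal sketch of the first loop with C99 restrict-qualified pointers, which promise the compiler that the two buffers do not overlap. The helper name add_one and its parameter order are made up for illustration, and depending on the compiler version a C99 mode may be needed for the restrict keyword.

```c
/* Hypothetical helper, not from the original program: the first loop
   moved into a function whose pointer parameters are restrict-qualified.
   restrict is a promise by the caller that a and b never overlap. */
void add_one(int n, const float *restrict b, float *restrict a)
{
    int i;
    /* Same directive style as in the post, with [lower:upper] bounds.
       A compiler without accelerator support ignores this pragma and
       runs the loop on the host. */
    #pragma acc region copyin(b[0:n-1]) copyout(a[0:n-1])
    {
        for (i = 0; i < n; i++)
            a[i] = b[i] + 1.0f;
    }
}
```

In the program above this would be called as add_one(1000, b, a) in place of the first region. Whether this alone is enough to get a kernel generated is an assumption that would need checking against the -Minfo output.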