Basic Accelerator trouble with dynamically allocated arrays [C]

maximilian · October 7, 2010, 2:19pm

Hi all,

While experimenting with the accelerator directives, we were unable to get a simple “hello world” style program to compile with a properly accelerated loop. The problem seems to lie with our dynamically allocated arrays, although I assumed these would be handled by the copyin() and copyout() mechanism. Here is the code:

#include <stdlib.h>                                                             
                                                                                
// compile with: pgcc test01.c -ta=nvidia -Minfo                                
                                                                                
int main (int argc, char** argv)                                                
{                                                                               
    int i;                                                                      
    float *a, *b;                                                               
    float c[1000];                                                              
                                                                                
    a = (float*) malloc (1000 * sizeof (float));                                
    b = (float*) malloc (1000 * sizeof (float));                                
                                                                                
    for (i = 0; i < 1000; i++)                                                  
        b[i] = 1;                                                             
    
    // this region does not parallelize!?
    #pragma acc region copyin(b[0:999]) copyout(a[0:999])                       
    {                                                                           
        for (i = 0; i < 1000; i++)                                              
            a[i] = b[i] + 1;                                                    
    }                                                                           
    // this region does parallelize??
    #pragma acc region                                                          
    {                                                                           
        for (i = 0; i < 1000; i++)                                              
            c[i] = a[i] + 1;                                                    
    }                                                                           
                                                                                
    return EXIT_SUCCESS;                                                        
}

When compiling with:

pgcc test01.c -ta=nvidia -Minfo

We get the following compiler output:

main:                                                                           
     14, Memory set idiom, loop replaced by call to __c_mset4                   
     17, No parallel kernels found, accelerator region ignored                  
     19, Complex loop carried dependence of 'b' prevents parallelization        
         Loop carried dependence of 'a' prevents parallelization                
         Loop carried backward dependence of 'a' prevents vectorization         
     23, Generating copyin(a[0:999])                                            
         Generating copyout(c[0:999])                                           
         Generating compute capability 1.0 binary                               
         Generating compute capability 1.3 binary                               
     25, Loop is parallelizable                                                 
         Accelerator kernel generated                                           
         25, #pragma acc for parallel, vector(256)                              
             CC 1.0 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy
             CC 1.3 : 3 registers; 20 shared, 28 constant, 0 local memory bytes; 100 occupancy

Am I crazy or just blind? It seems like it should work assuming you give it the proper copy() directive to tell the compiler how big the variables are.

MatColgrove · October 9, 2010, 8:22am

Hi maximilian,

You need to declare your pointers using the C99 restrict keyword or compile with “-Msafeptr”. Otherwise, the compiler must assume that the pointers may overlap (alias each other), and hence the loop cannot be parallelized.

 float * restrict a, * restrict b;
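
As a rough sketch of how the restrict-qualified version might look, here is the first loop pulled into a small function. The function name add_one and its parameter n are made up for illustration and are not from the original program. The directive is guarded by the __PGI macro so that other compilers simply skip it, and the array bounds are written with n-1 on the assumption that the directive accepts expressions there.

```c
#include <assert.h>

/* Hypothetical helper (not from the original post): a[i] = b[i] + 1.
 * "restrict" is a promise to the compiler that a and b never overlap,
 * which removes the loop-carried dependence it otherwise has to assume. */
void add_one(int n, float *restrict a, const float *restrict b)
{
    int i;

    assert(n >= 0);

    /* Only emit the accelerator directive when building with a PGI
     * compiler; the bounds follow the [lower:upper] form used above. */
#if defined(__PGI)
    #pragma acc region copyin(b[0:n-1]) copyout(a[0:n-1])
#endif
    {
        for (i = 0; i < n; i++)
            a[i] = b[i] + 1.0f;
    }
}
```

Calling add_one(1000, a, b) with the two malloc'd arrays would stand in for the first region of the posted program. Note that restrict is only a promise: if the caller passes overlapping pointers, the behavior is undefined.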

Also, you may wish to use the flag “-Mfcon” or append “f” to your constant real values to make them single precision. By default, constant reals are double precision.
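
To make the constant-precision point concrete, here is a small sketch in plain C. The function names are hypothetical. It only shows the types the C language assigns to the constants; it does not show what "-Mfcon" does internally. The posted loop uses the integer constant 1, which is converted to float, so this matters once a real constant such as 1.0 is written instead.

```c
#include <assert.h>

/* Hypothetical self-check: an unsuffixed real constant has type double,
 * while the same constant with an "f" suffix has type float. */
void check_constant_types(void)
{
    assert(sizeof(1.0)  == sizeof(double));
    assert(sizeof(1.0f) == sizeof(float));
}

/* x + 1.0 converts x to double, adds in double, then narrows on return. */
float add_one_via_double(float x)
{
    return x + 1.0;
}

/* x + 1.0f keeps both operands in single precision. */
float add_one_single(float x)
{
    return x + 1.0f;
}
```

Both functions return the same value for simple inputs. The difference is the intermediate type, which can matter on hardware where double precision is slower or unavailable.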

Hope this helps,
Mat