Invalid cast opcode for typedef variable


I’ve tried to compile the following code using the PGI compiler from SDK/22.7.
It seems that PGI is trying to implicitly convert a user-defined type to a pointer type.
The relevant definition is:

typedef struct complextype {
	float real, imag;
} Compl;

Compiling produces the following error message:

$ pgcc -mp -acc -Minfo=accel  mandelbrot_openacc5.c 

     48, Generating create(res[:][:]) [if not already present]
     54, Generating implicit private(j)
         Generating NVIDIA GPU code
         54, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
         55, #pragma acc loop seq
     54, Local memory used for z,c
     55, Complex loop carried dependence of __nv_main_F1L44_2_3->->,res prevents parallelization
         Loop carried scalar dependence for j at line 59,72,73,75
         Generating implicit private(j)
         Complex loop carried dependence of __nv_main_F1L44_2_3->-> prevents parallelization
     64, Loop carried scalar dependence for k at line 68
         Scalar last value needed after loop for k at line 72
     79, Generating update self(res[start:block_size][:])
nvvmCompileProgram error 9: NVVM_ERROR_COMPILATION.
Error: /tmp/pgaccXDWmtiOa2yOg.gpu (123, 23): parse invalid cast opcode for cast from '%struct.DT1_284 = type <{ float, float }>' to 'float*'
NVC++-F-0155-Compiler failed to translate accelerator region (see -Minfo messages): Device compiler exited with error status code (mandelbrot_openacc5.c: 1)
NVC++/x86-64 Linux 22.7-0: compilation aborted

I do not observe this issue with PGI/19.1.
Perhaps the treatment of typedef variables has changed in recent versions of PGI.
Any clarification regarding this issue would be much appreciated.

mandelbrot_openacc5.c (2.2 KB)

These are the “z” and “c” variables which are being privatized. Basically an array of these variables, one per thread, is getting allocated on each device. So it’s not that these get converted to pointer types, but rather it’s a pointer to an array of these types.

The error is a device code generation problem. The issue is the privatization of OpenMP shared variables. The workaround is either to declare z and c within the loop body, so they are local variables instead of private, or to add them to a private clause on the OpenMP parallel construct.

For example:

int end = start + (Y_RESN/num_blocks);
#pragma acc parallel loop private(k,temp,lengthsq) async(block%2)
for (i = start; i < end; i++) {
    for (j = 0; j < X_RESN; j++) {
        Compl z, c;
        z.real = 0.0;
        z.imag = 0.0;


Or, to instead privatize them on the OpenMP parallel construct:

#pragma omp parallel num_threads(num_gpus) private(z,c)
    int my_gpu = omp_get_thread_num();

Note that I typically recommend folks use MPI rather than OpenMP for multi-GPU support. With OpenMP, you need to do the domain decomposition yourself, which isn’t natural, as opposed to MPI where it is. It’s not too difficult with this example, but it becomes difficult for larger codes.

Second, with OpenMP you’re limited to a single system, while MPI allows you to go multi-node, and you can use CUDA-aware MPI to improve communication between ranks. In other words, if you’re going through the trouble of doing the domain decomposition, you should get the benefit of being able to run multi-node.

Finally, with OpenMP it can be very tricky to merge a shared host array from the multiple copies of the array, one on each GPU, and you waste memory on each GPU for the portions of the array that aren’t used. Managing halos is particularly difficult to get correct.

Hope this helps,


Thanks very much for the detailed answers.
The code now compiles without issue.
I will heed your advice regarding the usage of OpenMP.

