13.8 Unexpected load/store type when using cache

Hi,
I'm using pgcc 13.8-0, 64-bit target, on x86-64.
I tested the matrix-multiply cache sample and it works well, but in my own program the compiler reports:

"PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)"



void compute(   double * profile, double * matrix, double * mean, 
                double * result, const int n_sample, const int gene_num, 
                const int cluster_num) {
        const int column = n_sample * n_sample;

#pragma acc data copyin( profile[0:gene_num*n_sample], \
                         matrix[0:cluster_num*column], \
                         mean[0:cluster_num*n_sample] ) \
                 copyout( result[0:gene_num*cluster_num] )
        
    #pragma acc kernels loop independent 
    for(int g = 0; g < gene_num; ++g) {
                
        #pragma acc loop independent  
        for(int c = 0; c < cluster_num; ++c) {
            double tmp = 0;

            #pragma acc cache(profile[g*n_sample:n_sample])
            #pragma acc loop reduction(+:tmp) 
            for(int i = 0; i < n_sample; ++i) {
                double t = 0.0;
                for(int j = 0; j < n_sample; ++j) {
                    t += ( profile[g*n_sample+j] - mean[c*n_sample+j] ) *
                            matrix[c*column+j * n_sample +i];
                }
                tmp += t * (profile[g*n_sample+i] - mean[c*n_sample+i]);
            }
            result[g * cluster_num + c] = tmp;
        }
    }

}

Line 37 is the first loop's directive:

 #pragma acc kernels loop independent

And the -Minfo output is:

PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)
compute:
     35, Generating copyin(profile[0:n_sample*gene_num])
         Generating copyin(matrix[0:column*cluster_num])
         Generating copyin(mean[0:n_sample*cluster_num])
         Generating copyout(result[0:gene_num*cluster_num])
     38, Loop is parallelizable
     40, Loop is parallelizable
         Accelerator kernel generated
         38, #pragma acc loop gang /* blockIdx.y */
         40, #pragma acc loop gang /* blockIdx.x */
         49, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
     51, Loop is parallelizable
PGC/x86-64 Linux 13.8-0: compilation completed with warnings

And if I replace

#pragma acc cache(profile[g*n_sample:n_sample])

with

#pragma acc cache(profile[0:10])

the -Minfo output becomes:

PGC-W-0155-Compiler failed to translate accelerator region (see -Minfo messages): Unexpected load/store type (Test.c: 37)
compute:
     35, Generating copyin(profile[0:n_sample*gene_num])
         Generating copyin(matrix[0:column*cluster_num])
         Generating copyin(mean[0:n_sample*cluster_num])
         Generating copyout(result[0:gene_num*cluster_num])
     38, Loop is parallelizable
     40, Loop is parallelizable
         Accelerator kernel generated
         38, #pragma acc loop gang /* blockIdx.y */
             Cached references to size [11] block of 'profile'
         40, #pragma acc loop gang /* blockIdx.x */
         49, #pragma acc loop vector(128) /* threadIdx.x */
         Loop is parallelizable
     51, Loop is parallelizable
PGC/x86-64 Linux 13.8-0: compilation completed with warnings

What's wrong with my code? The cache problem seems to have been solved in version 13.5.
Thank you in advance.
luxuia

Hi luxuia,

This is a known issue (TPR#19395) that we have an engineer actively working on. Unfortunately, it doesn't look like he'll have it fixed in time for 13.9, but hopefully by 13.10. In the meantime, please comment out the "cache" directive.

Best Regards,
Mat

Hi mkcolg,

Oh, my heart is breaking…
On-chip memory makes a huge contribution in GPU computing.
So, does the 'cache' issue arise only in the 13.x releases, or in all versions?
Maybe I can try an older version for better performance :).
Meanwhile, are the TPRs at http://www.pgroup.com/support/release_tprs_2013.htm problems that have been fixed, not problems that were found?
Thanks for your help.

Best Regards,
Luxuia

Hi Luxuia,

On-chip memory makes a huge contribution in GPU computing.

Actually, software-managed shared memory is only critical on the C1060 (cc1.3). For later devices, NVIDIA added hardware caching, which diminishes the need for the program to manage this memory. It can still help, but not as much as before.

So, does the 'cache' issue arise only in the 13.x releases, or in all versions?

While this particular error is fairly new (it appeared around 13.4, I think), the cache directive has been problematic in general. We're working on it.

Maybe I can try an older version for better performance :).

Possibly, but for a different reason than the cache directive. In 13.1 we started to pin memory for better data-transfer performance. Unfortunately, this actually causes a slowdown when there are many frees, since the CUDA routine that frees pinned memory needs to synchronize with the device. In 13.9 we made a new method the default and added a "-ta=nvidia,pin" flag to select the pinned-memory method (which does help in some cases).
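For example, a sketch of the two compile lines (assuming a source file named Test.c; flag spellings as documented for the 13.9 PGI compilers):

```shell
# 13.9 default data-transfer method:
pgcc -acc -ta=nvidia -Minfo=accel Test.c -o test

# Opt back into pinned host memory, which helps in some cases:
pgcc -acc -ta=nvidia,pin -Minfo=accel Test.c -o test
```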

Meanwhile, are the TPRs at http://www.pgroup.com/support/release_tprs_2013.htm problems that have been fixed, not problems that were found?

No, as I said above, the problem is not fixed in a release, but I hope it will be by 13.10.

  • Mat

Hi mkcolg,

Thank you very much.
I will test 13.1 and report back. :)

Thanks again for your patience.

Hi mkcolg,

Sorry to disturb you again,
but it seems wrong whatever the PGI compiler version is. I tested versions 12.10 and 13.1, and both of them report the same error as 13.8.

I had to write a CUDA kernel -.-|||, and that works well with a 60x speedup compared to the CPU version. The OpenACC version without the cache directive attains a 20x speedup.

Thanks for your hard work anyway.

Best Regards,
-luxuia

Hi luxuia,

FYI, it looks like this fix did make it into 13.9.

  • Mat

luxuia,

TPR 19395 - OACC: “unexpected store/load type” error when using the cache directive

is fixed in the current 14.3 release. Be sure to use CUDA 5.5 for this fix to work.

thanks,
dave