Clause 'Worker(value)' not allowed in 'Parallel Loop' directive

Hi,

The differences between the ‘kernels’ and ‘parallel’ directives are still not clear to me, so I’m trying both.

I had the impression that with the ‘parallel’ one I would have more control, but if I try to specify the number of gangs/workers/vectors I get the error:

PGF90-S-0533-Clause ‘Worker(value)’ not allowed in ‘Parallel Loop’ directive .

If I change it to ‘kernels’, as in the following, there is no problem. Is there any reason why I cannot do this within the ‘parallel’ region?


36    !$acc kernels present(zc)   
37    !$acc loop gang(9) collapse(2)    
38    do k=kmin,kmax     
39       do kp=kmin,kmax 
40          k2=2*k  
41          km = MIN(k,kp)     
42          kp2=2*kp   
43          z0=3.d0*dble(ju2+1)*dsqrt(dble(k2+1))*dsqrt(dble(kp2+1))  
44          do q=-km,km    
45             q2=2*q    
46             
47             ! Calculate quantity C and its sum over magnetic quantum numbers               
48             !$acc loop worker(16) collapse(2)    
49             do mu2=-ju2,ju2,2     
50                do ml2=-jl2,jl2,2         
51                   p2=mu2-ml2          
52                   if(abs(p2).gt.2) cycle  
53                   z1=w3js(ju2,jl2,2,mu2,-ml2,-p2)  
54                   !$acc loop vector(32)

Another question. With the above code, the compiler gives the following info:

zcs:         
     36, Generating present(zc(:,:,:,:,:,:,:))     
     38, Loop is parallelizable   
     39, Loop is parallelizable      
            Accelerator kernel generated   
            Generating Tesla code              
            38, !$acc loop gang(9) collapse(2) ! blockidx%x      
            39,   ! blockidx%x collapsed       
            44, !$acc loop seq      
            49, !$acc loop worker(16) collapse(2) ! threadidx%y     
            50,   ! threadidx%y collapsed   
            55, !$acc loop vector(32) ! threadidx%x    
            60, !$acc loop seq             
     44, Loop carried dependence of zc prevents parallelization
            Loop carried backward dependence of zc prevents vectorization     
     49, Loop is parallelizable      
     50, Loop is parallelizable  
     55, Loop is parallelizable     
     60, Loop is parallelizable

And again, I’m not sure how to interpret the output. The compiler says “Loop is parallelizable” for all the DO loops. I understand that this is only the analysis stage, so it says that it could be parallelized, but it doesn’t mean that it has generated parallel code.

The loops at lines 38-39 have been parallelized correctly, as it says “Accelerator kernel generated” and “Generating Tesla code”. Within the output for line 39, it also tells me that the loops at lines 44 and 60 will run sequentially, and that the compiler was able to generate worker parallelism for loops 49-50 and vector parallelism for loop 55.

But how do I interpret the rest of the lines?

44, Loop carried dependence of zc prevents parallelization

For some reason it thinks that there is a dependence on zc and cannot parallelize this loop. That’s OK as long as the other loops where I put an !$acc loop get parallelized, since I know that there are no dependencies. Is this message warning that no parallelization will be done at this and lower levels or only for this particular loop?

49,50,55,60 Loop is parallelizable. So no parallel code was generated for these loops? Within the info for loop 39 it looks like it was generated, but these messages confuse me.

Sorry to nitpick, but I want to understand as best as possible all the information that the compiler provides, in order to aim for the best possible performance.

Thanks,
AdV

Hi AdV,

Any reason why I cannot do it within the ‘parallel’ region?

One of the differences between “kernels” and “parallel” is that with “parallel”, you are creating a single parallel region (i.e. a single CUDA kernel) and hence the number of workers and the vector length must be the same for all worker and vector loops. With “kernels”, the compiler may create multiple parallel regions (i.e. one or more CUDA kernels) and hence the number of workers and the vector length may vary from loop to loop.

Because setting the width with the “worker” and “vector” clauses is done on a per-loop basis, the width argument is not allowed within a “parallel” construct.

Instead, you should use the “num_workers(N)” and “vector_length(N)” clauses on the “parallel” directive. As of the OpenACC 2.5 standard, “num_workers” and “vector_length” are also allowed on the “kernels” directive.
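
As a rough illustration of setting the sizes on the “parallel” directive, here is a minimal sketch. The program name, the array “a” and the loop extents are made up for the example and are not taken from your code; the sizes 9, 16 and 32 simply mirror the values in your post. The loop directives carry no width argument, and the sizes are given once for the whole region:

```fortran
program parallel_widths
  implicit none
  ! Illustrative extents only; not taken from the original code.
  integer, parameter :: n1 = 9, n2 = 16, n3 = 32
  real(8) :: a(n3, n2, n1)
  integer :: i, j, k

  ! Sizes are set once, on the parallel directive, for the whole region.
  !$acc parallel num_gangs(9) num_workers(16) vector_length(32) copyout(a)
  !$acc loop gang
  do k = 1, n1
     ! No width argument on the worker and vector clauses here.
     !$acc loop worker
     do j = 1, n2
        !$acc loop vector
        do i = 1, n3
           a(i, j, k) = dble(i + j + k)
        end do
     end do
  end do
  !$acc end parallel

  print *, 'sum(a) = ', sum(a)
end program parallel_widths
```

Without OpenACC enabled the directives are treated as comments, so the same source can be built and checked on the host first.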

Is this message warning that no parallelization will be done at this and lower levels or only for this particular loop?

Only for this particular loop. The compiler can’t prove that there isn’t a dependency, so it must be conservative and not parallelize the loop. The problem is that the code calls a function (“MIN”) to determine the loop bounds (“-km,km”), so the compiler can’t determine whether all index values are unique. As you note, you can force parallelism by adding a “!$acc loop independent” directive.
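
For what it’s worth, a minimal sketch of the “independent” clause inside a “kernels” region might look like the following. The program name, the small two-dimensional “zc” and the bounds are hypothetical stand-ins, not your actual array. Note that “independent” is a promise from the programmer: if the iterations do in fact depend on each other, the results will be wrong.

```fortran
program independent_demo
  implicit none
  ! Hypothetical sizes and a 2-D stand-in for zc; not the original array.
  integer, parameter :: kmax = 4
  real(8) :: zc(-kmax:kmax, 0:kmax)
  integer :: k, q, km

  zc = 0.d0

  !$acc kernels copy(zc)
  !$acc loop independent gang private(km)
  do k = 0, kmax
     ! Bounds of the inner loop come from MIN, as in the original loop nest.
     km = min(k, kmax)
     ! "independent" asserts that no two iterations touch the same element.
     !$acc loop independent vector
     do q = -km, km
        zc(q, k) = dble(q * k)
     end do
  end do
  !$acc end kernels

  print *, 'sum(zc) = ', sum(zc)
end program independent_demo
```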

49,50,55,60 Loop is parallelizable. So no parallel code was generated for these loops?

The loops at lines 49 and 50 were collapsed and parallelized using workers.

49, !$acc loop worker(16) collapse(2) ! threadidx%y
50, ! threadidx%y collapsed

Line 55 was parallelized using vectors:

55, !$acc loop vector(32) ! threadidx%x

Line 60 is parallelizable, but the three parallel levels (gang, worker, vector) have already been used, so it must be run sequentially.
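
To make that last point concrete, here is a small sketch of a four-deep loop nest in which the three levels are taken by the outer loops and the innermost loop is marked “seq” explicitly. All names and sizes are invented for the example, and the loop bodies are placeholders:

```fortran
program four_levels
  implicit none
  ! Invented sizes and names, for illustration only.
  integer, parameter :: n = 8
  real(8) :: a(n, n, n), s
  integer :: i, j, k, m

  !$acc parallel num_workers(4) vector_length(32) copyout(a)
  !$acc loop gang
  do k = 1, n
     !$acc loop worker
     do j = 1, n
        !$acc loop vector private(s)
        do i = 1, n
           s = 0.d0
           ! Gang, worker and vector are already in use above,
           ! so this fourth loop runs sequentially within each thread.
           !$acc loop seq
           do m = 1, n
              s = s + dble(m)
           end do
           a(i, j, k) = s
        end do
     end do
  end do
  !$acc end parallel

  print *, 'a(1,1,1) = ', a(1, 1, 1)
end program four_levels
```

If the innermost loop held most of the work, one option would be to restructure the nest (for example with “collapse” on the outer loops) so that a parallel level is left over for it.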

-Mat

Great, many thanks for the answer.

AdV