Scatter operation, how to improve performance

Dear Nvidia user,

I’m trying to improve performance of the following scatter operations:

 static void scatter_double(
     T *restrict out, const unsigned out_stride,
     const T *restrict in, const unsigned in_stride,
     const uint *restrict map, int dstride, int mf_nt, int *mapf,
     int vn, int m_size, int acc)
 {
   uint i, j, k, dstride_in = 1, dstride_out = 1;
   if (in_stride == 1)  dstride_in  = dstride;
   if (out_stride == 1) dstride_out = dstride;
   for (k = 0; k < vn; ++k) {
     #pragma omp target teams distribute parallel for map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)
     for (i = 0; i < mf_nt; i++) {
       T t = in[in_stride*map[mapf[i*2]] + k*dstride_in];
       for (j = 0; j < mapf[i*2+1]; j++) {
         out[out_stride*map[mapf[i*2]+j+1] + k*dstride_out] = t;
       }
     }
   }
 }

According to an ncu analysis, this code uses only 4.37% of the GPU’s computational resources, so I think there is room to improve performance. ncu reports that the kernel grid is too small. I tried TEAMS LOOP with no performance improvement. Any suggestions? I think the major problem is the striding. Thanks.



What’s “mf_nt”? If it’s small, then this would explain why there’s not enough work to utilize the GPU. To fix, you’d need a bigger size.

Also, why aren’t you parallelizing the k loop? Given the index look-up arrays I can’t be sure whether it’s independent, but assuming it is, I’d probably try something like the following to add more parallelism.

    #pragma omp target teams loop bind(teams) map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)
    for (i = 0; i < mf_nt; i++) {
      int mapfi2  = mapf[i*2];
      int mapfi21 = mapf[i*2+1];
      #pragma omp loop bind(parallel)
      for (k = 0; k < vn; ++k) {
        T t = in[in_stride*map[mapfi2] + k*dstride_in];
        for (j = 0; j < mapfi21; j++) {
          out[out_stride*map[mapfi2+j+1] + k*dstride_out] = t;
        }
      }
    }

Hi Mat,

sorry for the delay in replying, but I was on vacation.

What’s “mf_nt”? If it’s small, then this would explain why there’s not enough work to utilize the GPU. To fix, you’d need a bigger size.

Typical values of mf_nt range from 7478 to 1135196.

Also, why aren’t you parallelizing the k loop? Given the index look-up arrays I can’t be sure whether it’s independent, but assuming it is, I’d probably try something like the following to add more parallelism.

I tried your solution. The problem is that the code is inside a macro that expands it; this is the real code:

 #define DEFINE_SCATTER(T) \
   static void scatter_##T( \
     T *restrict out, const unsigned out_stride, \
     const T *restrict in, const unsigned in_stride, \
     const uint *restrict map, int dstride, int mf_nt, int *mapf, \
     int vn, int m_size, int acc) \
 { \
   uint i, j, k, dstride_in = 1, dstride_out = 1; \
   if (in_stride == 1)  dstride_in  = dstride; \
   if (out_stride == 1) dstride_out = dstride; \
   _Pragma("omp target teams distribute parallel for map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)") \
   for (i = 0; i < mf_nt; i++) { \
     int mapfi2  = mapf[i*2]; \
     int mapfi21 = mapf[i*2+1]; \
     _Pragma("omp loop bind(parallel) if(acc)") \
     for (k = 0; k < vn; ++k) { \
       T t = in[in_stride*map[mapfi2] + k*dstride_in]; \
       for (j = 0; j < mapfi21; j++) { \
         out[out_stride*map[mapfi2+j+1] + k*dstride_out] = t; \
       } \
     } \
   } \
 }

The code with just the first pragma compiles and runs fine (with no performance improvement). Adding the second pragma does not compile, giving an error about another part of the code in a different file:

 "/p/scratch/prcoe05/fatigati1/nek5000/nek5000_omp_offload/core/partitioner.c", line 1: catastrophic error: cannot open source file "gslib.h"
  #include "gslib.h"

I think the problem lies in the macro generation. I don’t know the mechanism behind these macros or why they are defined this way. Could you maybe substitute this piece of code in the large code I sent you some time ago? The file is jl_omp/gs_local.c:94.

The inner loop will be ignored since you’ve already bound “parallel” to the outer loop. As I suggested above, change the outer directive to “omp target teams loop bind(teams) map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)”.

Or if you really do want to use “distribute” instead of “loop”, then you’d use:

outer:
“omp target teams distribute map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)”

inner:
“omp parallel for if(acc)”
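
Put into the routine, that variant would look roughly like this (a sketch only, written out for double, assuming as before that the k iterations are independent, and assuming in/out are already resident on the device, e.g. via an enclosing target data region or managed memory):

    typedef unsigned int uint;

    /* Sketch: "distribute" on the outer index loop, "parallel for" on the
       inner k loop. in/out are assumed to be device-resident already. */
    static void scatter_double(
        double *restrict out, const unsigned out_stride,
        const double *restrict in, const unsigned in_stride,
        const uint *restrict map, int dstride, int mf_nt, int *mapf,
        int vn, int m_size, int acc)
    {
      int dstride_in = 1, dstride_out = 1;
      if (in_stride == 1)  dstride_in  = dstride;
      if (out_stride == 1) dstride_out = dstride;

      /* Outer index loop distributed across teams. */
      #pragma omp target teams distribute map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)
      for (int i = 0; i < mf_nt; i++) {
        int mapfi2  = mapf[i*2];      /* start of this group in map[]  */
        int mapfi21 = mapf[i*2+1];    /* number of destination entries */
        /* Inner k loop parallelized across the threads of each team. */
        #pragma omp parallel for if(acc)
        for (int k = 0; k < vn; ++k) {
          double t = in[in_stride*map[mapfi2] + k*dstride_in];
          for (int j = 0; j < mapfi21; j++)
            out[out_stride*map[mapfi2+j+1] + k*dstride_out] = t;
        }
      }
    }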

As for “gslib.h”, it is part of your source:

nek5000_omp_offload% find ./ -name "gslib.h"
./jl_omp/gslib.h
./3rd_party/gslib/include/gslib.h
./3rd_party/gslib/gslib/src/gslib.h

I don’t know which one you’re supposed to use, but your command line is most likely missing the “-I<path_to_gslib_dir>/” flag.

-Mat

The inner loop will be ignored since you’ve already bound “parallel” to the outer loop. As I suggested above, change the outer directive to “omp target teams loop bind(teams) map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)”.

Hi Mat, using:

omp target teams loop bind(teams) map(tofrom:map[0:m_size],mapf[0:2*mf_nt]) if(acc)

I have the following error:

Fatal error: expression ‘HX_CU_CALL_CHECK(p_cuStreamSynchronize(stream[dev]))’ (value 1) is not equal to expression ‘HX_SUCCESS’ (value 0)

I have never seen such an error before.

Ok, that means the kernel failed for some reason. The error is generic, so there’s no way of knowing exactly what went wrong without additional analysis.

Given that “loop” is relatively new, it’s possible there’s a compiler issue. Though it could also be that, now that the inner loop is being parallelized, the error is occurring there. I’d need a reproducer to investigate. Preferably a small reproducer, but if you can get me the changes you made, I can apply them to the full app as well.
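
For reference, a bare-bones standalone driver roughly along the following lines would be enough for me to start with; everything in it (sizes, strides, and index data) is a synthetic placeholder rather than anything taken from your app:

    /* Hypothetical standalone reproducer for the scatter kernel with the
       "teams loop" directives. Data layout: mapf holds (start, count) pairs
       into map; each group scatters one source value to "count" destinations. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef unsigned int uint;

    static void scatter_double(
        double *restrict out, const unsigned out_stride,
        const double *restrict in, const unsigned in_stride,
        const uint *restrict map, int dstride, int mf_nt, int *mapf,
        int vn, int m_size, int acc)
    {
      int dstride_in = 1, dstride_out = 1;
      if (in_stride == 1)  dstride_in  = dstride;
      if (out_stride == 1) dstride_out = dstride;

      /* in/out are mapped explicitly here because this toy driver has no
         enclosing target data region. */
      #pragma omp target teams loop bind(teams) \
          map(to: in[0:dstride*vn], map[0:m_size], mapf[0:2*mf_nt]) \
          map(tofrom: out[0:dstride*vn]) if(acc)
      for (int i = 0; i < mf_nt; i++) {
        int base  = mapf[i*2];      /* start of this group in map[]      */
        int count = mapf[i*2+1];    /* number of destinations to scatter */
        #pragma omp loop bind(parallel)
        for (int k = 0; k < vn; ++k) {
          double t = in[in_stride*map[base] + k*dstride_in];
          for (int j = 0; j < count; j++)
            out[out_stride*map[base+j+1] + k*dstride_out] = t;
        }
      }
    }

    int main(void)
    {
      /* Synthetic problem: mf_nt groups, each with 1 source and 2 destinations. */
      const int mf_nt = 100000, vn = 4, group = 3;
      const int m_size = group*mf_nt, dstride = m_size;
      const unsigned in_stride = 1, out_stride = 1;

      uint   *map  = malloc(sizeof(uint)*m_size);
      int    *mapf = malloc(sizeof(int)*2*mf_nt);
      double *in   = malloc(sizeof(double)*(size_t)dstride*vn);
      double *out  = calloc((size_t)dstride*vn, sizeof(double));

      for (int i = 0; i < mf_nt; i++) {
        mapf[2*i]   = group*i;   /* base index into map[] */
        mapf[2*i+1] = 2;         /* two destinations      */
        for (int j = 0; j < group; j++) map[group*i + j] = group*i + j;
      }
      for (size_t n = 0; n < (size_t)dstride*vn; n++) in[n] = (double)n;

      scatter_double(out, out_stride, in, in_stride, map, dstride,
                     mf_nt, mapf, vn, m_size, /*acc=*/1);

      printf("out[%d] = %g (expect %g)\n", 1 + dstride, out[1 + dstride], in[dstride]);
      free(map); free(mapf); free(in); free(out);
      return 0;
    }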

Hi Mat,

a small example is quite hard to produce. In the meantime, I’ll send you everything you need to run a test. Thanks in advance.