Converting SSE+OpenMP code to OpenACC

FangQ · May 3, 2018, 5:50pm

hi forum,

we are working on porting an existing OpenMP-based code to OpenACC in order to run it on the GPU.

the overall structure of the program is straightforward. however, in the inner loop of the code we call a function that was previously optimized using SSE4; this is the function at the core of the parallel for-loop:

https://github.com/fangq/mmc/blob/master/src/tettracing.c#L822-L1032

the main openmp loop is at

https://github.com/fangq/mmc/blob/master/src/mmc_host.c#L189-L213

        *******************************************************************************/

        /** \subsection ssimu Parallel photon transport simulation */

        #pragma omp parallel private(ran0,ran1,threadid,j)
        {
            visitor visit = {0.f, 0.f, 1.f / cfg->tstep, DET_PHOTON_BUF, 0, 0, NULL, NULL, NULL, NULL, NULL, NULL};
            size_t id;

    #ifdef _OPENMP
            unsigned int threadnum = omp_get_num_threads();
    #else
            unsigned int threadnum = 1;
    #endif

            #pragma omp master
            {
                seeds = (unsigned int*)malloc(sizeof(int) * threadnum * RAND_SEED_WORD_LEN);
                srand(cfg->seed);

    (excerpt truncated)

I am wondering if anyone can give us some pointers on how to convert this SSE-based function to OpenACC. here are some questions:

can I directly invoke SSE calls in an OpenACC kernel? (again, the goal is to run this on the GPU)
if SSE instructions are not supported, is there a float4 class that is supported by PGI compiler?
if a float4 class is not supported, can I simply serialize each SSE4 call into 4 separate component-wise operations? would that ruin my efficiency, or will the compiler automatically group them into short vector operations when running on the GPU (NVIDIA and/or AMD)?

thanks, appreciate your inputs

MatColgrove · May 3, 2018, 9:10pm

Hi FangQ,

can I directly invoke SSE calls in an OpenACC kernel? (again, the goal is to run this on the GPU)

No. SSE intrinsics are only supported on x86 architectures.

Personally, I would not recommend using vector intrinsics. They limit your portability (to Power, ARM, etc.), they are harder to maintain (e.g. across AVX and AVX-512), and since most compilers are good at auto-vectorization they aren’t really needed.

if SSE instructions are not supported, is there a float4 class that is supported by PGI compiler?

float4 isn’t an intrinsic type, but you can use your own struct in OpenACC code. Not sure you’d want to code it this way, but you could.
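As a rough illustration of the "own struct" route, here is a minimal C sketch. The `float4` typedef and the `float4_fma` and `float4_fma_array` names are hypothetical, made up for this example; they are not part of PGI or OpenACC. Only the `routine seq` and `parallel loop` directives with their data clauses are standard OpenACC.

```c
/* Hypothetical stand-in for a float4 type; not provided by PGI or OpenACC. */
typedef struct {
    float x, y, z, w;
} float4;

/* Component-wise a*b + c. "routine seq" asks the compiler to build a
   device version that each GPU thread can call sequentially. */
#pragma acc routine seq
static inline float4 float4_fma(float4 a, float4 b, float4 c) {
    float4 r = { a.x * b.x + c.x, a.y * b.y + c.y,
                 a.z * b.z + c.z, a.w * b.w + c.w };
    return r;
}

/* Apply float4_fma to n independent elements, one per loop iteration. */
void float4_fma_array(int n, const float4 *a, const float4 *b,
                      const float4 *c, float4 *out) {
    #pragma acc parallel loop copyin(a[0:n], b[0:n], c[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = float4_fma(a[i], b[i], c[i]);
    }
}
```

Built without OpenACC support, the pragmas are ignored and the loop simply runs serially on the host, which makes it easy to check results against a CPU reference.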

if a float4 class is not supported, can I simply serialize each SSE4 call into 4 separate component-wise operations? would that ruin my efficiency, or will the compiler automatically group them into short vector operations when running on the GPU (NVIDIA and/or AMD)?

While not technically correct, I tend to think of a GPU as a very large vector processor. The vector length should be a minimum of 32 (i.e. one warp) but better at 128 up to 1024. 4 would be rather small.
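To make the vector-length point concrete, here is a small C sketch of a loop whose iterations are spread across gangs and vector lanes, with the vector length set to 128 as suggested above. The function names `scale_and_shift` and `process_all` and the per-element work are placeholders invented for this example; in a real port each iteration would be one independent unit of work.

```c
/* Hypothetical per-element work; stands in for one independent task. */
#pragma acc routine seq
static inline float scale_and_shift(float v) {
    return 2.0f * v + 1.0f;
}

/* One loop iteration per element. The iterations are distributed over
   gangs and over vector lanes of length 128 (four warps on NVIDIA GPUs). */
void process_all(int n, const float *in, float *out) {
    #pragma acc parallel loop gang vector vector_length(128) \
        copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; i++) {
        out[i] = scale_and_shift(in[i]);
    }
}
```

The point of the sketch is that the "vector" here is the outer loop over many independent elements, not a 4-wide operation inside one iteration.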

Do you have a version of your code that uses basic parallel loops like a reference version? If so, I’d start there.

-Mat

FangQ · May 4, 2018, 2:37am

hi Mat, thanks for the comments.

I guess different people have different preferences. I generally like to write in short-vector form, if supported (like in OpenCL), because it makes the code shorter and easier to maintain (as long as auto-vectorization expands it and packs adjacent instructions), but I hate SSE because it is totally unreadable.

as an example, here is the CUDA version of a core function (CUDA has float4 types but no built-in short-vector arithmetic)

https://github.com/fangq/mcx/blob/master/src/mcx_core.cu#L276-L301

     * This function represents a rotation in the clockwise direction with respect
     * to an observer looking into the direction of the photon propagation.
     *
     * @param[in] s: input Stokes parameter
     * @param[in] phi: rotation angle in radians
     * @param[out] s2: output Stokes parameter
     */

    __device__ inline void rotsphi(Stokes* s, float phi, Stokes* s2) {
        float sin2phi, cos2phi;
        sincosf(2.f * phi, &sin2phi, &cos2phi);

        s2->i = s->i;
        s2->q = s->q * cos2phi + s->u * sin2phi;
        s2->u = -s->q * sin2phi + s->u * cos2phi;
        s2->v = s->v;
    }

    /**
     * @brief Update Stokes vector after a scattering event

    (excerpt truncated)

and here is the OpenCL version (where I can use float4 intrinsics)

https://github.com/fangq/mcxcl/blob/master/src/mcx_core.cl#L315-L330

    // OpenCL float atomicadd hack:
    // http://suhorukov.blogspot.co.uk/2011/12/opencl-11-atomic-operations-on-floating.html
    // https://devtalk.nvidia.com/default/topic/458062/atomicadd-float-float-atomicmul-float-float-/

    inline float atomicadd(volatile __global float* address, const float value) {
        float old = value;

        while ((old = atomic_xchg(address, atomic_xchg(address, 0.0f) + old)) != 0.0f);

        return old;
    }
    #endif

    void clearpath(__local float* p, uint maxmediatype) {
        uint i;

    (excerpt truncated)

the OpenCL version is easier to read, and also has portable performance on both CPU and GPU.

FangQ · May 4, 2018, 2:38am

While not technically correct, I tend to think of a GPU as a very large vector processor. The vector length should be a minimum of 32 (i.e. one warp) but better at 128 up to 1024. 4 would be rather small.

the parallelism of my code largely comes from the SIMT nature of Monte Carlo simulations. it does show a non-ideal warp divergence (~62%) but I think this is limited by the randomness of the MC method itself.

within a thread, I rely on the compiler’s auto-vectorization to pack adjacent code to utilize the vector resources. I found that some compilers do a better job (like CUDA) than others in grouping instructions. For example, if I expand float3 into 3 sequential component-wise instructions, Intel OCL will fail to vectorize it, unless I write it in float4 form (with an extra dummy element).

I guess the take-home message I heard here is that PGI’s OpenACC supports auto-vectorization. In that case, I should have no hesitation to expand the SSE lines into component-wise commands. it does make the code even harder to read, but I hope it won’t take a performance hit.
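For reference, expanding one SSE line into component-wise form could look like the C sketch below. The `madd4` helper and the choice of a multiply-add as the example operation are assumptions made for illustration; they are not taken from the mmc source. The SSE line is shown only as a comment so the code stays portable.

```c
/* SSE original (x86 only), shown for comparison:
 *   __m128 r = _mm_add_ps(_mm_mul_ps(a, b), c);
 */

/* Portable form (hypothetical helper): the same multiply-add written
   per component, callable from inside an OpenACC compute region. */
#pragma acc routine seq
static inline void madd4(const float a[4], const float b[4],
                         const float c[4], float r[4]) {
    for (int k = 0; k < 4; k++) {
        r[k] = a[k] * b[k] + c[k];
    }
}
```

Keeping the four components in a short fixed-length loop, rather than writing four separate statements, keeps the expanded code closer in length to the SSE version. Whether a given compiler unrolls or packs that loop is something to verify from its compilation feedback rather than assume.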

Do you have a version of your code that uses basic parallel loops like a reference version? If so, I’d start there.

yes, see the link above, or here

https://github.com/fangq/mmc/blob/master/src/mmc_host.c#L189-L213


      