We are porting an existing OpenMP-based code to OpenACC in order to run it on the GPU.
The overall structure of the program is straightforward. However, in the inner loop of the code we call a function that was previously optimized using SSE4. Here is the function used at the core of the parallel for-loop:
I am wondering if anyone can give us some pointers on how to convert this SSE-based function to OpenACC. Here are some questions:
Can I directly invoke SSE calls in an OpenACC kernel? (Again, the goal is to run this on the GPU.)
If SSE instructions are not supported, is there a float4 class that is supported by the PGI compiler?
If a float4 class is not supported, can I simply serialize each SSE4 call into 4 separate component-wise operations? Does that ruin my efficiency, or will the compiler automatically group them into short vector operations when running on the GPU (NVIDIA and/or AMD)?
Can I directly invoke SSE calls in an OpenACC kernel? (Again, the goal is to run this on the GPU.)
No. SSE intrinsics are only supported on x86 architectures.
Personally, I would not recommend using vector intrinsics: they limit your portability (to Power, ARM, etc.), they are harder to maintain (AVX, AVX-512), and most compilers are good at auto-vectorization, so they aren’t really needed.
If SSE instructions are not supported, is there a float4 class that is supported by the PGI compiler?
float4 isn’t an intrinsic type, but you can use your own struct in OpenACC code. Not sure you’d want to code it this way, but you could.
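To illustrate the "use your own struct" option, here is a minimal sketch. The names `vec4`, `vec4_axpy`, and `vec4_axpy_array` are made up for this example and are not part of OpenACC or the PGI compiler; the data clauses assume plain contiguous arrays.

```c
#include <assert.h>

/* Hypothetical 4-component vector type (the name vec4 is an assumption). */
typedef struct { float x, y, z, w; } vec4;

/* Component-wise s*a + b; "routine seq" lets device code call it. */
#pragma acc routine seq
static vec4 vec4_axpy(float s, vec4 a, vec4 b) {
    vec4 r = { s * a.x + b.x, s * a.y + b.y, s * a.z + b.z, s * a.w + b.w };
    return r;
}

/* out[i] = s * a[i] + b[i] for n vec4 elements, one element per iteration. */
void vec4_axpy_array(int n, float s, const vec4 *a, const vec4 *b, vec4 *out) {
    assert(n >= 0);  /* host-side sanity check */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i)
        out[i] = vec4_axpy(s, a[i], b[i]);
}
```

Without an OpenACC-enabled compiler the pragmas are ignored and the same code runs serially on the host, which makes it easy to check results against a reference.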
If a float4 class is not supported, can I simply serialize each SSE4 call into 4 separate component-wise operations? Does that ruin my efficiency, or will the compiler automatically group them into short vector operations when running on the GPU (NVIDIA and/or AMD)?
While not technically correct, I tend to think of a GPU as a very large vector processor. The vector length should be a minimum of 32 (i.e. one warp) but better at 128 up to 1024. 4 would be rather small.
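As a rough sketch of what "vector length 128" looks like in OpenACC, the loop nest below maps the outer loop to gangs and the inner loop to vector lanes. The function `row_sums` and its row-major matrix layout are assumptions chosen for illustration, and 128 is just one value in the range mentioned above.

```c
#include <assert.h>

/* Sum each row of a rows x cols row-major matrix.
 * Outer loop: one gang per row. Inner loop: spread across 128 vector lanes. */
void row_sums(int rows, int cols, const float *m, float *sums) {
    assert(rows >= 0 && cols >= 0);  /* host-side sanity check */
    #pragma acc parallel loop gang vector_length(128) \
        copyin(m[0:rows*cols]) copyout(sums[0:rows])
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        #pragma acc loop vector reduction(+:sum)
        for (int c = 0; c < cols; ++c)
            sum += m[r * cols + c];
        sums[r] = sum;
    }
}
```

The point is that the vector dimension comes from a loop with many iterations, not from a 4-wide data type.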
Do you have a version of your code that uses basic parallel loops like a reference version? If so, I’d start there.
I guess different people have different preferences. I generally like to write in short vector form where it is supported (as in OpenCL), because it makes the code shorter and easier to maintain (as long as auto-vectorization expands it and packs adjacent instructions), but I hate SSE because it is totally unreadable.
As an example, here is the CUDA version of a core function (CUDA does not support short-vector arithmetic):
And here is the OpenCL version (where I can use float4 intrinsics):
The OpenCL version is easier to read, and it also has portable performance on both CPU and GPU.
While not technically correct, I tend to think of a GPU as a very large vector processor. The vector length should be a minimum of 32 (i.e. one warp) but better at 128 up to 1024. 4 would be rather small.
The parallelism of my code largely comes from the SIMT nature of Monte Carlo simulations. It does show non-ideal warp divergence (~62%), but I think this is limited by the randomness of the MC method itself.
Within a thread, I rely on the compiler’s auto-vectorization to pack adjacent code to utilize the vector resources. I found that some compilers do a better job (like CUDA) than others in grouping instructions. For example, if I expand float3 into 3 sequential component-wise instructions, Intel OCL will fail to vectorize it unless I write it in float4 form with an extra dummy element.
I guess the take-home message I heard here is that PGI’s OpenACC supports auto-vectorization. In that case, I should have no hesitation in expanding the SSE lines into component-wise commands. It does make the code even harder to read, but I hope it won’t take a performance hit.
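For concreteness, here is a sketch of what such an expansion might look like. The SSE line in the comment is a hypothetical stand-in (the actual function is not reproduced here), and `advance`, `advance_all`, and the 4-floats-per-particle layout are assumptions.

```c
#include <assert.h>

/* Scalar replacement for a hypothetical SSE line such as
 *   p = _mm_add_ps(p, _mm_mul_ps(v, _mm_set1_ps(len)));
 * written per component so it compiles for any target. */
#pragma acc routine seq
static void advance(float *p, const float *v, float len) {
    for (int k = 0; k < 4; ++k)
        p[k] += v[k] * len;
}

/* Advance n particles; each particle stores 4 floats in pos and dir.
 * Parallelism comes from the loop over particles, not from the 4 components. */
void advance_all(int n, float *pos, const float *dir, const float *len) {
    assert(n >= 0);  /* host-side sanity check */
    #pragma acc parallel loop copy(pos[0:4*n]) copyin(dir[0:4*n], len[0:n])
    for (int i = 0; i < n; ++i)
        advance(&pos[4 * i], &dir[4 * i], len[i]);
}
```

Whether this matches the SSE version's performance would need to be measured; the sketch only shows the shape of the rewrite.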
Do you have a version of your code that uses basic parallel loops like a reference version? If so, I’d start there.