We are porting an existing OpenMP-based code to OpenACC so that it can run on the GPU.
The overall structure of the program is straightforward. However, the inner loop calls a function that was previously optimized with SSE4; see the function used at the core of the parallel for-loop:
The main OpenMP loop is at
Can anyone give us some pointers on how to convert this SSE-based function to OpenACC? Here are our specific questions:
Can I directly invoke SSE intrinsics inside an OpenACC kernel? (Again, the goal is to run this on the GPU.)
If SSE instructions are not supported, is there a float4 class that is supported by the PGI compiler?
If a float4 class is not supported, can I simply serialize each SSE4 call into 4 separate component-wise operations? Would that hurt efficiency, or will the compiler automatically group them into short vector operations when running on the GPU (NVIDIA and/or AMD)?
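To make the last question concrete, here is a minimal sketch of what we mean by component-wise serialization. It is not our actual function: the `vec4f` struct and the `vec4f_madd` and `madd_all` names are hypothetical, `vec4f` is a hand-written struct rather than a PGI-provided type, and the multiply-add only stands in for an SSE sequence such as `_mm_add_ps(_mm_mul_ps(a, b), c)`.

```c
/* Hypothetical stand-in for one 4-wide SSE value (__m128).
   This is a plain struct written for illustration, not a compiler type. */
typedef struct { float x, y, z, w; } vec4f;

/* Component-wise replacement for _mm_add_ps(_mm_mul_ps(a, b), c).
   "routine seq" asks the OpenACC compiler to build a sequential
   device version that can be called from inside a parallel loop. */
#pragma acc routine seq
static vec4f vec4f_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f r;
    r.x = a.x * b.x + c.x;
    r.y = a.y * b.y + c.y;
    r.z = a.z * b.z + c.z;
    r.w = a.w * b.w + c.w;
    return r;
}

/* Hypothetical outer loop: each iteration handles one 4-component value.
   The data clauses copy the inputs to the device and the result back. */
void madd_all(int n, const vec4f *a, const vec4f *b,
              const vec4f *c, vec4f *out)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n], c[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = vec4f_madd(a[i], b[i], c[i]);
    }
}
```

Without OpenACC enabled the pragmas are ignored and this builds as ordinary serial C, which at least lets us check the results against the SSE version. What we do not know is how well this form maps onto the GPU.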
Thanks, we appreciate your input.