Thanks for putting this together. We very much appreciate your efforts. Hopefully, I can answer some of your questions.
The switch -Mmakedll=export_all is actually not supported by PGI which makes this whole setup doubly unsupported! However, I couldn’t find a way to export the required symbols without modifying the mex source code so I lived with it. Maybe I’ll figure out a better way in the future.
The “export_all” option works in many cases but not all, which is why it’s not officially supported. If it works, great; otherwise users need to decorate the symbols they want exported with “dllexport”.
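For anyone who hits a case where “export_all” fails, here is a minimal sketch of what decorating a symbol looks like in C. The macro name EXPORT_SYMBOL and the function scale_array are made up for illustration; in a real MEX build the symbol to export would be the MEX entry point, and the exact spelling your toolchain accepts should be checked against its documentation.

```c
/* Hypothetical export macro: expands to __declspec(dllexport) on Windows
   and to nothing elsewhere, so the same source builds on either platform. */
#if defined(_WIN32)
#  define EXPORT_SYMBOL __declspec(dllexport)
#else
#  define EXPORT_SYMBOL
#endif

/* scale_array is an illustrative stand-in for whatever function the DLL
   has to expose; it multiplies each element of data by factor in place. */
EXPORT_SYMBOL void scale_array(double *data, int n, double factor)
{
    for (int i = 0; i < n; ++i) {
        data[i] *= factor;
    }
}
```

Only the decorated functions end up in the DLL's export table, which is why this route needs source changes while “export_all” does not.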
Figure out the best set of compiler switches to use. It is almost certain that what I’m using now is sub-optimal since I am new to the PGI compiler.
We recommend starting with “-fast” since it generally gives the best performance. It’s really an aggregate of many flags, adjusted for the particular target architecture. The command “pgfortran -help -fast” lists the flags being used. You can try enabling and disabling each flag to see how it affects performance, but given the simplicity of the code, you probably won’t see much difference with some of them. The exception is auto-vectorization (-Mvect), which I would expect to help (more on this later).
FYI, I was going to suggest using -Msafeptr or the restrict keyword but you figured that one out already!
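For other readers, a small sketch of the restrict approach in C99. The function add_offset is a made-up example, not the code from the original post. The qualifier is a promise from the programmer, so it is only safe when the two arrays really do not overlap.

```c
#include <stddef.h>

/* The restrict qualifiers tell the compiler that in and out never overlap.
   Without that promise it must assume a store to out[i] could change a
   later in[j], which can prevent the loop from being vectorized. */
void add_offset(const double *restrict in, double *restrict out,
                size_t n, double offset)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] + offset;
    }
}
```

-Msafeptr makes a similar assertion from the command line instead of in the source, so no code changes are needed, but the promise then covers more than one hand-picked pair of pointers.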
Figure out why the SSE version of this function is slower than the non-SSE version
I’d like you to try adding “-Mvect=simd:128”. By default we use 256-bit AVX on Sandy Bridge. However, our AVX vector SIN is written in 128-bit vector mode and called twice. I wonder if that slight overhead is getting magnified and causing the slowdown.
If that’s not it, try “-Mvect=noaltcode” to remove the altcode generation, which also adds a bit of overhead. Since the large data set will always fall into the vector code, there is no need for the extra overhead.
Get OpenMP support working. I tried using the -Mconcur compilation flag, which auto-parallelised the loop, but it crashed MATLAB when I ran it. This needs investigating.
I’m not sure about this one. It’s possible there is some conflict in the underlying threading. Does using an explicit OpenMP pragma exhibit the same behavior?
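To be concrete about what I mean by an explicit pragma, here is a minimal sketch in C. The function square_all and its loop body are placeholders rather than the code from the post. It needs to be built with OpenMP enabled (-mp with the PGI compilers); without that the pragma is ignored and the loop runs serially.

```c
/* Element-wise loop parallelized by an explicit OpenMP directive instead
   of relying on -Mconcur auto-parallelization. Each iteration touches only
   its own element, so the iterations can safely be split across threads. */
void square_all(const double *restrict in, double *restrict out, int n)
{
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; ++i) {
        out[i] = in[i] * in[i];
    }
}
```

If this version also crashes MATLAB, that points at the threading runtime rather than at the auto-parallelizer.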
Get PGI accelerator support working so I can offload work to the GPU.
This will be interesting. Compute regions should work fine. The difficult part (performance-wise) will be if you have small problem sizes or need to share data from call to call. That said, the new OpenACC “present” clause might be able to help here. You’ll be blazing trails!
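A rough sketch of the idea in C, assuming OpenACC directives; scale_on_device and scale_twice are made-up names. The outer data region moves the array to the device once, and “present” on the inner compute region tells the compiler the data is already there, so no transfer happens per call. This only covers calls made inside one data region; keeping data resident across separate MEX invocations is a harder problem that this sketch does not address.

```c
/* Inner routine: present asserts that x already lives on the device,
   so entering this compute region does not trigger a host-device copy. */
void scale_on_device(double *x, int n, double factor)
{
    #pragma acc kernels loop present(x[0:n])
    for (int i = 0; i < n; ++i) {
        x[i] *= factor;
    }
}

/* Caller: a single data region spans both calls, so x is copied to the
   device on entry and back to the host on exit, once in each direction. */
void scale_twice(double *x, int n)
{
    #pragma acc data copy(x[0:n])
    {
        scale_on_device(x, n, 2.0);
        scale_on_device(x, n, 3.0);
    }
}
```

When built without accelerator support the directives are ignored and the code runs on the host with the same result.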
Figure out how to determine whether or not the compiler is emitting AVX instructions. The documentation suggests that if the compiler is called on a Sandy Bridge machine, and if vectorisation is possible, then it will produce AVX instructions, but AVX is not mentioned in the output of -Minfo. Nothing changes if you explicitly set the target to Sandy Bridge with the compiler switch -tp sandybridge-64.
Sigh, this is because our engineers emit the same message for all vectorization. On Sandy Bridge we are using AVX even though the message says SSE. I blame myself for this one since I should have noticed it long ago. I guess sometimes it takes a fresh set of eyes. I’ll ask that this be corrected (TPR#18773).
Note that I’m heading out on vacation for a few weeks, but I have let the application engineer who is covering the PGI UF for me know about your post. He’ll respond if you have any further questions or issues.