Complex Blas and Lapack, and accelerators

A friend suggested I have a look at your compiler, and from what I read, it looks very good. There are just a few things I would like to know more about.

I have developed some code in Matlab that I wish to convert to C++ and parallelize. The code consists largely of matrix multiplications and “divisions” (A\B; Matlab’s mldivide and mrdivide, which essentially solve a linear system for each column of B). The matrices are mostly complex and range from quite small to perhaps 2000x2000 elements at most, but there are a lot of these operations, so they should be parallelized, using accelerators if possible.

Can the (complex) Blas and Lapack/ScaLapack libraries that come with PGI use accelerators?

From what I read on the forum, there is no support for the C++ complex data type when using accelerators. Do I have to use Fortran in order to get the benefit of accelerators?

Finally, I would like to get a little familiar with using Blas and Lapack before purchasing the PGI compiler. (This is really a side project and will take some time, and since new features continue to be added to the PGI compiler, it seems to make more sense to wait until my code is closer to “production ready”.) It would therefore be great if you know of any Blas and Lapack libraries that could be used with gcc and that have the same calling syntax as the PGI libraries. That way it would be easier to port to PGI when the time comes. Or would this just create a lot of extra work for me?

Any information and advice welcome.

Regards,
Bjørn

Hi Bjørn,

Can the (complex) Blas and Lapack/ScaLapack libraries that come with PGI use accelerators?

We ship NVIDIA’s cuBLAS library, but you can also use the CULA or MAGMA libraries. Though, I’m not positive which routines are supported for use with C++ complex data types. Please consult the documentation for each library to see whether it supports the particular routines you wish to use. These should be usable with GNU as well as PGI.

From what I read on the forum, there is no support for the C++ complex data type when using accelerators. Do I have to use Fortran in order to get the benefit of accelerators?

The problem is that C++ complex is implemented as an STL class template (std::complex), not a fundamental data type. So yes, in the near term you would need to use Fortran in order to use complex data on the accelerator. Hopefully we can add OpenACC support for std::complex in the future.

Hope this helps,
Mat

Thanks for the reply, Mat.

Since I’m starting the port to C from scratch, I’m not tied to any specific complex data type. It would, however, be nice to be able to use the same datatype throughout, and that requires a complete complex math library (abs, sin, cos, tan, pow…). It is not a requirement, though: that part of the code is relatively small, and the penalty of using a different datatype for just that part (so as to be able to use another complex library) would be small.

MAGMA seems to be a good choice, since it exists for both NVIDIA and AMD. But as far as I can tell, separate versions are required. The ideal would be one program that could run Blas and Lapack on the GPU, independent of the make of the GPU. This is how OpenACC works, if I understand it correctly. But there doesn’t seem to be a linear algebra library that does this?

Would there be a way to make the PGI compiler compile a program that will use the appropriate version of MAGMA depending on the GPU, and use the CPU if no compatible GPU exists?

Regards,
Bjørn

Another question: when using OpenACC in C, what would be the preferred choice when it comes to complex data types?

As I understand it, the great advantage of OpenACC is that the programmer doesn’t need to consider what graphics card the user has. But CUDA has one complex data type and OpenCL another, from what I can see. Are they equivalent?

Regards,
Bjørn

Hi Bjørn,

Would there be a way to make the PGI compiler compile a program that will use the appropriate version of MAGMA depending on the GPU, and use the CPU if no compatible GPU exists?

This wouldn’t be something the compiler could do. The library would need to either run the appropriate underlying device code or have different shared libraries that could be used depending upon the target. I’m not sure if MAGMA supports this and would suggest contacting them.

when using OpenACC in C, what would be the preferred choice when it comes to complex data types?

You can use the C99 Complex data type directly in your kernels. Though, when targeting Radeon, please use the LLVM back-end instead of OpenCL (-ta=radeon:llvm). Note that LLVM will be the default beginning with the 15.1 compiler.

But CUDA has one complex data type and OpenCL another, from what I can see. Are they equivalent?

I’m not sure if they are equivalent. In OpenACC, we implemented complex as a struct of doubles and manage it ourselves instead of relying on either CUDA or OpenCL.

Note that we do have a limitation that OpenACC “routine” functions can’t return structs. Hence, you can’t have an OpenACC “routine” that returns a complex.

-Mat