ANSYS - There was an error while initializing the GPU library

Hi all,

Server vendor: SuperMicro
Server model: X10DRG-H
BIOS version: American Megatrends Inc. 3.0a
OS: Redhat Enterprise Linux 7.5 x64
CPU: Dual Intel® Xeon® CPU E5-2698 v4, total 40 physical cores
RAM: 1 TB
GPU: 1 NVIDIA Tesla K80 (2 GPU units)
GPU driver version: 396.37
Application: ANSYS Mechanical APDL 19.1.

Problem:
• When ANSYS Mechanical APDL 19.1 is launched using 22 or fewer MPI processes and 2 GPU units, everything works.
• When ANSYS Mechanical APDL 19.1 is launched using 23 or more MPI processes and 2 GPU units, it fails with the error:
There was an error while initializing the GPU library. Error code = 1.
Please check your Mechanical APDL installation. In many cases,
simply rebooting your machine may help get past this error.
• When ANSYS Mechanical APDL 19.1 is launched using 11 or fewer MPI processes and 1 GPU unit, everything works.
• When ANSYS Mechanical APDL 19.1 is launched using 12 or more MPI processes and 1 GPU unit, it fails with the same error as above.
Refer to the attached output files for the 22C+2G and 24C+2G runs. What we need to accomplish is running with 40 MPI processes using 2 GPU units.
We tried the following, but had no luck; we got the same error with 24C (or higher) + 2G:
• renaming the $HOME/.nv folder
• setting CUDA_CACHE_DISABLE=1
• -mpi ibmmpi (to use IBM MPI; the default is Intel MPI)
• setting CUDA_DEVICE_MAX_CONNECTIONS=40
• changing the working directory from NFS to a local disk
Most likely something limits the maximum number of MPI processes that can access the same GPU unit simultaneously.
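To make that pattern explicit, here is a quick back-of-the-envelope check (a hypothetical illustration based only on the pass/fail cases listed above, not anything from the ANSYS logs):

```python
# Pass/fail cases reported above: (MPI processes, GPU units, works?)
cases = [
    (22, 2, True),   # 22 MPI processes + 2 GPU units: works
    (23, 2, False),  # 23 MPI processes + 2 GPU units: fails
    (11, 1, True),   # 11 MPI processes + 1 GPU unit: works
    (12, 1, False),  # 12 MPI processes + 1 GPU unit: fails
]

# Compute MPI processes per GPU unit for each case.
for procs, gpus, works in cases:
    per_gpu = procs / gpus
    status = "works" if works else "fails"
    print(f"{procs:2d} processes / {gpus} GPU(s) = {per_gpu:4.1f} per GPU: {status}")

# Every working case has at most 11 processes per GPU unit; every failing
# case has more, which is what suggests a per-GPU process/resource limit.
```

If that reading is right, the desired 40-process run would put roughly 20 processes on each GPU unit, well past the observed threshold.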

There is no such issue on a GPU machine at ANSYS:
Server vendor: Dell
Server model: PowerEdge R730
BIOS version: 2.3.4
OS: Redhat Enterprise Linux 6.7 x64
CPU: Dual Intel® Xeon® CPU E5-2690 v4, total 28 physical cores
RAM: 512 GB
GPU: 2 NVIDIA Tesla K80 (total 4 GPU units)
GPU driver version: 396.37
ANSYS 19.1 can run with 28C+1G, 28C+2G, and 28C+4G without any issue.

Hopefully a CUDA expert can help debug this issue.

Hunter

What vendor provided the GPU library that throws the error? Whoever that is should be able to tell you what possible causes lead to error code 1.

Hi njuffa,

I apologize for my ignorance, but I don’t know who provides the GPU library for my scenario. We’re using OEM NVIDIA drivers (the latest for the Titan V and K80 GPUs that we’re using) along with CUDA 9.0. The program we’re running is ANSYS Mechanical 19.1. Would you happen to know who the library supplier would be in the above scenario?

According to your log, the entire error message is:

“There was an error while initializing the GPU library. Error code = 1. Please check your Mechanical APDL installation. In many cases, simply rebooting your machine may help get past this error.”

The language of the error message is quite ambiguous. It does not say “GPU library failed to initialize”. It says an error occurred “while initializing the GPU library”, which could mean anything, such as failure to read a configuration file belonging to your software product, or a failing network connection. Equally mysterious is the recommendation to reboot the machine (which you have tried, I trust).

A Google search tells me that “Mechanical APDL” is an ANSYS product. So it seems to me that you should start with contacting ANSYS technical support, or filing a bug report with ANSYS. You could ask them: Under which conditions does their software throw this error message, and which GPU library is involved? I don’t see any indication that this refers to an NVIDIA-provided software component (in which case the error message might ask users to check whether the correct NVIDIA driver or CUDA version have been installed, rather than suggesting a reboot).

I Googled the error message and could find only one instance outside this thread, on a Japanese site. So I am guessing this is a rare error few people encounter.

It is indeed a very odd error. When we watch nvidia-smi while the solve is initializing, it gets all the way to the point where it establishes all of the processes and allocates GPU memory, but it fails soon after.

Yes, we have tried rebooting, and we’ve even tried it on different machines with different GPUs (Titan V vs. K80). We have been working with Hunter from ANSYS for weeks on this issue (he’s the one who summarized it in the first post of this thread). After troubleshooting the issue and bringing in the ANSYS developers, he indicated that we can’t make any more progress without the help of a CUDA subject-matter expert who can help us locate where the issue originates.

If I had the bandwidth I’d download the CUDA toolkit and start learning how everything interacts, but at this point it’s beyond me. Any ideas?

It is always beneficial to provide all relevant context up front. You now state the ANSYS software developers have taken a look at this issue.

They should be able to see exactly where in their software stack this error is triggered. If they can trace it back conclusively to an NVIDIA software component (say, an error returned by a specific CUSPARSE API call), and the error is not triggered by the data passed to the library (say, a malformed matrix), they should file a bug with NVIDIA, as they know which software component they call, in what context, with what data. You do not know that, we do not know that, nor do we know anything about the nature of the error. (Does it indicate a resource constraint? A hardware failure? A licensing issue? Why would a reboot cure it?)

ANSYS appears to be an NVIDIA partner, or at least this page seems to suggest that: https://www.nvidia.com/object/tesla-ansys-accelerations.html If so, they should have a designated contact at NVIDIA they can use to follow up with them.

As a software vendor, the first step to bug resolution is normally independent in-house repro. To your knowledge, have ANSYS been able to achieve that? Replicating a customer’s hardware / software configuration as part of that process is often a bit of a challenge, especially if it involves uncommon components. It may take a few weeks to set up an equivalent system in-house.

Maybe ANSYS don’t have your particular hardware platform available to them at all (I have no idea how common this particular Supermicro platform is), making this especially difficult. If so, you could consider giving them access to your hardware.

My apologies for not providing all of the relevant context up front. Below is a response that I received from the developer:

“The error message “There was an error while initializing the GPU library” is caused by a non-zero error code returned from either cublasCreate or cusparseCreate, both of which are functions from NVIDIA.”

They have tried to reproduce the issue on the Dell HPC, but were not able to. Hardware access might be the next step…
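For what it’s worth, if the “Error code = 1” in the ANSYS message is the raw status returned by cublasCreate (an assumption on my part; ANSYS may remap the code before printing it), it would correspond to CUBLAS_STATUS_NOT_INITIALIZED, which generally indicates that the CUDA runtime itself could not be initialized rather than a problem with the call’s arguments. A small lookup sketch, with enum values assumed from the CUDA 9.x cublas_api.h header:

```python
# cublasStatus_t values, assumed from the CUDA 9.x cublas_api.h header;
# verify against the header shipped with your installed toolkit.
CUBLAS_STATUS = {
    0:  "CUBLAS_STATUS_SUCCESS",
    1:  "CUBLAS_STATUS_NOT_INITIALIZED",
    3:  "CUBLAS_STATUS_ALLOC_FAILED",
    7:  "CUBLAS_STATUS_INVALID_VALUE",
    8:  "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
}

def decode(code):
    """Map a raw cuBLAS status code to its enum name, if known."""
    return CUBLAS_STATUS.get(code, f"unknown status {code}")

print(decode(1))  # the code reported in the ANSYS error message
```

cusparseCreate uses a separate cusparseStatus_t enum, but its value 1 is likewise the not-initialized case, so either origin would point in the same direction.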

One Terabyte of system memory with K80 may be a problem.

As a test, you could try reducing the system memory on the failing SMC config to 512GB and see if the issue still occurs or goes away.

https://devtalk.nvidia.com/default/topic/1017212/cuda-programming-and-performance/tesla-k40-1tb-ram-problem/

It seems @txbob posted a better idea for debugging this almost simultaneously.

It would have been useful to include that information in the original post. Looking at the documentation, one reason for those API calls to fail is an out-of-resources condition. I am not sure what those conditions would be; presumably memory is at least one of the constraints.

According to the information in the OP, the Dell machine has twice as many GPUs as your machine, so that may be a reason the error is not reproducible. Have they tried pulling one of the K80s out of their Dell system to see whether that causes the problem to reproduce on their end?

@txbob Thank you for the suggestion. We’ll try stepping it down to 512GB and see if the issue still occurs. In the interim, can you provide a new link for that Lenovo forum? I tried clicking the link for the “1TB exactly” suggestion, but the URL could not be found.

@njuffa We have not tried pulling a GPU out of the other machine, but we have tried adding a K80 to our machine, which didn’t work. As indicated above, we’re currently setting up to test with reduced RAM.

Thanks guys!

It looks like the Lenovo forum entry is gone.

However the NVIDIA GPU Linux driver documentation covers this:

https://us.download.nvidia.com/XFree86/Linux-x86/331.20/README/addressingcapabilities.html

Here are some excerpts:

"Tesla: addressing capability of 1 Terabyte (40 bits), applying to all Tesla GPUs (minus the following exceptions)"

(This excerpt is from an older driver release (331.20), so it did not have Pascal/Volta etc. in view. Pascal/Volta have larger addressing capabilities in this respect.)

And:

"For example, it is common for a system with 512 GB of RAM installed to have physical addresses up to ~513 GB. "

By extension, a 1 TB system may have mapped addresses in excess of 1TB. This is why even though the Tesla 40-bit case “seems” to cover 1 Terabyte, it really covers 1 Terabyte of address space, not necessarily 1 Terabyte of installed RAM. If you read the documentation, you will see that the driver may make some attempts to make this edge case workable, but nevertheless I recommend the test to rule this out as a possible contributing factor.
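The arithmetic behind that edge case can be made concrete (a hypothetical sketch; the ~1 GB of slack is an assumed figure in the spirit of the “512 GB installed, ~513 GB of addresses” example from the README):

```python
# A 40-bit physical address limit reaches exactly 1 TiB of address space.
ADDR_BITS = 40
addressable = 1 << ADDR_BITS   # 2^40 bytes reachable by the GPU
one_tib = 1024 ** 4            # 1 TiB of installed RAM, in bytes
print(addressable == one_tib)  # the limit exactly equals the installed RAM

# However, MMIO apertures and firmware reservations push the highest
# physical address above the installed-RAM size (e.g. ~513 GB of address
# space on a 512 GB system, per the README excerpt above). Assume ~1 GB
# of such slack on the 1 TB system:
slack = 1024 ** 3
highest_phys_addr = one_tib + slack
print(highest_phys_addr > addressable)  # exceeds what a 40-bit GPU can reach
```

So installed RAM exactly at the addressing limit leaves no headroom for the remapped regions, which is why 1 TB systems are the edge case and 512 GB systems are not.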

Bob!!! You nailed it! We just finished testing the system with 512GB and it works like a charm.

Per the prior thread you responded to on this issue, do you agree that a P100 would be able to handle the 1TB of RAM? Might be time for an upgrade…

Thanks for your help.

Tesla GPUs of the Pascal and Volta family should not be subject to the limitations associated with 1TB system memory. They have addressable ranges of over 100TB.

I recommend that Tesla GPUs only be used in properly configured systems that were qualified by the system vendor for use of that GPU. So I’m not necessarily suggesting that you can simply drop a P100 or V100 into your current system. You should check compatibility, and preferably acquire it from the system OEM properly configured with the desired GPU(s) installed.

My guess is that this system didn’t ship from SMC configured this way.

As a general matter, NVIDIA usually recommends V100 over P100. Yes, V100 generally costs more, but it should also be faster on a variety of workloads. Whether it is faster in a meaningful way for ANSYS Mechanical or your particular workload, I can’t say. V100 carries a number of other improvements over P100, such as up to 32GB of memory, improved MPS performance, access to Tensor Cores for significant additional acceleration of deep learning workloads, and other benefits.

Choose whatever you wish, of course.