ANSYS - There was an error while initializing the GPU library

Hi all,

Server vendor: SuperMicro
Server model: X10DRG-H
BIOS version: American Megatrends Inc. 3.0a
OS: Redhat Enterprise Linux 7.5 x64
CPU: Dual Intel® Xeon® CPU E5-2698 v4, total 40 physical cores
RAM: 1 TB
GPU: 1 NVIDIA Tesla K80 (2 GPU units)
GPU driver version: 396.37
Application: ANSYS Mechanical APDL 19.1.

Problem:
• When ANSYS Mechanical APDL 19.1 is launched using 22 or fewer MPI processes and 2 GPU units, everything works.
• When ANSYS Mechanical APDL 19.1 is launched using 23 or more MPI processes and 2 GPU units, it fails with the error:
There was an error while initializing the GPU library. Error code = 1.
Please check your Mechanical APDL installation. In many cases,
simply rebooting your machine may help get past this error.
• When ANSYS Mechanical APDL 19.1 is launched using 11 or fewer MPI processes and 1 GPU unit, everything works.
• When ANSYS Mechanical APDL 19.1 is launched using 12 or more MPI processes and 1 GPU unit, it fails with the same error as above.
Refer to the attached output files for the 22C+2G and 24C+2G runs. What we need to accomplish is running with 40 MPI processes using 2 GPU units.
We tried the following, but had no luck; we got the same error with 24C (or higher) + 2G:
• renaming the $HOME/.nv folder
• setting CUDA_CACHE_DISABLE=1
• -mpi ibmmpi (to use IBM MPI; the default is Intel MPI)
• setting CUDA_DEVICE_MAX_CONNECTIONS=40
• changing the working directory from NFS to a local disk
Most likely something limits the maximum number of MPI processes that can access the same GPU unit simultaneously.
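To make that pattern explicit, here is a quick back-of-the-envelope check (a hypothetical illustration based only on the pass/fail cases listed above, not anything from the ANSYS logs):

```python
# Pass/fail cases reported above: (MPI processes, GPU units, works?)
cases = [
    (22, 2, True),   # 22 MPI processes + 2 GPU units: works
    (23, 2, False),  # 23 MPI processes + 2 GPU units: fails
    (11, 1, True),   # 11 MPI processes + 1 GPU unit: works
    (12, 1, False),  # 12 MPI processes + 1 GPU unit: fails
]

# Compute MPI processes per GPU unit for each case.
for procs, gpus, works in cases:
    per_gpu = procs / gpus
    status = "works" if works else "fails"
    print(f"{procs:2d} processes / {gpus} GPU(s) = {per_gpu:4.1f} per GPU: {status}")

# Every working case has at most 11 processes per GPU unit; every failing
# case has more, which is what suggests a per-GPU process/resource limit.
```

If that reading is right, the desired 40-process run would put roughly 20 processes on each GPU unit, well past the observed threshold.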

There is no such issue on a GPU machine at ANSYS:
Server vendor: Dell
Server model: PowerEdge R730
BIOS version: 2.3.4
OS: Redhat Enterprise Linux 6.7 x64
CPU: Dual Intel® Xeon® CPU E5-2690 v4, total 28 physical cores
RAM: 512 GB
GPU: 2 NVIDIA Tesla K80 (total 4 GPU units)
GPU driver version: 396.37
ANSYS 19.1 can run with 28C+1G, 28C+2G, and 28C+4G without any issue.

Hopefully a CUDA expert can help debug this issue.

Hunter

What vendor provided the GPU library that throws the error? Whoever that is should be able to tell you what possible causes lead to error code 1.

Hi njuffa,

I apologize for my ignorance, but I don’t know who provides the GPU library for my scenario. We’re using OEM NVIDIA drivers (the latest for the Titan V and K80 GPUs that we’re using) along with CUDA 9.0. The program we’re running is ANSYS Mechanical 19.1. Would you happen to know who the library supplier would be in the above scenario?

According to your log, the entire error message is:

“There was an error while initializing the GPU library. Error code = 1. Please check your Mechanical APDL installation. In many cases, simply rebooting your machine may help get past this error.”

The language of the error message is quite ambiguous. It does not say “GPU library failed to initialize”. It says an error occurred “while initializing the GPU library”, which could mean anything, such as failure to read a configuration file belonging to your software product, or a failing network connection. Equally mysterious is the recommendation to reboot the machine (which you have tried, I trust).

A Google search tells me that “Mechanical APDL” is an ANSYS product. So it seems to me that you should start with contacting ANSYS technical support, or filing a bug report with ANSYS. You could ask them: Under which conditions does their software throw this error message, and which GPU library is involved? I don’t see any indication that this refers to an NVIDIA-provided software component (in which case the error message might ask users to check whether the correct NVIDIA driver or CUDA version have been installed, rather than suggesting a reboot).

I Googled the error message and could find only one instance outside this thread, on a Japanese site. So I am guessing this is a rare error few people encounter.

It is indeed a very odd error. When we watch nvidia-smi while the solve is initializing, it gets all the way to the point where it establishes all of the processes and allocates GPU memory, but it fails soon after.

Yes, we have tried rebooting, and we’ve even tried it on different machines with different GPUs (Titan V vs. K80). We have been working with Hunter from ANSYS for weeks on this issue (he’s the one who summarized it in the first post of this thread). After troubleshooting the issue and bringing in the ANSYS developers, he indicated that we can’t make any more progress without the help of a CUDA subject-matter expert who can help us locate where the issue originates.

If I had the bandwidth I’d download the CUDA toolkit and start learning how everything interacts, but at this point it’s beyond me. Any ideas?

It is always beneficial to provide all relevant context up front. You now state the ANSYS software developers have taken a look at this issue.

They should be able to see exactly where in their software stack this error is triggered. If they can trace it back conclusively to an NVIDIA software component (say, an error returned by a specific CUSPARSE API call), and the error is not triggered by the data passed to the library (say, a malformed matrix), they should file a bug with NVIDIA, as they know which software component they call, in what context, with what data. You do not know that, we do not know that, nor do we know anything about the nature of the error. (Does it indicate a resource constraint? A hardware failure? A licensing issue? Why would a reboot cure it?)

ANSYS appears to be an NVIDIA partner, or at least this page seems to suggest that: https://www.nvidia.com/object/tesla-ansys-accelerations.html If so, they should have a designated contact at NVIDIA they can use to follow up with them.

As a software vendor, the first step to bug resolution is normally independent in-house repro. To your knowledge, have ANSYS been able to achieve that? Replicating a customer’s hardware / software configuration as part of that process is often a bit of a challenge, especially if it involves uncommon components. It may take a few weeks to set up an equivalent system in-house.

Maybe ANSYS don’t have your particular hardware platform available to them at all (I have no idea how common this particular Supermicro platform is), making this especially difficult. If so, you could consider giving them access to your hardware.

My apologies for not providing all of the relevant context up front. Below is a response that I received from the developer:

“The error message “There was an error while initializing the GPU library” is caused by a non-zero error code returned from either cublasCreate or cusparseCreate, both of which are functions from NVIDIA.”

They have tried to reproduce the issue on the Dell HPC, but were not able to. Hardware access might be the next step…
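For what it’s worth, if the “Error code = 1” in the ANSYS message is the raw status returned by cublasCreate (an assumption on my part; ANSYS may remap the code before printing it), it would correspond to CUBLAS_STATUS_NOT_INITIALIZED, which generally indicates that the CUDA runtime itself could not be initialized rather than a problem with the call’s arguments. A small lookup sketch, with enum values assumed from the CUDA 9.x cublas_api.h header:

```python
# cublasStatus_t values, assumed from the CUDA 9.x cublas_api.h header;
# verify against the header shipped with your installed toolkit.
CUBLAS_STATUS = {
    0:  "CUBLAS_STATUS_SUCCESS",
    1:  "CUBLAS_STATUS_NOT_INITIALIZED",
    3:  "CUBLAS_STATUS_ALLOC_FAILED",
    7:  "CUBLAS_STATUS_INVALID_VALUE",
    8:  "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
}

def decode(code):
    """Map a raw cuBLAS status code to its enum name, if known."""
    return CUBLAS_STATUS.get(code, f"unknown status {code}")

print(decode(1))  # the code reported in the ANSYS error message
```

cusparseCreate uses a separate cusparseStatus_t enum, but its value 1 is likewise the not-initialized case, so either origin would point in the same direction.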

One Terabyte of system memory with K80 may be a problem.

As a test, you could try reducing the system memory on the failing SMC config to 512GB and see if the issue still occurs or goes away.

https://devtalk.nvidia.com/default/topic/1017212/cuda-programming-and-performance/tesla-k40-1tb-ram-problem/

It seems @txbob posted a better idea for debugging this almost simultaneously.

It would have been useful to include that information in the original post. Looking at the documentation, one reason for those API calls to fail is an out-of-resources condition. I am not sure what those conditions would be; presumably memory is at least one of the constraints.

According to the information in the OP, the Dell machine has twice as many GPUs as your machine, so that may be a reason the error is not reproducible. Have they tried pulling one of the K80s out of their Dell system to see whether that causes the problem to reproduce on their end?

@txbob Thank you for the suggestion. We’ll try stepping it down to 512GB and see if the issue still occurs. In the interim, can you provide a new link for that Lenovo forum? I tried clicking the link for the “1TB exactly” suggestion, but the URL could not be found.

@njuffa We have not tried pulling a GPU out of the other machine, but we have tried adding a K80 to our machine, which didn’t work. As indicated above, we’re currently setting up to test with reduced RAM.

Thanks guys!

It looks like the Lenovo forum entry is gone.

However the NVIDIA GPU Linux driver documentation covers this:

https://us.download.nvidia.com/XFree86/Linux-x86/331.20/README/addressingcapabilities.html

Here are some excerpts:

"Tesla: addressing capability of 1 Terabyte (40 bits), applying to all Tesla GPUs (minus the following exceptions)"

(This excerpt is from an older driver release (331.20), so it did not have Pascal/Volta etc. in view. Pascal/Volta have larger addressing capabilities in this respect.)

And:

"For example, it is common for a system with 512 GB of RAM installed to have physical addresses up to ~513 GB. "

By extension, a 1 TB system may have mapped addresses in excess of 1TB. This is why even though the Tesla 40-bit case “seems” to cover 1 Terabyte, it really covers 1 Terabyte of address space, not necessarily 1 Terabyte of installed RAM. If you read the documentation, you will see that the driver may make some attempts to make this edge case workable, but nevertheless I recommend the test to rule this out as a possible contributing factor.
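The arithmetic behind that edge case can be made concrete (a hypothetical sketch; the ~1 GB of slack is an assumed figure in the spirit of the “512 GB installed, ~513 GB of addresses” example from the README):

```python
# A 40-bit physical address limit reaches exactly 1 TiB of address space.
ADDR_BITS = 40
addressable = 1 << ADDR_BITS   # 2^40 bytes reachable by the GPU
one_tib = 1024 ** 4            # 1 TiB of installed RAM, in bytes
print(addressable == one_tib)  # the limit exactly equals the installed RAM

# However, MMIO apertures and firmware reservations push the highest
# physical address above the installed-RAM size (e.g. ~513 GB of address
# space on a 512 GB system, per the README excerpt above). Assume ~1 GB
# of such slack on the 1 TB system:
slack = 1024 ** 3
highest_phys_addr = one_tib + slack
print(highest_phys_addr > addressable)  # exceeds what a 40-bit GPU can reach
```

So installed RAM exactly at the addressing limit leaves no headroom for the remapped regions, which is why 1 TB systems are the edge case and 512 GB systems are not.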

Bob!!! You nailed it! We just finished testing the system with 512GB and it works like a charm.

Per the prior thread you responded to on this issue, do you agree that a P100 would be able to handle the 1TB of RAM? Might be time for an upgrade…

Thanks for your help.

Tesla GPUs of the Pascal and Volta family should not be subject to the limitations associated with 1TB system memory. They have addressable ranges of over 100TB.

I recommend that Tesla GPUs only be used in properly configured systems that were qualified by the system vendor for use of that GPU. So I’m not necessarily suggesting that you can simply drop a P100 or V100 into your current system. You should check compatibility, and preferably acquire it from the system OEM properly configured with the desired GPU(s) installed.

My guess is that this system didn’t ship from SMC configured this way.

As a general matter, NVIDIA usually recommends V100 over P100. Yes, V100 generally costs more, but it should also be faster on a variety of workloads. Whether it is faster in a meaningful way for ANSYS Mechanical or your particular workload, I can’t say. V100 carries a number of other improvements over P100, such as up to 32GB of memory, improved MPS performance, access to Tensor Cores for significant additional acceleration of deep learning workloads, and other benefits.

Choose whatever you wish, of course.