Installation on A100 cluster

Does NVIDIA have recommendations for best practice when installing Modulus on an HPC cluster with A100s to get optimal performance?

The cluster does not support Docker; I have tried converting the Docker image to Singularity, but it fails to run.

I’ve also tried the bare-metal installation using a Conda environment, but TensorFlow 1.15 (at least the version available through Conda) doesn’t seem to support the A100, so it would be using generic kernels that don’t take advantage of the A100’s features to give maximal performance.

Many thanks


Here are the steps I took to get Modulus working on an HPC cluster using V100s. The steps also worked on a machine with a single A100 when I last checked a few months ago.

Create virtual environment for SimNet

conda create --name SimNetv21 python=3.7

conda activate SimNetv21

Install prerequisites

pip install cmake

conda install -c anaconda gxx_linux-64

pip install horovod==0.21

conda install -c conda-forge tensorflow-gpu=1.15

Install SimNet

Now that the environment has been set up with the required prerequisites, you can follow the Bare metal installation instructions found within the SimNet user guide:

pip install matplotlib transforms3d future typing numpy quadpy \
    numpy-stl==2.11.2 h5py sympy==1.5.1 termcolor \
    psutil symengine==0.6.1 numba Cython chaospy

pip install -U https: //

tar -xvzf ./SimNet_source.tar.gz

cd ./SimNet/

python setup.py install
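Before moving on, it can be worth confirming that pip actually resolved the pinned versions listed above. A minimal sketch, assuming the packages were installed with pip so that they appear in `pip freeze` as `name==version` (packages installed through Conda may be listed differently); `check_pins` is a hypothetical helper name, not part of SimNet or pip:

```shell
# check_pins: read `pip freeze` output on stdin and verify that every
# requirement given as an argument (e.g. sympy==1.5.1) appears verbatim.
# Hypothetical helper for illustration; not part of SimNet or pip.
check_pins() {
    local freeze pin status=0
    freeze="$(cat)"    # capture stdin once so it can be searched repeatedly
    for pin in "$@"; do
        # -F fixed string, -x whole-line match, -i ignore case, -q quiet
        if ! printf '%s\n' "$freeze" | grep -Fxiq -- "$pin"; then
            echo "missing or different version: $pin" >&2
            status=1
        fi
    done
    return "$status"
}

# Usage (inside the activated Conda environment):
#   pip freeze | check_pins numpy-stl==2.11.2 sympy==1.5.1 symengine==0.6.1
```

A non-zero exit status means at least one pin was not found exactly as written, and the offending pins are printed to stderr.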

To run examples that use STL point cloud generation, you will need to add the PySDF build directory to your library path and install the accompanying PySDF library. This can be done with:


export LD_LIBRARY_PATH=$(pwd)/SimNet/external/pysdf/build/:${LD_LIBRARY_PATH}

cd ./SimNet/external/pysdf/

python setup.py install
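On a cluster the export usually ends up in a job script or shell profile that may be sourced more than once, which keeps prepending the same directory. A small sketch of an idempotent version; `prepend_library_path` is a hypothetical helper name, not something SimNet provides:

```shell
# prepend_library_path: add a directory to the front of LD_LIBRARY_PATH,
# but only if it is not already listed. Hypothetical helper for illustration.
prepend_library_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*)
            ;;  # already present, nothing to do
        *)
            # Append the old value (with a ':' separator) only if it was non-empty
            export LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
            ;;
    esac
}

# Usage, from the directory that contains SimNet/:
#   prepend_library_path "$(pwd)/SimNet/external/pysdf/build/"
```

The match is on the exact string, so the same directory written with and without a trailing slash counts as two different entries.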

Adjusting SimNet

To edit SimNet code, navigate to the SimNet source directory, /SimNet/simnet/, and edit or replace the desired files. Then update the installed SimNet package by reinstalling it, just as before:

cd ./SimNet/

python setup.py install

Configuring SimNet environment for HPC

When installing SimNet on the HPC cluster you may encounter some CUDA library issues. To resolve these, a symbolic link can be created that points TensorFlow to the location where CUDA is installed.

First, create a sandbox from the container. This sandbox gives you access to all the files needed to run SimNet.

singularity build --sandbox SimNetv21_sandbox docker-archive://simnet_image_v21.06.tar.gz

You can then upload the required CUDA files to your HPC space and create a symbolic link pointing to the needed CUDA library.

The symbolic link needs to be created in the directory of each SimNet case you want to run. For example, to run the Helmholtz example, create the link in that directory:

cd ./examples/Helmholtz

ln -s /u/… /SimNet_sandboxv21/usr/local/cuda ./cuda_sdk_lib

With the symbolic link in place, you can now run training as usual.
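Since the link has to exist in every case directory, a short loop can save some repetition. This is only a sketch: `link_cuda` is a hypothetical helper name, and the CUDA directory is passed in as an argument because it depends on where the sandbox was unpacked on your system.

```shell
# link_cuda: create a cuda_sdk_lib symbolic link in each given case directory.
# Usage: link_cuda <cuda_dir> <case_dir>...
# Hypothetical helper for illustration; not part of SimNet.
link_cuda() {
    local cuda_dir="$1"
    shift
    local case_dir
    for case_dir in "$@"; do
        # -s symbolic link, -f replace an existing link,
        # -n treat an existing link as a file instead of following it
        ln -sfn "$cuda_dir" "$case_dir/cuda_sdk_lib"
    done
}

# Usage (CUDA_DIR is a placeholder for the usr/local/cuda directory in your sandbox):
#   link_cuda "$CUDA_DIR" ./examples/Helmholtz
```

Re-running it is harmless, because an existing `cuda_sdk_lib` link is simply replaced.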


Hi Nason, thanks for your reply.

I’m performing the installation directly on the HPC cluster using Conda, so I’m not sure why building a Singularity image would help here. (I’m also using a Mac as my desktop, so can’t easily get a working CUDA version to test locally and then package with Singularity to put on the cluster.)

The instructions you gave closely match what I tried, but I did try them again in case I’d missed something. I believe at least part of the problem is specific to A100s: TensorFlow 1.15 doesn’t explicitly support compute capability 8.0, so when the A100 reports its compute capability, XLA falls back to compiling for compute capability 3.5.

Running the Helmholtz example, after a long period of seemingly doing nothing (and not using the GPU, presumably while kernels are compiled), I get the message:

2022-03-09 15:27:23.727203: W tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/] Unknown compute capability (8, 0) .Defaulting to telling LLVM that we're compiling for sm_35

Presumably this means that significant amounts of possible performance are being left on the table, due to the improvements in later compute capabilities that are being ignored.

However, TensorFlow then reports that the results from different GEMM implementations differ (by three orders of magnitude). I believe that, in general, running code compiled for a low compute capability on a higher one shouldn’t give different results, so I don’t know what is causing this issue. Here is an example of the lines I see; there are dozens like the second line and hundreds like the first.

2022-03-09 15:27:31.730677: E tensorflow/compiler/xla/service/gpu/] Difference at 9: 1007.33 vs 0.892267
2022-03-09 15:27:31.730694: E tensorflow/compiler/xla/service/gpu/] Results mismatch between different GEMM algorithms. This is likely a bug/unexpected loss of precision in cuBLAS.
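To put numbers on "dozens" and "hundreds" for a given job, the three message types quoted above can be counted directly from the Slurm output. A rough sketch, where `summarize_xla_log` is a hypothetical helper name and the search strings are taken from the log lines quoted in this post:

```shell
# summarize_xla_log: count the XLA messages discussed above in a log file.
# Usage: summarize_xla_log <logfile>
# Hypothetical helper for illustration; the patterns are copied from the
# messages quoted in this thread.
summarize_xla_log() {
    local log="$1"
    # grep -c prints the number of matching lines (0 if there are none)
    printf 'sm fallback warnings: %s\n' "$(grep -c 'Unknown compute capability' "$log")"
    printf 'element differences: %s\n' "$(grep -c 'Difference at' "$log")"
    printf 'GEMM mismatches: %s\n' "$(grep -c 'Results mismatch between different GEMM algorithms' "$log")"
}

# Usage:
#   summarize_xla_log slurm-7129520.out
```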

Presumably because of this failure, I then get a series of errors from layers of TensorFlow and Modulus, which are in the attached output file.

Do you/does anyone else know what specifically could cause these errors, how to get Modulus to get the full performance from the A100, and how to get it to run correctly on this example?

Many thanks


slurm-7129520.out (58.3 KB)
