NVIDIA DGX A100 Problems

1] The DGX A100 does not support older CUDA versions such as 10.2 and 9.0.
2] When I run the updated code with CUDA 11, the model does not train properly: the loss shows “NaN” values during training.

Sorry, @2018eez0001. Have you looked at https://docs.nvidia.com/cuda/ampere-compatibility-guide/ to see if that gives you any path to running your CUDA 10 code on the DGX A100? CUDA 11 is the first release that supports the A100 GPUs, hence the CUDA 11 requirement.
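To make the version requirement concrete, here is a minimal Python sketch of the check involved. It assumes a small hand-written table of the first CUDA toolkit release able to generate native code for a few compute capabilities (the A100 is compute capability 8.0). The table and the helper names `parse_version` and `toolkit_supports` are illustrative, not part of any NVIDIA tool, and the sketch does not cover the PTX-based compatibility path described in the guide linked above.

```python
# First CUDA toolkit release that can generate native code for each
# compute capability. Illustrative subset, not an exhaustive list.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing, e.g. T4
    (8, 0): (11, 0),  # Ampere, e.g. A100
}


def parse_version(text):
    """Turn a string like '10.2' into (10, 2) so versions compare numerically."""
    parts = text.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def toolkit_supports(cuda_version, capability):
    """Return True if this toolkit release can target the capability natively."""
    return parse_version(cuda_version) >= MIN_CUDA_FOR_CAPABILITY[capability]


if __name__ == "__main__":
    a100 = (8, 0)
    for version in ("9.0", "10.2", "11.0"):
        print(version, toolkit_supports(version, a100))
```

Under these assumptions, only the CUDA 11 toolkit passes the check for the A100, which matches the requirement described above.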

For the CUDA 11 training issue, I’d recommend two things:

  1. See if you can start with a framework container from NGC (ngc.nvidia.com) rather than rolling your own framework. I know this isn’t always possible, but it really is the easy way to get going.
  2. Contact NVIDIA Enterprise Support (see the pinned message in this forum) and let them help! As a DGX customer, you have access to our Enterprise Support team, which has the full weight of NVIDIA behind it to understand what’s going on with your code and help figure out why it’s breaking.
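Whichever route you take, it may help to know the exact step at which the loss first turns into NaN, since that is useful detail for a support request. The following is a framework-independent sketch: `first_non_finite` is a hypothetical helper, and it assumes you have already collected the per-step loss values as plain Python floats.

```python
import math


def first_non_finite(losses):
    """Return (step, value) for the first NaN or infinite loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step, value
    return None


if __name__ == "__main__":
    # Example loss history (made-up values) where training diverges at step 2.
    history = [0.92, 0.71, float("nan"), float("nan")]
    hit = first_non_finite(history)
    if hit is None:
        print("all losses are finite")
    else:
        print("first non-finite loss at step", hit[0])
```

Knowing whether the NaN appears at the very first step or only after many steps can help narrow down where to look.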