General Question Regarding the Frameworks Support Matrix of NGC Containers

Hi,

I have a general question about using NGC containers on any computer. I needed CUDA 9.2 and PyTorch 0.4.1 with Ubuntu 16.04 on my GeForce RTX 3080 laptop (based on the Ampere architecture, I believe). Based on these requirements, I chose
the container nvcr.io/nvidia/pytorch:18.06-py3 (which supports the Volta and Pascal architectures)
for my experiment, even though it does not support Ampere hardware.

I found the following observations rather fishy:

  • After building the Docker image, I found that torch was installed inside a conda environment, but with version 0.5.1 instead of the 0.4.1 listed in the Frameworks Support Matrix.

  • Running PyTorch on CUDA appears to work. I confirmed this with torch.cuda.is_available(), which returns True, so PyTorch recognizes that the machine has a GPU.

  • Running a convolutional network freezes at the nn.Conv2d() call and returns CUDNN_STATUS_EXECUTION_FAILED after 5 to 10 minutes (a minimal reproduction is sketched after this list).

  • Running the same model on the CPU works without any problem, although it is painfully slow.
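
For reference, the check I ran was along these lines (the exact layer and tensor sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Confirm that PyTorch sees the GPU at all.
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))

# A tiny convolution on the GPU: this forward pass is where the
# container hangs and eventually fails with CUDNN_STATUS_EXECUTION_FAILED.
conv = nn.Conv2d(3, 8, kernel_size=3).cuda()
x = torch.randn(1, 3, 32, 32).cuda()
y = conv(x)
print("Conv output shape:", y.shape)
```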

I would like to know whether this is caused by a cuDNN mismatch with the hardware. If so, is there a way to get an image that meets my requirements on the Ampere architecture?
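
For what it is worth, the build-versus-hardware mismatch can be inspected from inside the container with something like the following (Ampere GPUs such as the RTX 3080 report compute capability 8.6, while a CUDA 9.2 build of PyTorch only ships kernels up to Volta):

```python
import torch

# Versions the container's PyTorch build was compiled against.
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())

# What the hardware reports: 8.6 on an RTX 3080, newer than any
# architecture a CUDA 9.2 / cuDNN 7.x build knows about.
print("Compute capability:", torch.cuda.get_device_capability(0))
```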

Thanks a lot for the help,
Jeethesh

In my limited experience, the documentation for the containers with regard to environment details is flat-out wrong. The PyTorch container, for example, doesn't have a conda environment, doesn't have Python 3.5, and doesn't have Python 3.8, all of which are variously claimed in different documents inside and outside the container. In truth, it has a system Python 3.10 with all the packages installed.
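
Running a few lines inside the container shows what it actually ships, independent of the documentation:

```python
import sys
import torch

# What the container actually provides, regardless of what the docs claim.
print("Interpreter:", sys.executable)   # e.g. a system path rather than a conda env
print("Python:", sys.version)
print("PyTorch:", torch.__version__)
```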