NVIDIA A5000 and CUDA 10.2 without AVX

We have an A5000 in a machine without AVX support, which is the machine our platform has been developed on. The A5000 requires CUDA 11 or later, and as far as we know the CUDA 11 libraries are compiled with AVX, so we can’t use them. We can run CUDA 10.2, but we get kernel errors in both TensorFlow and PyTorch, and we believe this might be because CUDA 10.2 does not support the A5000’s SM 8.6 architecture. We tried compiling PyTorch and TF for the SM versions that CUDA 10.2 does support, but we still get the same errors. Can anyone help?
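
For context, this is how we confirmed the CPU has no AVX (a minimal Linux-only sketch that just reads /proc/cpuinfo):

# Minimal sketch (Linux-only): list which SIMD features the CPU reports.
flags = set()
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for feat in ("sse2", "sse4_1", "sse4_2", "avx", "avx2"):
    print(f"{feat:8s}", "yes" if feat in flags else "NO")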

AFAIK, AVX is required by cuDNN 8.1 and up, not by CUDA itself. CUDA 10 won’t work on your A5000. I guess your only chance is CUDA 11.1 + cuDNN 8.0.5, and then recompiling TF/PyTorch without AVX.

Thanks - will try this and let you know.

Sorry it took me so long to get back to you.

I have installed CUDA 11.1 and cuDNN 8.0.5, and compiled TF 2.4.4 from source without AVX instructions. I can use the GPU fine for calculations, cuDNN works too, TF detects the device, and I can run some basic calculations on the GPU. However, I get kernel crashes every time I try to train a CNN. I have tried with code that works on another machine, and that works on the same machine in CPU-only mode. Would you have any idea what may cause this issue?
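
For reference, even a toy CNN like the sketch below is enough to trigger the crash (illustrative only, not my actual code — that is a normal Keras model that runs fine elsewhere):

import numpy as np
import tensorflow as tf

# Toy CNN on random data -- any small Conv2D model behaves the same here.
print(tf.config.list_physical_devices("GPU"))  # the GPU is detected fine

x = np.random.rand(32, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=(32,))

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(x, y, epochs=1)  # the kernel dies as soon as training starts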

The logs always show the same message:

2023-03-13 12:31:13.603719: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcuda.so.1
2023-03-13 12:31:14.769305: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:0b:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2023-03-13 12:31:14.769365: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-13 12:31:14.774216: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-03-13 12:31:14.774293: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-03-13 12:31:14.775727: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-03-13 12:31:14.776064: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-03-13 12:31:14.780685: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2023-03-13 12:31:14.781758: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-03-13 12:31:14.781952: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-03-13 12:31:14.782487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0

warn 12:31:15.865: StdErr from Kernel Process 2023-03-13 12:31:14.792350: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2023-03-13 12:31:14.792820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:0b:00.0 name: NVIDIA RTX A5000 computeCapability: 8.6
coreClock: 1.695GHz coreCount: 64 deviceMemorySize: 23.68GiB deviceMemoryBandwidth: 715.34GiB/s
2023-03-13 12:31:14.792860: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-13 12:31:14.792911: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2023-03-13 12:31:14.792957: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2023-03-13 12:31:14.793002: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2023-03-13 12:31:14.793048: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2023-03-13 12:31:14.793093: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.11
2023-03-13 12:31:14.793139: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2023-03-13 12:31:14.793185: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2023-03-13 12:31:14.793667: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2023-03-13 12:31:14.793724: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-13 12:31:15.852201: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-03-13 12:31:15.852272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2023-03-13 12:31:15.852285: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2023-03-13 12:31:15.853472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22450 MB memory) -> physical GPU (device: 0, name: NVIDIA RTX A5000, pci bus id: 0000:0b:00.0, compute capability: 8.6)

warn 12:31:17.367: StdErr from Kernel Process 2023-03-13 12:31:17.367153: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)

warn 12:31:17.393: StdErr from Kernel Process 2023-03-13 12:31:17.393314: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2397175000 Hz

warn 12:31:17.908: StdErr from Kernel Process 2023-03-13 12:31:17.907978: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8

warn 12:31:22.491: StdErr from Kernel Process 2023-03-13 12:31:22.490613: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11

warn 12:31:23.855: StdErr from Kernel Process 2023-03-13 12:31:23.855176: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11

error 12:31:25.061: Disposing session as kernel process died ExitCode: undefined, Reason: /home/alexis/.virtualenvs/tf_dev/lib/python3.8/site-packages/traitlets/traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
  warn(
/home/alexis/.virtualenvs/tf_dev/lib/python3.8/site-packages/traitlets/traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '789e8bab-94f1-47b2-9a24-71ee36533b3c' instead of 'b"789e8bab-94f1-47b2-9a24-71ee36533b3c"'.
  warn(
2023-03-13 12:31:10.938909: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2023-03-13 12:31:13.601676: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
[... followed by the same device detection and library loading messages already shown above ...]

info 12:31:25.063: Dispose Kernel process 1169577.

Also, for information, this is what the verification of the cuDNN install (the mnistCUDNN sample) outputs:

Executing: mnistCUDNN
cudnnGetVersion() : 8005 , CUDNN_VERSION from cudnn.h : 8005 (8.0.5)
Host compiler version : GCC 8.4.0

There are 1 CUDA capable devices on your machine :
device 0 : sms 64  Capabilities 8.6, SmClock 1695.0 Mhz, MemSize (Mb) 24247, MemClock 8001.0 Mhz, Ecc=0, boardGroupID=0
Using device 0

Testing single precision
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.027648 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.069632 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.073728 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.115712 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.136192 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.265216 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.052224 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.102400 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.114688 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.119808 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.136192 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.157696 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.9999399 0.0000000 0.0000000 0.0000561 0.0000000 0.0000012 0.0000017 0.0000010 0.0000000 
Loading image data/three_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.037888 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.064512 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.075776 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.076800 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.095232 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.097280 time requiring 178432 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 1433120 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.051200 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.086016 time requiring 1433120 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.091136 time requiring 2450080 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.098304 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.099328 time requiring 4656640 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.108544 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Resulting weights from Softmax:
0.0000000 0.0000000 0.0000000 0.9999288 0.0000000 0.0000711 0.0000000 0.0000000 0.0000000 0.0000000 
Loading image data/five_28x28.pgm
Performing forward propagation ...
Resulting weights from Softmax:
0.0000000 0.0000008 0.0000000 0.0000002 0.0000000 0.9999820 0.0000154 0.0000000 0.0000012 0.0000006 

Result of classification: 1 3 5

Test passed!

Testing half precision (math in single precision)
Loading binary file data/conv1.bin
Loading binary file data/conv1.bias.bin
Loading binary file data/conv2.bin
Loading binary file data/conv2.bias.bin
Loading binary file data/ip1.bin
Loading binary file data/ip1.bias.bin
Loading binary file data/ip2.bin
Loading binary file data/ip2.bias.bin
Loading image data/one_28x28.pgm
Performing forward propagation ...
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: -1.000000 time requiring 184784 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: -1.000000 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: -1.000000 time requiring 2057744 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnFindConvolutionForwardAlgorithm ...
^^^^ CUDNN_STATUS_SUCCESS for Algo 0: 0.039936 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 2: 0.070656 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 7: 0.083968 time requiring 2057744 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 1: 0.084992 time requiring 0 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 5: 0.100352 time requiring 178432 memory
^^^^ CUDNN_STATUS_SUCCESS for Algo 4: 0.108544 time requiring 184784 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 6: -1.000000 time requiring 0 memory
^^^^ CUDNN_STATUS_NOT_SUPPORTED for Algo 3: -1.000000 time requiring 0 memory
Testing cudnnGetConvolutionForwardAlgorithm_v7 ...
Illegal instruction

I guess you ran the precompiled cuDNN sample; I don’t know what compiler options were used for those. Rather, compile the samples yourself. You could also run them in gdb and, once it breaks, use disas to check which illegal instruction is hit and compare it against your cpuid flags.

Oh I am sorry, I think you overestimate my abilities! Would you mind giving me some idea how to do this? :-)
thanks,
Alexis

Run

gdb --args ./mnistCUDNN

then type “run” and hit enter. When it breaks, type “disas”, hit enter, and post the complete output.
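
If you prefer a non-interactive run, the same steps can be scripted (a sketch using Python’s subprocess; assumes gdb is on PATH):

import subprocess

# Batch-mode gdb: run until the crash, then disassemble the faulting function.
subprocess.run(
    ["gdb", "--batch",
     "-ex", "run",
     "-ex", "disas",
     "--args", "./mnistCUDNN"],
    check=False,  # gdb may exit non-zero when the program dies on a signal
)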

This is what I get: 🧐 I have absolutely no idea what it means, though…

Dump of assembler code for function _ZN10cask_cudnn24CutlassConvolutionShaderIN7cutlass4conv6device23ImplicitGemmConvolutionI60cutlass_tensorop_f16_s16816fprop_precomputed_f16_64x64_32x10EEE9ArgumentsC2ERKNS_9Operation11DescriptionE:
   0x00007fff6003a150 <+0>:     push   %r15
   0x00007fff6003a152 <+2>:     movl   $0x0,(%rdi)
   0x00007fff6003a158 <+8>:     mov    $0x1,%eax
   0x00007fff6003a15d <+13>:    movl   $0x0,0x4(%rdi)
   0x00007fff6003a164 <+20>:    movl   $0x0,0x8(%rdi)
   0x00007fff6003a16b <+27>:    push   %r14
   0x00007fff6003a16d <+29>:    movl   $0x0,0xc(%rdi)
   0x00007fff6003a174 <+36>:    movl   $0x0,0x10(%rdi)
   0x00007fff6003a17b <+43>:    movl   $0x0,0x14(%rdi)
   0x00007fff6003a182 <+50>:    push   %r13
   0x00007fff6003a184 <+52>:    movl   $0x0,0x18(%rdi)
   0x00007fff6003a18b <+59>:    movl   $0x0,0x1c(%rdi)
   0x00007fff6003a192 <+66>:    movl   $0x0,0x20(%rdi)
   0x00007fff6003a199 <+73>:    push   %r12
   0x00007fff6003a19b <+75>:    movl   $0x0,0x24(%rdi)
   0x00007fff6003a1a2 <+82>:    movl   $0x0,0x28(%rdi)
   0x00007fff6003a1a9 <+89>:    movl   $0x1,0x2c(%rdi)
   0x00007fff6003a1b0 <+96>:    push   %rbp
   0x00007fff6003a1b1 <+97>:    movl   $0x1,0x30(%rdi)
   0x00007fff6003a1b8 <+104>:   movl   $0x1,0x34(%rdi)
   0x00007fff6003a1bf <+111>:   movl   $0x1,0x38(%rdi)
   0x00007fff6003a1c6 <+118>:   push   %rbx
   0x00007fff6003a1c7 <+119>:   movl   $0x1,0x3c(%rdi)
   0x00007fff6003a1ce <+126>:   movl   $0x1,0x40(%rdi)
   0x00007fff6003a1d5 <+133>:   movl   $0x1,0x44(%rdi)
   0x00007fff6003a1dc <+140>:   movq   $0x0,0x48(%rdi)
   0x00007fff6003a1e4 <+148>:   movl   $0x0,0x50(%rdi)
   0x00007fff6003a1eb <+155>:   movl   $0x0,0x54(%rdi)
   0x00007fff6003a1f2 <+162>:   movl   $0x0,0x58(%rdi)
   0x00007fff6003a1f9 <+169>:   movq   $0x0,0x60(%rdi)
   0x00007fff6003a201 <+177>:   movl   $0x0,0x68(%rdi)
   0x00007fff6003a208 <+184>:   movl   $0x0,0x6c(%rdi)
   0x00007fff6003a20f <+191>:   movl   $0x0,0x70(%rdi)
   0x00007fff6003a216 <+198>:   movq   $0x0,0x78(%rdi)
   0x00007fff6003a21e <+206>:   movl   $0x0,0x80(%rdi)
   0x00007fff6003a228 <+216>:   movl   $0x0,0x84(%rdi)
   0x00007fff6003a232 <+226>:   movl   $0x0,0x88(%rdi)
   0x00007fff6003a23c <+236>:   movq   $0x0,0x90(%rdi)
   0x00007fff6003a247 <+247>:   movl   $0x0,0x98(%rdi)
   0x00007fff6003a251 <+257>:   movl   $0x0,0x9c(%rdi)
   0x00007fff6003a25b <+267>:   movl   $0x0,0xa0(%rdi)
   0x00007fff6003a265 <+277>:   movl   $0x3f800000,0xa8(%rdi)
   0x00007fff6003a26f <+287>:   movl   $0x0,0xac(%rdi)
   0x00007fff6003a279 <+297>:   movq   $0x0,0xb0(%rdi)
   0x00007fff6003a284 <+308>:   movq   $0x0,0xb8(%rdi)
   0x00007fff6003a28f <+319>:   movl   $0x1,0xc0(%rdi)
   0x00007fff6003a299 <+329>:   movzbl 0x340(%rsi),%edx
   0x00007fff6003a2a0 <+336>:   mov    0x3a8(%rsi),%rcx
   0x00007fff6003a2a7 <+343>:   mov    0x3a0(%rsi),%rbp
   0x00007fff6003a2ae <+350>:   mov    0x390(%rsi),%r8
   0x00007fff6003a2b5 <+357>:   cmpq   $0x0,0x300(%rsi)
   0x00007fff6003a2bd <+365>:   cmovne 0x300(%rsi),%rax
   0x00007fff6003a2c5 <+373>:   mov    0x358(%rsi),%r9
   0x00007fff6003a2cc <+380>:   xor    $0x1,%edx
   0x00007fff6003a2cf <+383>:   mov    %rcx,-0x18(%rsp)
   0x00007fff6003a2d4 <+388>:   mov    %rbp,-0x10(%rsp)
   0x00007fff6003a2d9 <+393>:   mov    %r8,-0x8(%rsp)
   0x00007fff6003a2de <+398>:   movzbl %dl,%ebx
   0x00007fff6003a2e1 <+401>:   mov    0x388(%rsi),%rcx
   0x00007fff6003a2e8 <+408>:   mov    0x368(%rsi),%r8
   0x00007fff6003a2ef <+415>:   mov    0x200(%rsi),%rbp
   0x00007fff6003a2f6 <+422>:   mov    0x28(%rsi),%edx
   0x00007fff6003a2f9 <+425>:   mov    %ebx,-0x1c(%rsp)
   0x00007fff6003a2fd <+429>:   mov    0x208(%rsi),%r12
   0x00007fff6003a304 <+436>:   mov    0x158(%rsi),%r10
   0x00007fff6003a30b <+443>:   mov    0x160(%rsi),%r11
   0x00007fff6003a312 <+450>:   mov    0x178(%rsi),%rbx
   0x00007fff6003a319 <+457>:   mov    0x20(%rsi),%r13
   0x00007fff6003a31d <+461>:   mov    0x8(%rsi),%r14
   0x00007fff6003a321 <+465>:   mov    0x10(%rsi),%r15
   0x00007fff6003a325 <+469>:   mov    %edx,(%rdi)
   0x00007fff6003a327 <+471>:   mov    %ebp,0x14(%rdi)
   0x00007fff6003a32a <+474>:   mov    %r9d,0x24(%rdi)
   0x00007fff6003a32e <+478>:   mov    %r8d,0x28(%rdi)
   0x00007fff6003a332 <+482>:   mov    %ecx,0x2c(%rdi)
   0x00007fff6003a335 <+485>:   mov    -0x10(%rsp),%r8d
   0x00007fff6003a33a <+490>:   mov    -0x8(%rsp),%ecx
   0x00007fff6003a33e <+494>:   mov    -0x18(%rsp),%r9d
   0x00007fff6003a343 <+499>:   mov    -0x1c(%rsp),%ebp
   0x00007fff6003a347 <+503>:   mov    %r15d,0x4(%rdi)
   0x00007fff6003a34b <+507>:   mov    %r14d,0x8(%rdi)
   0x00007fff6003a34f <+511>:   mov    %ecx,0x30(%rdi)
   0x00007fff6003a352 <+514>:   mov    %r8d,0x34(%rdi)
   0x00007fff6003a356 <+518>:   mov    %r9d,0x38(%rdi)
   0x00007fff6003a35a <+522>:   mov    %ebp,0x3c(%rdi)
   0x00007fff6003a35d <+525>:   mov    %r13d,0xc(%rdi)
   0x00007fff6003a361 <+529>:   mov    %r12d,0x10(%rdi)
   0x00007fff6003a365 <+533>:   mov    %ebx,0x18(%rdi)
   0x00007fff6003a368 <+536>:   mov    %r11d,0x1c(%rdi)
   0x00007fff6003a36c <+540>:   mov    %r10d,0x20(%rdi)
   0x00007fff6003a370 <+544>:   mov    %eax,0x40(%rdi)
   0x00007fff6003a373 <+547>:   mov    0x50(%rsi),%r12
   0x00007fff6003a377 <+551>:   mov    0x68(%rsi),%rax
   0x00007fff6003a37b <+555>:   mov    0x48(%rsi),%r10
   0x00007fff6003a37f <+559>:   mov    %r12d,0x54(%rdi)
   0x00007fff6003a383 <+563>:   mov    %eax,0x58(%rdi)
   0x00007fff6003a386 <+566>:   mov    %r10d,0x50(%rdi)
   0x00007fff6003a38a <+570>:   mov    0x1a0(%rsi),%rbx
   0x00007fff6003a391 <+577>:   mov    0x198(%rsi),%r13
   0x00007fff6003a398 <+584>:   mov    0x1b8(%rsi),%r11
   0x00007fff6003a39f <+591>:   mov    %ebx,0x6c(%rdi)
   0x00007fff6003a3a2 <+594>:   mov    %r13d,0x68(%rdi)
   0x00007fff6003a3a6 <+598>:   mov    %r11d,0x70(%rdi)
   0x00007fff6003a3aa <+602>:   mov    0x260(%rsi),%r14
   0x00007fff6003a3b1 <+609>:   mov    0x248(%rsi),%r15
   0x00007fff6003a3b8 <+616>:   mov    0x240(%rsi),%rdx
   0x00007fff6003a3bf <+623>:   mov    %edx,0x80(%rdi)
   0x00007fff6003a3c5 <+629>:   mov    %r15d,0x84(%rdi)
   0x00007fff6003a3cc <+636>:   mov    %r14d,0x88(%rdi)
   0x00007fff6003a3d3 <+643>:   mov    0x260(%rsi),%r8
   0x00007fff6003a3da <+650>:   mov    0x248(%rsi),%r9
   0x00007fff6003a3e1 <+657>:   mov    0x240(%rsi),%rcx
   0x00007fff6003a3e8 <+664>:   pop    %rbx
   0x00007fff6003a3e9 <+665>:   mov    %r8d,0xa0(%rdi)
   0x00007fff6003a3f0 <+672>:   mov    %ecx,0x98(%rdi)
   0x00007fff6003a3f6 <+678>:   mov    %r9d,0x9c(%rdi)
=> 0x00007fff6003a3fd <+685>:   vmovsd 0x2d0(%rsi),%xmm0
   0x00007fff6003a405 <+693>:   vmovsd 0x2b0(%rsi),%xmm1
   0x00007fff6003a40d <+701>:   pop    %rbp
   0x00007fff6003a40e <+702>:   vcvtpd2ps %xmm0,%xmm2
   0x00007fff6003a412 <+706>:   pop    %r12
   0x00007fff6003a414 <+708>:   vcvtpd2ps %xmm1,%xmm3
   0x00007fff6003a418 <+712>:   pop    %r13
   0x00007fff6003a41a <+714>:   pop    %r14
   0x00007fff6003a41c <+716>:   pop    %r15
   0x00007fff6003a41e <+718>:   vmovss %xmm2,0xac(%rdi)
   0x00007fff6003a426 <+726>:   vmovss %xmm3,0xa8(%rdi)
   0x00007fff6003a42e <+734>:   retq   
End of assembler dump.
(gdb)

This means that your CPU doesn’t support the vmovsd instruction, which is from the SSE2 feature set. That was introduced with the Pentium 4, so what kind of CPU are you using?

Looking again at the disassembly, the mnistCUDNN sample was compiled with the AVX feature set (the vmovsd here is the VEX/AVX encoding), so it needs to be recompiled to work on your CPU.
This is likely unrelated to your other issue, though:

error 12:31:25.061: Disposing session as kernel process died ExitCode: undefined, Reason: /home/alexis/.virtualenvs/tf_dev/lib/python3.8/site-packages/traitlets/traitlets.py:2548: FutureWarning: Supporting extra quotes around strings is deprecated in traitlets 5.0. You can use 'hmac-sha256' instead of '"hmac-sha256"' if you require traitlets >=5.
  warn(
/home/alexis/.virtualenvs/tf_dev/lib/python3.8/site-packages/traitlets/traitlets.py:2499: FutureWarning: Supporting extra quotes around Bytes is deprecated in traitlets 5.0. Use '789e8bab-94f1-47b2-9a24-71ee36533b3c' instead of 'b"789e8bab-94f1-47b2-9a24-71ee36533b3c"'.
  warn(

This looks more like a broken/incompatible TensorFlow install:
https://github.com/microsoft/vscode-jupyter/issues/5963

This is what I have (lscpu output):

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   44 bits physical, 48 bits virtual
CPU(s):                          80
On-line CPU(s) list:             0-79
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s):                       4
NUMA node(s):                    4
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           47
Model name:                      Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
Stepping:                        2
CPU MHz:                         1065.320
CPU max MHz:                     2400.0000
CPU min MHz:                     1064.0000
BogoMIPS:                        4794.35
Virtualization:                  VT-x
L1d cache:                       1.3 MiB
L1i cache:                       1.3 MiB
L2 cache:                        10 MiB
L3 cache:                        120 MiB
NUMA node0 CPU(s):               0-9,40-49
NUMA node1 CPU(s):               10-19,50-59
NUMA node2 CPU(s):               20-29,60-69
NUMA node3 CPU(s):               30-39,70-79
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht 
                                 tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid 
                                 aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic
                                  popcnt aes lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm ida arat flush_l1d

Thanks - I’ll look into this one!

@generix So - I ran a simple U-Net as a script (no VS Code or Jupyter notebook this time), and I still get an illegal instruction. I am using TF 2.4.4, which I compiled without AVX or AVX2 instructions. Do you see what is going wrong? This is the gdb output:

Thread 262 "python3" received signal SIGILL, Illegal instruction.
[Switching to Thread 0x7ffa1cff9700 (LWP 16256)]
0x00007ff2fa14055d in cask_cudnn::CutlassConvolutionShader<cutlass::conv::device::ImplicitGemmConvolution<cutlass_tensorop_f16_s16816fprop_precomputed_f16_64x256_32x4> >::Arguments::Arguments(cask_cudnn::Operation::Description const&) ()
   from /usr/local/cuda-11.1/lib64/libcudnn_cnn_infer.so.8
(gdb) disas
Dump of assembler code for function _ZN10cask_cudnn24CutlassConvolutionShaderIN7cutlass4conv6device23ImplicitGemmConvolutionI60cutlass_tensorop_f16_s16816fprop_precomputed_f16_64x256_32x4EEE9ArgumentsC2ERKNS_9Operation11DescriptionE:
   [... identical instruction sequence to the previous dump, loaded at a different base address ...]
=> 0x00007ff2fa14055d <+685>:   vmovsd 0x2d0(%rsi),%xmm0
   0x00007ff2fa140565 <+693>:   vmovsd 0x2b0(%rsi),%xmm1
   0x00007ff2fa14056d <+701>:   pop    %rbp
   0x00007ff2fa14056e <+702>:   vcvtpd2ps %xmm0,%xmm2
   0x00007ff2fa140572 <+706>:   pop    %r12
   0x00007ff2fa140574 <+708>:   vcvtpd2ps %xmm1,%xmm3
   0x00007ff2fa140578 <+712>:   pop    %r13
   0x00007ff2fa14057a <+714>:   pop    %r14
   0x00007ff2fa14057c <+716>:   pop    %r15
   0x00007ff2fa14057e <+718>:   vmovss %xmm2,0xac(%rdi)
   0x00007ff2fa140586 <+726>:   vmovss %xmm3,0xa8(%rdi)
   0x00007ff2fa14058e <+734>:   retq   
End of assembler dump.

Looks like libcudnn_cnn_infer.so.8 is compiled with AVX support. Please check whether this really belongs to cuDNN 8.0.5.
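
In case you want to check a library’s instruction set yourself, a crude scan like this works (a sketch assuming binutils’ objdump is installed; it only greps for a few VEX-encoded mnemonics, which is enough to flag AVX):

import subprocess

LIB = "/usr/local/cuda-11.1/lib64/libcudnn_cnn_infer.so.8"  # adjust to your install

# vmovsd/vmovss/vcvtpd2ps are VEX (AVX) encodings; the SSE versions are
# spelled movsd/movss/cvtpd2ps, so any hit means the library needs AVX.
AVX_MNEMONICS = ("vmovsd", "vmovss", "vcvtpd2ps")

proc = subprocess.Popen(["objdump", "-d", LIB], stdout=subprocess.PIPE, text=True)
hits = sum(1 for line in proc.stdout
           if any("\t" + m in line for m in AVX_MNEMONICS))
proc.wait()
print(f"{hits} AVX-encoded instructions found in {LIB}")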

It seems so. The only occurrences are the ones copied from the file I downloaded from NVIDIA, “cudnn-11.1-linux-x64-v8.0.5.39.tar”:

$ locate libcudnn_cnn_infer.so.8
/local/filespace/workspace/alexis/cudnn_download/cuda/lib64/libcudnn_cnn_infer.so.8
/local/filespace/workspace/alexis/cudnn_download/cuda/lib64/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5

I just downloaded all cuDNN 8.0.5 versions and checked the feature set used. Contrary to NVIDIA’s documentation, all CUDA 11 versions are compiled with AVX; only the CUDA 10 versions are AVX-less.

Even 8.0.2 for CUDA 11 is compiled with AVX. So I guess you’re out of luck, and NVIDIA should update their docs.

Thanks - that’s sad, and I suppose there is no chance they could provide a version that works without AVX…
Thanks for all your help, though.
Alexis

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.