Help appreciated: Trouble setting up CUDA 11.2 on Windows 10 (3090)

Hello all, I need some advice, and hopefully this is the right place to ask! I’m a new developer and recently tried to set up my first GPU (an RTX 3090) for my deep learning rig.

I’m excited to get going with some TensorFlow tasks on the 3090, but I’m having difficulty with the GPU/CUDA/cuDNN setup.

My OS is Windows 10 Education.

So far, I’ve done the following:

  • Downloaded and installed the latest version of CUDA (v11.2)

    • In Control Panel > System > Advanced > Environment Variables > System Variables, I’ve set the system path to include:
      • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin
      • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\libnvvp
      • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\extras\CUPTI\lib64
      • C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include
  • Downloaded the latest cuDNN files (v8.1) and copied their contents to the respective /lib, /include, and /bin directories
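
For anyone who wants to double-check the environment variable step above, here is a minimal Python sketch. It assumes the four v11.2 directories listed in the post and only reports whether each one is listed on PATH and exists on disk. The names check_path_entries, CUDA_ROOT, and EXPECTED_DIRS are made up for this example and are not part of any NVIDIA tool.

```python
import os

# Directories taken from the post above; adjust the version folder if yours differs.
CUDA_ROOT = r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2"
EXPECTED_DIRS = [
    CUDA_ROOT + r"\bin",
    CUDA_ROOT + r"\libnvvp",
    CUDA_ROOT + r"\extras\CUPTI\lib64",
    CUDA_ROOT + r"\include",
]


def _normalize(path):
    # Windows paths are case-insensitive and may be quoted or end with a separator.
    return path.strip().strip('"').rstrip("\\/").lower()


def check_path_entries(expected, path_value, sep=";"):
    """Return {directory: (listed_on_path, exists_on_disk)} for each expected directory."""
    on_path = {_normalize(p) for p in path_value.split(sep) if p.strip()}
    return {d: (_normalize(d) in on_path, os.path.isdir(d)) for d in expected}


if __name__ == "__main__":
    report = check_path_entries(EXPECTED_DIRS, os.environ.get("PATH", ""))
    for directory, (listed, exists) in report.items():
        print(f"{directory}\n  on PATH: {listed}, exists on disk: {exists}")
```

Run it from a freshly opened Command Prompt, since a prompt that was already open before the variables were edited will not see the new PATH.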

To check the installation, I ran nvcc -V in Command Prompt, which returned the expected version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Jan_28_19:41:49_Pacific_Standard_Time_2021
Cuda compilation tools, release 11.2, V11.2.142
Build cuda_11.2.r11.2/compiler.29558016_0

However, when I try to confirm that my GPU is recognized by running deviceQuery.exe in the provided C:\ProgramData\NVIDIA Corporation\CUDA Samples\v11.2\bin\win64\Debug directory, a window flashes for an instant and closes before I can read it. From the docs, I expected a window listing my GPU specs, confirming that CUDA can see my GPU.

Does this mean that my GPU is not visible? I’m a bit worried that the device query doesn’t behave as expected.

Any advice is greatly appreciated. Thank you in advance!

You need to build deviceQuery using Visual Studio, then locate the exe in /release (not /debug), open cmd inside that directory, and type deviceQuery.exe

similar question here

The pre-built deviceQuery.exe that ships with CUDA is a console application, at least up to CUDA 11.1. You would want to run it from a command prompt. If you are running it from within an IDE, you may have to configure the IDE so that it does not close the console output window when the app terminates. For reference, here is the output I get when I run the executable delivered with CUDA 11.1 from a command prompt on my Windows system (note that the “zu bytes” output appears to be a bug in how the pre-built application handles the %zu format specifier):

C:\>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\extras\demo_suite\deviceQuery.exe"
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\extras\demo_suite\deviceQuery.exe Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Quadro RTX 4000"
  CUDA Driver Version / Runtime Version          11.2 / 11.1
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1545 MHz (1.54 GHz)
  Memory Clock rate:                             6501 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               zu bytes
  Total amount of shared memory per block:       zu bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          zu bytes
  Texture alignment:                             zu bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 101 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

Device 1: "Quadro P2000"
  CUDA Driver Version / Runtime Version          11.2 / 11.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 5120 MBytes (5368709120 bytes)
  ( 8) Multiprocessors, (128) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              160-bit
  L2 Cache Size:                                 1310720 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               zu bytes
  Total amount of shared memory per block:       zu bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          zu bytes
  Texture alignment:                             zu bytes
  Concurrent copy and kernel execution:          Yes with 5 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 23 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, CUDA Runtime Version = 11.1, NumDevs = 2, Device0 = Quadro RTX 4000, Device1 = Quadro P2000
Result = PASS
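
If you want to check the result from a script rather than by eye, the last two lines of the output above can be parsed. This is only a sketch based on the sample output shown in this post; parse_device_query_summary is a hypothetical helper name, and the layout of the summary line may differ in other CUDA versions.

```python
def parse_device_query_summary(output):
    """Extract the 'key = value' summary fields and the final result from deviceQuery output."""
    summary, result = {}, None
    for line in output.splitlines():
        line = line.strip()
        if line.startswith("deviceQuery,"):
            # Skip the leading program name, then split each "key = value" field.
            for field in line.split(",")[1:]:
                key, _, value = field.partition("=")
                summary[key.strip()] = value.strip()
        elif line.startswith("Result ="):
            result = line.split("=", 1)[1].strip()
    return summary, result


# The two summary lines copied from the output above.
sample = (
    "deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.2, "
    "CUDA Runtime Version = 11.1, NumDevs = 2, "
    "Device0 = Quadro RTX 4000, Device1 = Quadro P2000\n"
    "Result = PASS\n"
)
summary, result = parse_device_query_summary(sample)
print(summary["NumDevs"], result)  # prints: 2 PASS
```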

Thank you so much, this did the trick! It turns out my issue was trying to execute deviceQuery from WSL, but cmd ran just fine! Thank you for the thorough answer!

Thank you for the response! This might be different for CUDA 11… deviceQuery.exe definitely shipped in the Debug directory. Thank you again for the suggestion!

Correct, the location in which NVIDIA places deviceQuery.exe keeps on changing between versions. But it has always been (and I would expect it to continue to be) a console app on Windows, and therefore is best invoked from a Windows command prompt.
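
Because it is a console app, another way to keep the output from vanishing is to launch it from a script and capture what it prints. Below is a rough Python sketch using the standard subprocess module; run_console_app is a hypothetical helper name, and the demo uses the Python interpreter itself as a stand-in so the snippet runs anywhere. To use it for real, pass the full path to deviceQuery.exe for your CUDA version instead.

```python
import subprocess
import sys


def run_console_app(command, timeout=60):
    """Run a console program and return (exit_code, combined stdout/stderr text)."""
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # fold error messages into the same text stream
        text=True,
        timeout=timeout,
    )
    return completed.returncode, completed.stdout


if __name__ == "__main__":
    # Stand-in command for the demo; replace the list with [r"<full path to deviceQuery.exe>"].
    code, text = run_console_app(
        [sys.executable, "-c", "print('hello from a console app')"]
    )
    print(f"exit code: {code}")
    print(text)
```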