deviceQuery zu

Sorry for posting in what is clearly the wrong place, but I can’t seem to find what the right place would be.

When running deviceQuery, it shows “zu bytes” for “Total amount of shared memory per block” (along with a couple of other fields), and it has in fact been doing so in every version since v10.1.

I doubt I’m the first person to notice, but perhaps no one else can figure out where to report it either. Presumably someone updated the code to allow for larger values and mucked up the printf format string.

Moved to CUDA forum

I’ve built and run the deviceQuery sample from CUDA 11.4 and it doesn’t show that. This is what I see:

$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 4 CUDA Capable device(s)

Device 0: "Tesla V100-PCIE-32GB"
  CUDA Driver Version / Runtime Version          11.4 / 11.4
  CUDA Capability Major/Minor version number:    7.0
  Total amount of global memory:                 32510 MBytes (34089730048 bytes)
  (080) Multiprocessors, (064) CUDA Cores/MP:    5120 CUDA Cores
  GPU Max Clock rate:                            1380 MHz (1.38 GHz)
  Memory Clock rate:                             877 Mhz
  Memory Bus Width:                              4096-bit
  L2 Cache Size:                                 6291456 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total shared memory per multiprocessor:        98304 bytes
   ...

The problem may be specific to something you are doing. Furthermore, when I look at the relevant line in the public repo, I don’t see anything that looks out of order there.
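
For reference, the print in question boils down to something like the sketch below (paraphrased, not a verbatim copy of the repo). deviceProp.sharedMemPerBlock and deviceProp.totalConstMem are size_t, so %zu is the right specifier, and a minimal standalone repro is just a few lines:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Query device 0, the same call deviceQuery makes for each device.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    // Both fields are size_t; %zu is the correct C99 specifier for size_t.
    printf("Total amount of constant memory:         %zu bytes\n",
           prop.totalConstMem);
    printf("Total amount of shared memory per block: %zu bytes\n",
           prop.sharedMemPerBlock);
    return 0;
}

Compiled with nvcc on Linux, that prints the byte counts as expected.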

This used to be a common problem with older versions of MSVC on Windows. While %zu is the correct printf format specifier for printing size_t (since ISO C99, 23 years ago), MSVC did not support this format for many years.

If the code is built with MSVC 2019 (or newer), the issue should disappear. The source code for deviceQuery should be contained in the CUDA installation package.
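
If a rebuild with a newer compiler isn’t an option, the usual portability workaround is to avoid %zu entirely and widen the value to a type with a universally supported specifier. A minimal sketch (generic advice, not what the shipped sample does):

#include <cstdio>
#include <cstddef>

int main() {
    size_t shmem = 49152;
    // Widening to unsigned long long and printing with %llu works even on
    // C runtimes that predate %zu support.
    printf("Total amount of shared memory per block: %llu bytes\n",
           (unsigned long long)shmem);
    // Legacy MSVC runtimes also accept the vendor-specific %Iu for size_t:
    // printf("Total amount of shared memory per block: %Iu bytes\n", shmem);
    return 0;
}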

I’m running the pre-built version that ships with the various Windows installers. Here’s the output from 11.6:

Detected 1 CUDA Capable device(s)

Device 0: "NVIDIA GeForce RTX 2070"
  CUDA Driver Version / Runtime Version          11.5 / 11.6
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 8192 MBytes (8589606912 bytes)
  (36) Multiprocessors, ( 64) CUDA Cores/MP:     2304 CUDA Cores
  GPU Max Clock rate:                            1440 MHz (1.44 GHz)
  Memory Clock rate:                             7001 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 4194304 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               zu bytes
  Total amount of shared memory per block:       zu bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          zu bytes
  Texture alignment:                             zu bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.5, CUDA Runtime Version = 11.6, NumDevs = 1, Device0 = NVIDIA GeForce RTX 2070
Result = PASS

I must be expressing myself poorly today. The suggestion was to build the app yourself from source code, using MSVC 2019 or newer.

If you wish to report to NVIDIA that they should use an appropriate version of MSVC to compile the shipped executable, you can file a bug report with them. There is a pinned post “How to report a bug” at the top of this forum.

Downloading the source and building it with VS2022 does indeed produce the correct results.
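
(For anyone else hitting this: assuming the GitHub cuda-samples layout, where helper_cuda.h lives in Common/ three directories up from the sample, a build from a VS2022 developer command prompt can be as simple as

nvcc -I ..\..\..\Common deviceQuery.cpp -o deviceQuery.exe

with paths adjusted to your checkout. nvcc picks up the installed MSVC as the host compiler, so the resulting binary uses a C runtime that understands %zu.)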

I’ll open a bug to fix this in the released builds.