Sorry for posting in what is clearly the wrong place, but I can’t seem to find what the right place would be.
When running deviceQuery, it shows “zu bytes” for “Total amount of shared memory per block” (along with a couple of other fields), and it has in fact done so in every version since v10.1.
I doubt I’m the first person to notice, but perhaps no one else can figure out where to report it either. Presumably someone updated the code to allow for larger values and mucked up the printf format string.
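For reference, I'd expect the offending line to be something along these lines (just my reconstruction of what the sample probably does, not the actual source):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // sharedMemPerBlock is a size_t, so the format specifier has to match it.
    printf("  Total amount of shared memory per block:       %zu bytes\n",
           prop.sharedMemPerBlock);
    return 0;
}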
I’ve built and run the device query sample from CUDA 11.4 and it doesn’t show that. This is what I see:
$ /usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 4 CUDA Capable device(s)
Device 0: "Tesla V100-PCIE-32GB"
CUDA Driver Version / Runtime Version 11.4 / 11.4
CUDA Capability Major/Minor version number: 7.0
Total amount of global memory: 32510 MBytes (34089730048 bytes)
(080) Multiprocessors, (064) CUDA Cores/MP: 5120 CUDA Cores
GPU Max Clock rate: 1380 MHz (1.38 GHz)
Memory Clock rate: 877 Mhz
Memory Bus Width: 4096-bit
L2 Cache Size: 6291456 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total shared memory per multiprocessor: 98304 bytes
...
The problem may be specific to something you are doing. When I look at the relevant line in the public repo, I also don’t see anything that looks out of order there.
This used to be a common problem with older versions of MSVC on Windows. While %zu is the correct printf format specifier for printing size_t (since ISO C99, 23 years ago), MSVC did not support this format for many years. If the code is built with MSVC 2019 (or newer), the issue should disappear. The source code for deviceQuery should be contained in the CUDA installation package.
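Here is a minimal sketch of the portability issue (my own illustration, not code from the sample):

#include <cstdio>
#include <cstddef>

int main() {
    size_t shmem = 49152;  // example value, similar to sharedMemPerBlock on many GPUs

    // C99 and later: %zu is the portable conversion for size_t.
    printf("Total amount of shared memory per block: %zu bytes\n", shmem);

    // Older Microsoft C runtimes (before the Universal CRT) did not recognize
    // the 'z' length modifier, so the call above could end up emitting the
    // literal "zu" instead of the value -- which matches the "zu bytes"
    // symptom shown above. The Microsoft-specific workaround of that era
    // was the 'I' length modifier:
#ifdef _MSC_VER
    printf("Total amount of shared memory per block: %Iu bytes\n", shmem);
#endif
    return 0;
}

Built with MSVC 2019 or newer (or any recent GCC/Clang), the %zu line prints the number as expected.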
I’m running the pre-built version that downloaded with the various Windows installs. Here’s the output from 11.6:
Detected 1 CUDA Capable device(s)
Device 0: "NVIDIA GeForce RTX 2070"
CUDA Driver Version / Runtime Version 11.5 / 11.6
CUDA Capability Major/Minor version number: 7.5
Total amount of global memory: 8192 MBytes (8589606912 bytes)
(36) Multiprocessors, ( 64) CUDA Cores/MP: 2304 CUDA Cores
GPU Max Clock rate: 1440 MHz (1.44 GHz)
Memory Clock rate: 7001 Mhz
Memory Bus Width: 256-bit
L2 Cache Size: 4194304 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: zu bytes
Total amount of shared memory per block: zu bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 1024
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: zu bytes
Texture alignment: zu bytes
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device supports Compute Preemption: Yes
Supports Cooperative Kernel Launch: Yes
Supports MultiDevice Co-op Kernel Launch: No
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 11.5, CUDA Runtime Version = 11.6, NumDevs = 1, Device0 = NVIDIA GeForce RTX 2070
Result = PASS
I must be expressing myself poorly today. The suggestion was to build the app yourself from source code, using MSVC 2019 or newer.
If you wish to report to NVIDIA that they should use an appropriate version of MSVC to compile the shipped executable, you can file a bug report with them. There is a pinned post “How to report a bug” at the top of this forum.
Downloading the source and building with vs2022 does indeed produce the correct results.
I’ll open a bug to fix this in the released builds.