I’m just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I’m getting from deviceQuery.exe provided with the toolkit:
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: “GeForce GTX 660M”
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 MHz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
I was curious enough to write a simple program to test whether it’s possible to use more than 65535 blocks in the first dimension of the grid, but it crashes, confirming what I found on the Internet (or, to be more precise, it works fine with 65535 blocks and crashes with 65536).
So my question is: is cudaGetDeviceProperties returning rubbish values or am I doing something wrong?
I must have gotten something wrong then. My program is extremely simple and basically just adds two vectors. It definitely doesn’t take even a second to run (and that includes all the cudaMalloc and cudaMemcpy calls). Am I missing something obvious here? Please find my source below:
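Condensed to its essentials, it looks like this (a simplified sketch of the structure; identifier names such as addKernel are illustrative, not necessarily the original ones):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addKernel(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 67108864;                 // 2^26 elements
    const size_t bytes = n * sizeof(float);

    float *hostA        = (float *)malloc(bytes);
    float *hostB        = (float *)malloc(bytes);
    float *resultVector = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hostA[i] = 1.0f; hostB[i] = 2.0f; }

    float *devA, *devB, *devC;
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);
    cudaMalloc(&devC, bytes);
    cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devB, hostB, bytes, cudaMemcpyHostToDevice);

    // 1024 threads per block -> 65536 blocks in grid.x,
    // one past the 65535 limit reported all over the Internet
    const int threadsPerBlock = 1024;
    const int numBlocks = n / threadsPerBlock;
    addKernel<<<numBlocks, threadsPerBlock>>>(devA, devB, devC, n);

    cudaMemcpy(resultVector, devC, bytes, cudaMemcpyDeviceToHost);

    // final validation
    int errors = 0;
    for (int i = 0; i < n; ++i)
        if (resultVector[i] != 3.0f) ++errors;
    printf("%s (%d mismatches)\n", errors ? "FAILED" : "PASSED", errors);

    cudaFree(devA); cudaFree(devB); cudaFree(devC);
    free(hostA); free(hostB); free(resultVector);
    return 0;
}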
You’re right @njuffa, I should have clarified what I mean by “crashes”. When I run it from Visual Studio in debug mode (with a 67108864-element vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0, if that matters), so the result doesn’t pass the final validation. Where it actually does seem to crash is in the profiler, which reports the following error message:
2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed
As a result, the profiler measures only the cudaMalloc and cudaMemcpy operations and doesn’t even show the kernel execution.
As for error status checking (and I’m not sure I’m doing it right; apologies if not), cudaPeekAtLastError returns the cudaErrorInvalidValue (11) error. All the other operations (cudaMalloc and cudaMemcpy) return cudaSuccess (0).
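Concretely, what I do right after the launch in the code above is along these lines (sketch; the cudaDeviceSynchronize call is my attempt to also catch errors that only surface during execution):

addKernel<<<numBlocks, threadsPerBlock>>>(devA, devB, devC, n);
// cudaPeekAtLastError reports launch errors without clearing the error flag
cudaError_t launchErr = cudaPeekAtLastError();
// cudaDeviceSynchronize reports errors that occur while the kernel runs
cudaError_t syncErr = cudaDeviceSynchronize();
printf("launch: %s, execution: %s\n",
       cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));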
I hope I shed some more light on my problem, but please let me know if you have any further questions.
From the symptoms you describe, it seems the kernel in question either never executed or died prematurely, so the final device->host copy brings back garbage, and the profiler has no record of the kernel since it never ran to completion.
In conjunction with allanmac’s observations, it seems now would be a good time to add 100% error-check coverage to this code. The resulting error message(s) should let you pinpoint what the issue is.
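A common way to get that coverage is to wrap every runtime API call in a checking macro, along these lines (a sketch; the macro name is arbitrary):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CHECK_CUDA(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            fprintf(stderr, "CUDA error %d (%s) at %s:%d\n", (int)err_,     \
                    cudaGetErrorString(err_), __FILE__, __LINE__);          \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Then every runtime call in the program becomes, e.g.:
//   CHECK_CUDA(cudaMalloc(&devA, bytes));
//   CHECK_CUDA(cudaPeekAtLastError());    // right after the kernel launch
//   CHECK_CUDA(cudaDeviceSynchronize());  // to catch execution-time errors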
I had always (wrongly) assumed that the special registers %tid, %ntid, %ctaid and %nctaid would entirely isolate kernels from architectural CTA-size differences.
Looking at the PTX manual reveals that “legacy code” will use a “mov.u16” to access the low 16 bits of these special registers.
Compiling for sm_10 and dumping the PTX (or SASS) verifies that 16-bit movs are used:
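For the usual global-index computation, the sm_10 PTX contains something like this (a representative excerpt; exact register numbering varies):

mov.u16      %rh1, %ctaid.x;     // only the low 16 bits of blockIdx.x
mov.u16      %rh2, %ntid.x;      // only the low 16 bits of blockDim.x
mul.wide.u16 %r1, %rh1, %rh2;
cvt.u32.u16  %r2, %tid.x;
add.u32      %r3, %r1, %r2;      // blockIdx.x * blockDim.x + threadIdx.x

So a binary built for sm_10 is limited to 16-bit grid dimensions, and the runtime rejects a 65536-block launch, which matches the cudaErrorInvalidValue reported above. Building for the card’s actual architecture (e.g. nvcc -arch=sm_30) emits full 32-bit reads of %ctaid.x, and the 2147483647-block grid.x limit that deviceQuery reports becomes usable.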