CUDA 5.0 - cudaGetDeviceProperties returns very strange grid size

Hi guys,

I’m just starting to learn CUDA and was really surprised that, contrary to what I found on the Internet, my card (GeForce GTX 660M) supports some insane grid sizes (2147483647 x 65535 x 65535). Please take a look at the following results I’m getting from deviceQuery.exe provided with the toolkit:

c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: “GeForce GTX 660M”
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M

I was curious enough to write a simple program testing if it’s possible to use more than 65535 blocks in the first dimension of the grid, but it crashes confirming what I found on the Internet (or, to be more precise, works fine for 65535 blocks and crashes for 65536).

So my question is: is cudaGetDeviceProperties returning rubbish values or am I doing something wrong?


I think you’re doing something wrong. I’ve successfully tested grids on sm_3x devices with grid.x > 65535.

Perhaps your kernel is getting killed by the Windows driver’s watchdog timer because it’s running for more than a few seconds?

Or perhaps your kernel is indexing based on the large grid values and generating a memory exception?

Thanks a lot for your quick reply.

I must have gotten something wrong then. My program is extremely simple and basically just adds two vectors. It definitely doesn’t take even a second to run (and that’s including all cudaMallocs and cudaMemcpys). Am I missing something obvious here? Please find my source below:

#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>

#pragma comment(lib, "cudart") 

typedef struct 
	float *content;
	const unsigned int size;
} pjVector_t;

__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
	unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
	resultVector[index] = firstVector[index] + secondVector[index];

int main(void)
	const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
	//const unsigned int vectorLength = 67108864; // 1024 * 65536 - crashes
	const unsigned int vectorSize = sizeof(float) * vectorLength;
	int threads = 0;
	unsigned int blocks = 0;
	cudaDeviceProp deviceProperties;

	cudaGetDeviceProperties(&deviceProperties, 0);

	threads = deviceProperties.maxThreadsPerBlock;
	blocks = (unsigned int)ceil(vectorLength / (double)threads);

	pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
	pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
	pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };

	float *d_firstVector;
	float *d_secondVector;
	float *d_resultVector;

	cudaMalloc((void **)&d_firstVector, vectorSize);
	cudaMalloc((void **)&d_secondVector, vectorSize);
	cudaMalloc((void **)&d_resultVector, vectorSize);

	for (unsigned int i = 0; i < vectorLength; i++)
		firstVector.content[i] = 1.0f;
		secondVector.content[i] = 2.0f;

	cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
	cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
	AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
	cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);

	for (unsigned int i = 0; i < vectorLength; i++)
		if(resultVector.content[i] != 3.0f)


			printf("Error under index: %i\n", i);

			return 1;



	printf("Everything ok!\n");

	return 0;


How does the app “crash”? What happens when you add error status checks to every CUDA API call and every kernel launch?

The code runs on a K20c and a GTX 680 (4GB) with both vectorLength sizes. I looked at the code and couldn’t spot any obvious problem.

As @njuffa suggests, all those error codes are worth checking.

You’re allocating a fair amount of memory (at least 786,420 MB) but your 660M should have most of 2GB free.

You might also want to check you don’t have an zombie instance of your app running in the background (I’ve had that happen before).

Thanks for your replies.

You’re right @njuffa, I should have clarified what it means “crashes”. When I run it from Visual Studio in debug mode (67108864 vector), the last cudaMemcpy always fills my resultVector with seemingly random data (very close to 0 if it matters) so that the result doesn’t pass the final validation. Where it actually seems like it crashes is in the profiler, which returns following error message:

2 events, 0 metrics and 0 source-level metrics were not associated with the kernels and will not be displayed

As a result, profiler measures only cudaMalloc and cudaMemcpy operations and doesn’t even show the kernel execution.

As per error status checking (and I’m not sure I’m doing it right - apologies if so), cudaPeekAtLastError function returns cudaErrorInvalidValue(11) error. All other operations (cudaMalloc and cudaMemcpy) return cudaSuccess(0).

I hope I shed some more light on my problem but please let me know if you have any further questions.



Thanks for the zombie instance hint - I checked and unfortunately there is none. Also, thanks for running my code on your hardware, much appreciated.


From the symptoms you describe, it seems the kernel in question either never executed, or died, and thus the final device->host copy copies back garbage, and the profiler has no record for this kernel since it never ran to completion.

In conjunction with allanmac’s observations it seems now would be a good time to add 100% error check coverage to this code. The resulting error message(s) should let you pinpoint what the issue is.

I finally fixed it with a lot of help from stackoverflow ( As it turns out the code itself was ok, the only problem was the code generation flags I didn’t change. Visual Studio quite confusingly sets them initially to:


while for my card, more appropriate option would be


Thanks for all your help,

What’s actually interesting, since I’ve been using compute profile 1.0, how was it possible to create 1024 threads per block? Have I missed something?

Interesting answer on StackOverflow!

I had always (wrongly) assumed that the special registers %tid, %ntid, %ctaid and %nctaid would entirely isolate kernels from architectural CTA-size differences.

Looking at the PTX manual reveals that “legacy code” will use a “mov.u16” to access the low 16 bits of these special registers.

Compiling for sm_10 and dumping the PTX (or SASS) verifies that 16-bit mov’s are used:

$> nvcc -m 32 -arch sm_10 -ptx
	mov.u16 	%rh1, %ctaid.x;
	mov.u16 	%rh2, %ntid.x;
	mul.wide.u16 	%r1, %rh1, %rh2;
	cvt.u32.u16 	%r2, %tid.x;
	add.u32 	%r3, %r2, %r1;
	mul.lo.u32 	%r4, %r3, 4;

1024 thread blocks continued to work because “mov.u16 %rh1,%ntid.x;” easily fits into 16 bits. The failure is in the legacy 16-bit move of “%ctaid.x”.

The descriptions for %ntid and %ctaid are at the very end of the PTX manual.