launch terminates in cudaDeviceSynchronize() after timeout

I’m trying to run a simple program with a 3-dimensional grid, but when I launch it under cuda-memcheck it gets stuck and is terminated after the timeout. A short timeout isn’t the cause: I raised it to 60 seconds specifically to rule that out.

The code launches an empty __global__ function on a 45x1575x1575 grid. Additional info: my compute capability is 2.1, and I compile with the flag -maxrregcount=24 to limit the number of registers the device functions can use (in another program of mine, this gave the best results with the occupancy calculator).
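For a sense of scale, the number of thread blocks in that grid can be worked out on the host. This is only a sketch; the helper name totalBlocks is hypothetical and not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Total number of thread blocks in a grid (hypothetical host-side helper).
static unsigned long long totalBlocks(dim3 grid) {
    return (unsigned long long)grid.x * grid.y * grid.z;
}

int main() {
    dim3 gridSize(45, 1575, 1575);  // grid from the question
    // 45 * 1575 * 1575 = 111628125 blocks
    printf("blocks in grid: %llu\n", totalBlocks(gridSize));
    return 0;
}
```

Roughly 111.6 million blocks have to be scheduled even though the kernel body is empty, which may be part of why a debug build under cuda-memcheck takes so long.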

Here’s my code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

// Empty kernel: it does no work at all.
__global__ void stam(int a){
}

int main()
{
    // Choose which GPU to run on, change this on a multi-GPU system.
    cudaError_t cudaStatus = cudaSetDevice(0);
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
        return 1;
    }

    dim3 gridSize(45,1575,1575);
    // The launch line is missing from the post as captured; a block size of 1
    // and an argument of 0 are assumptions, not the original values.
    stam<<<gridSize, 1>>>(0);

    cudaStatus = cudaDeviceSynchronize(); // This function gets stuck
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSynchronize failed!");
        return 1;
    }

    cudaStatus = cudaDeviceReset();
    if (cudaStatus != cudaSuccess) {
        fprintf(stderr, "cudaDeviceReset failed!");
        return 1;
    }

    return 0;
}

Isn’t the max grid size 65535x65535x65535? What is the problem here?
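One way to confirm the grid limits rather than rely on memory is to read them from the device at run time. A minimal sketch, assuming device 0 as in the question; gridFits is a hypothetical helper, while cudaGetDeviceProperties and the maxGridSize field are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// True when every grid dimension is within the given per-dimension limits
// (hypothetical helper, not part of the original program).
static bool gridFits(const int maxGrid[3], dim3 grid) {
    return grid.x <= (unsigned int)maxGrid[0] &&
           grid.y <= (unsigned int)maxGrid[1] &&
           grid.z <= (unsigned int)maxGrid[2];
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0, as in the question
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    dim3 gridSize(45, 1575, 1575);
    printf("max grid size: %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    printf("requested grid fits: %s\n",
           gridFits(prop.maxGridSize, gridSize) ? "yes" : "no");
    return 0;
}
```

If the device reports limits of 65535 in each dimension, the grid in the question is well inside them, so the grid size by itself would not be an invalid launch configuration.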

It only crashes when I compile it with the -G flag. Otherwise it’s just slow, but it doesn’t exceed the 60 seconds. Also, I have everything up to date. Any ideas?

What operating system?

I had a similar problem on Ubuntu 12.10, but not on Windows 7, with code run through cuda-memcheck.

Does that particular GPU have a video output, i.e. is it also driving your display?
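The video-output question matters because a GPU that also drives a display is typically subject to a run-time limit on kernels. Whether that limit applies can be read from the device properties. A minimal sketch, assuming device 0; watchdogLabel is a hypothetical helper, while kernelExecTimeoutEnabled is a field of cudaDeviceProp.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Human-readable label for the run-time limit flag (hypothetical helper).
static const char* watchdogLabel(int enabled) {
    return enabled ? "enabled" : "disabled";
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Non-zero means kernels on this device are subject to a run-time limit.
    printf("%s: kernel run-time limit %s\n",
           prop.name, watchdogLabel(prop.kernelExecTimeoutEnabled));
    return 0;
}
```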

OP cross posted and accepted an answer here: