block size


I'm using a GeForce GTX 690, but only using device 0 (cudaSetDevice(0)).

Somehow I am able to create blocks as big as 512x512, with the following parameters:

dim3 dimBlock(512,512);
dim3 dimGrid(24,24);

The kernel launches perfectly and the results are good.

But I thought you could only have at most 1024 threads in one block, so the block size can be at most 32x32.

Can someone tell me why I can have a block as big as 512x512?

thanks in advance,

The answer to that question says that the dimensions of a block are limited to 512x512x64 (compute 1.x) and 1024x1024x64 (compute 2.x).

@LightBlue, it is not possible on compute capability 3.5 and earlier devices to launch a grid that defines a block size of 512x512 threads. Such a launch fails and reports an error on the next cuda* function call.

The device limits are documented in the CUDA C Programming Guide, Table 10 (Technical Specifications per Compute Capability), as the property "Maximum number of threads per block." The maximum for compute capability 2.x and 3.x devices is 1024 threads per block.

The CUDA C Programming Guide has a section on how to do proper error checking.

If after reviewing your error handling you still do not think there is an error, I recommend that you post a minimal reproducible example.

If one submits a kernel with <<<blocks, tpb>>>, the launch requires
tpb.x <= 512 (1024)
tpb.y <= 512 (1024)
tpb.z <= 64
with tpb.x * tpb.y * tpb.z <= 512 (1024)
where the parenthesized values apply to compute capability 2.x and later.
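Rather than hard-coding these limits, you can query them at runtime. A minimal sketch using cudaGetDeviceProperties (my own example, not from the thread):

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0, as in the question

    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```

On a GTX 690 (compute capability 3.0) this should report 1024 threads per block and block dimensions of 1024x1024x64, so a 512x512 block (262144 threads) is far over the limit.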

Ok, I did more testing. I wrote a simple program (as below).

If I use dimBlock(32,32), I got the correct result, A.width is changed to 10. If I use dimBlock(512,512) or dimBlock(48,48), I got wrong result, A.width is still 1 (not changed to 10).

But for both dimBlock(32,32) and dimBlock(512,512), I got the "kernel launch good" message. Is my error checking not correct?


#include <stdio.h>
#include <unistd.h>

#include <cuda.h>

typedef struct {
    int width;
    int height;
} data;

__global__ void sample_kernel(data *);

int main(int argc, char *argv[])
{
    data A, *d_A;

    A.width = 1; A.height = 2;

    cudaMalloc(&d_A, sizeof(data));
    cudaMemcpy(d_A, &A, sizeof(data), cudaMemcpyHostToDevice);

    // invoke kernel
    dim3 dimBlock(32, 32);
    dim3 dimGrid(24, 24);

    sample_kernel<<<dimGrid, dimBlock>>>(d_A);

    cudaError_t cudaerr = cudaDeviceSynchronize();
    if (cudaerr != cudaSuccess)
        printf("kernel launch failed with err \"%s\".\n", cudaGetErrorString(cudaerr));
    else
        printf("Kernel launch good\n");

    cudaMemcpy(&A, d_A, sizeof(data), cudaMemcpyDeviceToHost);

    printf("now A.width is %d\n", A.width);

    return 0;
}

__global__ void sample_kernel(data *A)
{
    if (blockIdx.x == 1 && blockIdx.y == 1 && threadIdx.x == 1 && threadIdx.y == 1) {
        printf("setting A width to 10\n");
        A->width = 10;
    }
}

You are checking the error returned by cudaDeviceSynchronize() and not the result of the kernel launch itself.
Just after the kernel launch you should use

if ( cudaSuccess != cudaGetLastError() )
    printf( "Error!\n" );

You can find macros here, for example:
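A minimal version of such a macro might look like this (my own sketch, not the code from the linked page):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; abort with file/line on failure.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",         \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)
```

Usage would be e.g. CUDA_CHECK(cudaMalloc(&d_A, sizeof(data))); and, after a kernel launch, CUDA_CHECK(cudaGetLastError());.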

thanks nezix.

I made the change, calling cudaGetLastError() first, and caught the error now.

I used cudaDeviceSynchronize() first because I was thinking the kernel executes asynchronously, so you have to sync the device first.
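The two checks catch different failure classes, so in practice you want both. A sketch, assuming the same kernel and launch parameters as above:

```cuda
sample_kernel<<<dimGrid, dimBlock>>>(d_A);

// 1) Launch-configuration errors (e.g. a 512x512 block) are reported
//    immediately, before the kernel ever runs:
cudaError_t launch_err = cudaGetLastError();
if (launch_err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(launch_err));

// 2) Errors that occur while the kernel executes (it runs
//    asynchronously) only surface once the device is synchronized:
cudaError_t exec_err = cudaDeviceSynchronize();
if (exec_err != cudaSuccess)
    printf("execution failed: %s\n", cudaGetErrorString(exec_err));
```

With an invalid block size the first check fires and the kernel never runs, which is why your cudaDeviceSynchronize()-only check reported "Kernel launch good".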

CUDA really should have made kernel launches return a status directly, so we could check whether the launch succeeded.