GeForce GTX 560 Ti block dimension problem

My graphics card is:

Device 0: “GeForce GTX 560 Ti”
CUDA Driver Version: 5.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 2048 MBytes (2147024896 bytes)
( 8) Multiprocessors x ( 48) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 1645 MHz (1.64 GHz)
Memory Clock rate: 2004 MHz
Memory Bus Width: 256-bit
L2 Cache Size: 524288 bytes
Max Texture Dimension Sizes 1D=(65536) 2D=(65536,65535) 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per multiprocessor: 1536
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2147483647 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0

I want to set the block dimension like this: dim3 dimBlock(1024, 1, 1), but it gives me the error “too many resources requested for launch at line 431 in file main.cpp”.
However, if I set the dimension like this: dim3 dimBlock(512, 2, 1), it does not give any error. I am confused, please help me…
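For what it's worth, this error only surfaces at launch time, so it helps to check the launch status explicitly right after the kernel call. A minimal sketch of the usual error-checking idiom, assuming the vectorCUDA kernel from this thread is in scope (the wrapper function and pointer names are illustrative, not from the original code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorCUDA(const float *A, const float *B, float *C,
                           int numElements);

// Launch with the failing configuration and report the launch status.
// "Too many resources requested for launch" is returned by the launch
// itself, not by the compiler, which is why nvcc accepts both configs.
void launchAndCheck(const float *dA, const float *dB, float *dC, int n) {
	dim3 dimBlock(1024, 1, 1);
	dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x, 1, 1);

	vectorCUDA<<<dimGrid, dimBlock>>>(dA, dB, dC, n);

	cudaError_t err = cudaGetLastError();  // catches launch-config errors
	if (err != cudaSuccess) {
		fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
	}
	cudaDeviceSynchronize();               // catches errors during execution
}
```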

You are using too many resources in your kernel, as the runtime tells you for the larger launch configuration.

Edit: Yes, I used the wrong verbiage, as this is not a compiler issue, but my response is still correct. As pasoleatis said below, it is either too many registers or too much shared memory used for that number of threads in your particular kernel.
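One way to see how many registers and how much shared memory the kernel actually uses is to query its attributes at runtime (compiling with nvcc -Xptxas -v prints the same numbers at build time). A sketch, assuming vectorCUDA is declared in scope; the helper function name is illustrative:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vectorCUDA(const float *A, const float *B, float *C,
                           int numElements);

void printKernelResources() {
	cudaFuncAttributes attr;
	cudaFuncGetAttributes(&attr, vectorCUDA);

	printf("registers per thread:     %d\n", attr.numRegs);
	printf("static shared mem/block:  %zu bytes\n", attr.sharedSizeBytes);
	printf("max threads per block:    %d\n", attr.maxThreadsPerBlock);
}
```

If attr.maxThreadsPerBlock comes out below 1024 for this kernel, that is exactly the condition under which a 1024-thread launch fails at runtime.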

Actually, it is a runtime error; the compiler doesn't give an error. As you can see, in both cases 1024 threads are assigned. My kernel looks like this:

__global__ void vectorCUDA(const float *A, const float *B, float *C,
		int numElements) {
	int i = blockDim.x * blockIdx.x + threadIdx.x;

	if (i < numElements) {
		C[i] = A[i] - B[i];
	}
}

My curiosity is: what is the difference between dimBlock(1024, 1, 1) and dimBlock(512, 2, 1)? Why does it give a runtime error when I want to use a one-dimensional block of 1024 threads in the x dimension?


1024 threads per block is OK to use. I think the problem in this case is that there are not enough registers, or that too much shared memory is requested. I think that in the case of registers it is usually OK to use more, because there is spilling to local memory, but the shared memory is fixed. On the Fermi architecture you can choose between 48 KB of shared memory (the default, with 16 KB of L1 cache) or 16 KB of shared memory (with 48 KB of L1 cache).