Kernel launch fails

I’m new to CUDA.
I have written a kernel and I’m launching it with 16 blocks and 256 threads per block. The kernel does not launch at all.
But if I try the same kernel with 16 blocks and 128 threads per block, it launches fine.
So do I have to check the device’s capability for the number of threads supported per block?

When I checked the device info, here are the details:
Card: GeForce GT 530
MaxThreadsPerBlock : 512
Warp size : 32
MaxThreadsPerMultiProcessor: 1024

And how do I find the maximum number of threads that can be launched?
Is there any way to find the maximum number of blocks that can be used on the device?
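As a side note, all of these limits can be queried at runtime with cudaGetDeviceProperties. A minimal sketch (device 0 assumed):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
	cudaDeviceProp prop;
	cudaError_t err = cudaGetDeviceProperties( &prop, 0 ); // device 0 assumed
	if( err != cudaSuccess )
	{
		printf( "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err) );
		return 1;
	}

	printf( "Max threads per block  : %d\n", prop.maxThreadsPerBlock );
	printf( "Max block dimensions   : %d x %d x %d\n",
	        prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2] );
	printf( "Max grid dimensions    : %d x %d x %d\n",
	        prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2] );
	printf( "Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock );
	return 0;
}
```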

With regard to the max number of blocks, reference the max grid dimensions under the compute capability specifications in the programming guide.

But I doubt your problem relates to the number of blocks you are using.

How much shared memory does a block use? Make sure that a block can actually be scheduled on an SM, given the amount of shared memory it uses.
You are likely not exceeding the max block dimensions, but you may be exceeding the shared memory limit per SM.

Perhaps also pay attention to the error type reported when the kernel launch fails; it will aid your problem identification.

But I’m not using any shared memory at all. And I don’t think the max block dimension should be a problem: I’m using only one dimension, and only 16 blocks.

If you wholeheartedly believe that the error is indeed caused by the block dimension, then simply confirm your hypothesis by referencing your device’s compute capability specifications in the programming guide, paying particular attention to the block dimension specifications.

Are you using shared memory at all?
Depending on how you have allocated shared memory, the mere fact that it works when you step down from 256 to 128 threads may equally indicate that the problem is shared memory exceeding its limits.

maxThreadsDim is 512,512,64.
maxGridSize is 65535,65535,1.
I’m not sure whether the grid dimension causes the issue or not :(

And your dimensions are (256,1,1) and (16,1,1)
So I doubt that you are exceeding any limits

Again, are you using shared memory at all (in your kernel)?

My thread dim is 16,1,1 whereas maxThreadsDim is 512,512,64.
My grid size is 256,1,1 whereas maxGridSize is 65535,65535,1.
So I hope I’m not exceeding the limits. Can you let me know how I am exceeding them?

As I have also mentioned previously, and as you now clearly see yourself, you are likely not exceeding the block/grid dimensions.

But you may be exceeding shared memory limits, to the extent that an SM cannot even schedule a single block.

Again, are you using shared memory at all (in your kernel)?

In other words, have you used __shared__ in your kernel at all, as part of variable declarations?
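For reference, shared memory enters a kernel in one of two ways. A sketch (kernel names are illustrative): a static declaration fixes the per-block usage at compile time, while a dynamic declaration takes its size from the third launch parameter — in which case stepping from 128 to 256 threads can double the per-block usage.

```cuda
// Static allocation: size fixed at compile time.
__global__ void Kernel_StaticShared( float* data )
{
	__shared__ float tile[256];     // 256 * 4 = 1024 bytes per block
	tile[threadIdx.x] = data[threadIdx.x];
	__syncthreads();
	// ...
}

// Dynamic allocation: size supplied at launch time.
__global__ void Kernel_DynamicShared( float* data )
{
	extern __shared__ float tile[]; // sized by the third launch parameter
	tile[threadIdx.x] = data[threadIdx.x];
	__syncthreads();
	// ...
}

// Launch syntax: <<<blocks, threads, sharedBytes>>>
// Kernel_DynamicShared<<<16, 256, 256 * sizeof(float)>>>( d_data );
```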

No, no. I have already mentioned that I’m not using any shared memory at all.

Your GRID is 16,1,1 not your block, I would think
Similarly, your BLOCK is 256,1,1, not your grid

Perhaps post the code section where you launch the kernel

Yes. Sorry for the confusion.

int nbBlocks = 16;
int iMaxThreadsPerBlock = 256;

//Invoke the kernel
Parallel_Sum<<<nbBlocks,iMaxThreadsPerBlock>>>( d_arrIn, d_arrOut );

You want me to post the entire code?

Post the declarations of d_arrIn and d_arrOut, and any memory allocations (cudaMalloc etc.) you have from that point up to the kernel launch.

#include <cstdio>

#define MAX_ARRAY_SIZE 4096 //( 16 * 256 )

__global__ void Parallel_Sum( float* arrIn, float* arrOut )
{
	int globalID = threadIdx.x + ( blockIdx.x * blockDim.x );
	int threadID = threadIdx.x;

	for( unsigned int threadCtr = ( blockDim.x/2 ); threadCtr > 0; threadCtr >>= 1 )
	{
		if( threadID < threadCtr )
			arrIn[globalID] += arrIn[globalID+threadCtr];
		__syncthreads(); // make each step's partial sums visible before the next
	}

	if( threadID == 0 )
		arrOut[blockIdx.x] = arrIn[globalID];
}

int main()
{
	float* h_arrIn  = new float[ MAX_ARRAY_SIZE ];
	float* h_arrOut = new float[ MAX_ARRAY_SIZE ];

	//Initialize the input array
	for( int i = 0; i < MAX_ARRAY_SIZE; i++ )
		h_arrIn[i] = i;

	//Device memory
	int iByteSize = sizeof(float) * MAX_ARRAY_SIZE;
	float *d_arrIn, *d_arrOut;
	cudaMalloc( (void**)&d_arrIn,  iByteSize );
	cudaMalloc( (void**)&d_arrOut, iByteSize );

	//Copy the contents from host memory to device memory
	cudaMemcpy( d_arrIn, h_arrIn, iByteSize, cudaMemcpyHostToDevice );

	int nbBlocks = 16;
	int iMaxThreadsPerBlock = 256;

	//Invoke the kernel
	Parallel_Sum<<<nbBlocks,iMaxThreadsPerBlock>>>( d_arrIn, d_arrOut );

	//Copy back the results from device to host memory
	cudaMemcpy( h_arrOut, d_arrOut, iByteSize, cudaMemcpyDeviceToHost );

	//Print the results
	for( int i = 0; i < MAX_ARRAY_SIZE; i++ )
		printf( "%f\n", h_arrOut[i] );

	return 0;
}

Your code works fine for me.

In the case where you think it’s failing (i.e. as posted, with 16 blocks of 256 threads), I suggest running the code with cuda-memcheck

I suspect it will report 0 errors. If your kernel was not launching, you would not get 0 errors from cuda-memcheck.

The output printout is confusing, because most of the h_arrOut array is never set or initialized, so you get a lot of strange values. The only valid values in h_arrOut are the first 16 (one per block). If you limit your printout to the first 16 values, I think you’ll get sensible results.

Also, any time you’re having trouble with a CUDA code, it’s a good idea to do proper CUDA error checking on all CUDA API calls and kernel calls. I don’t see that in your code. If you did proper CUDA error checking, it would indicate for sure whether the kernel is launching or not, and give you some indication of the reason for failure if it is not launching.
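A minimal error-checking pattern (the macro name is illustrative) would look something like this — note that a failed launch configuration is reported by cudaGetLastError immediately after the launch, while errors during kernel execution only surface at the next synchronizing call:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA API call; abort with file/line on failure.
#define CUDA_CHECK(call)                                            \
	do {                                                            \
		cudaError_t err = (call);                                   \
		if( err != cudaSuccess )                                    \
		{                                                           \
			fprintf( stderr, "CUDA error %s at %s:%d\n",            \
			         cudaGetErrorString(err), __FILE__, __LINE__ ); \
			exit(1);                                                \
		}                                                           \
	} while(0)

// Usage around the launch in question:
// CUDA_CHECK( cudaMemcpy(d_arrIn, h_arrIn, iByteSize, cudaMemcpyHostToDevice) );
// Parallel_Sum<<<nbBlocks, iMaxThreadsPerBlock>>>( d_arrIn, d_arrOut );
// CUDA_CHECK( cudaGetLastError() );       // catches launch-configuration errors
// CUDA_CHECK( cudaDeviceSynchronize() );  // catches errors during execution
```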