Launching Kernel Fail

Syed_Babu · May 27, 2014, 6:14am

I’m new to CUDA.
I have written a kernel and I’m launching it with 16 blocks and 256 threads per block. The kernel does not launch at all.
But if I try the same with 16 blocks and 128 threads per block, it launches fine.
So do I have to check the device’s capability for the number of threads supported per block?

Syed_Babu · May 27, 2014, 6:23am

When I checked the device info, here is the detail:
Card: GeForce GT 530
MaxThreadsPerBlock : 512
Warp size : 32
MaxThreadsPerMultiProcessor: 1024

Syed_Babu · May 27, 2014, 6:25am

And how do I find out the maximum number of threads that can be launched?
Is there any way to find the maximum number of blocks that can be used on the device?

little_jimmy · May 27, 2014, 10:13am

With regard to the max number of blocks, refer to the max grid dimensions under the compute capability specifications in the programming guide
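The same limits can also be read at run time rather than from the programming guide. A minimal sketch, assuming the card of interest is passed in as a device index; the helper names `launchFits` and `printLaunchLimits` are hypothetical and only for illustration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if a 1D launch of `blocks` blocks with `threads`
// threads each stays within the block and grid limits stored in `prop`.
// It does not cover per-kernel resources such as registers or shared memory.
bool launchFits(const cudaDeviceProp& prop, int blocks, int threads)
{
    return blocks >= 1 && threads >= 1 &&
           threads <= prop.maxThreadsPerBlock &&
           threads <= prop.maxThreadsDim[0] &&
           blocks  <= prop.maxGridSize[0];
}

// Hypothetical helper: prints the limits of one device.
// Returns false if the property query itself fails.
bool printLaunchLimits(int device)
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return false;
    }
    std::printf("Card                        : %s\n", prop.name);
    std::printf("maxThreadsPerBlock          : %d\n", prop.maxThreadsPerBlock);
    std::printf("maxThreadsDim               : %d,%d,%d\n",
                prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    std::printf("maxGridSize                 : %d,%d,%d\n",
                prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    std::printf("maxThreadsPerMultiProcessor : %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("sharedMemPerBlock (bytes)   : %zu\n", prop.sharedMemPerBlock);
    return true;
}
```

For a device reporting 512 threads per block and a grid limit of 65535 in x, `launchFits(prop, 16, 256)` returns true. Passing this check only rules out the dimension limits, not resource limits.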

But I doubt whether your problem relates to the number of blocks you are using

How much shared memory does a block use? Make sure that a block can actually be scheduled by an SM, based on the amount of shared memory it uses
You are likely not exceeding the max block dimensions per SM, but you may be exceeding the shared memory limit per SM

Perhaps also pay attention to the error type specified when the kernel launch fails, to aid your problem identification

Syed_Babu · May 27, 2014, 10:32am

But I’m not using any shared memory. I don’t think the max block dimension should be a problem either. I’m using only one dimension, and only 16 blocks at that.

little_jimmy · May 27, 2014, 11:30am

If you wholeheartedly believe that the error is indeed caused by the block dimension, then simply confirm your hypothesis by referring to your device’s compute capability specifications in the programming guide, paying particular attention to the block dimension specifications

Are you using shared memory at all?
Depending on how you have allocated shared memory, the mere fact that it works when you step down from 256 to 128 threads may equally indicate that the problem is caused by shared memory exceeding its limits
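One way to settle the shared memory question without reading the whole kernel is to ask the runtime what the compiled kernel uses. A sketch, assuming a stand-in kernel `demoKernel` with a 256-float `__shared__` array; both `demoKernel` and `staticSharedBytes` are hypothetical names, not code from this thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel (hypothetical): uses one statically sized __shared__ array.
// Written for blocks of at most 256 threads.
__global__ void demoKernel(float* out)
{
    __shared__ float tile[256];
    tile[threadIdx.x] = static_cast<float>(threadIdx.x);
    __syncthreads();
    // Read an element written by another thread so the array is really needed.
    out[blockIdx.x * blockDim.x + threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Reports the per-block resources the compiled kernel needs.
// Returns the static shared memory size in bytes, or -1 if the query fails.
long staticSharedBytes()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        std::printf("cudaFuncGetAttributes failed: %s\n", cudaGetErrorString(err));
        return -1;
    }
    std::printf("static shared memory per block : %zu bytes\n", attr.sharedSizeBytes);
    std::printf("registers per thread           : %d\n", attr.numRegs);
    std::printf("max threads per block (kernel) : %d\n", attr.maxThreadsPerBlock);
    return static_cast<long>(attr.sharedSizeBytes);
}
```

Note that `sharedSizeBytes` covers only statically declared shared memory; anything requested through the third launch parameter comes on top of it. The `maxThreadsPerBlock` field here is the limit for this particular kernel, which can be lower than the device-wide limit when the kernel uses many registers.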

Syed_Babu · May 27, 2014, 12:54pm

maxThreadsDim is 512,512,64
MaxGridSize is 65535,65535,1
I’m not sure whether the grid dimension is causing the issue or not :(

little_jimmy · May 27, 2014, 1:24pm

And your dimensions are (256,1,1) and (16,1,1)
So I doubt that you are exceeding any limits

Again, are you using shared memory at all (in your kernel)?

Syed_Babu · May 28, 2014, 5:32am

My thread dim is 16,1,1 whereas maxThreadsDim is 512,512,64.
My grid size is 256,1,1 whereas MaxGridSize is 65535,65535,1.
So I hope I’m not exceeding the limits. Can you let me know how I am exceeding them?

little_jimmy · May 28, 2014, 8:06am

As I have also mentioned previously, and as you now clearly see yourself, you are likely not exceeding the block/ grid dimensions

But you may exceed shared memory limits, to the extent that an SM cannot even schedule a single block

Again, are you using shared memory at all (in your kernel)?

In other words, have you used __shared__ in your kernel at all, as part of variable declarations?

Syed_Babu · May 28, 2014, 8:10am

No, no. I have already mentioned that I’m not using any shared memory at all.

little_jimmy · May 28, 2014, 8:22am

Your GRID is 16,1,1 not your block, I would think
Similarly, your BLOCK is 256,1,1, not your grid

Perhaps post the code section where you launch the kernel

Syed_Babu · May 28, 2014, 8:28am

Yes. Sorry for the confusion.

int nbBlocks = 16;
int iMaxThreadsPerBlock = 256;

//Invoke the kernel
Parallel_Sum<<<nbBlocks,iMaxThreadsPerBlock>>>( d_arrIn, d_arrOut );

You want me to post the entire code?

little_jimmy · May 28, 2014, 9:38am

Post the declarations of d_arrIn and d_arrOut, and any memory allocations (cudaMalloc etc.) you have from that point up to the launching of the kernel

Syed_Babu · May 28, 2014, 12:02pm

#include <iostream>
using namespace std;

#define MAX_ARRAY_SIZE 4096 //( 16 * 256 )

//Each block sums its own blockDim.x consecutive inputs in place (arrIn is overwritten)
__global__ void Parallel_Sum(float* arrIn, float* arrOut )
{
	int globalID = threadIdx.x + ( blockIdx.x * blockDim.x );
	int threadID = threadIdx.x;

	//Tree reduction: halve the number of active threads on every pass
	for( unsigned int threadCtr = ( blockDim.x/2 ); threadCtr > 0 ; threadCtr >>=1 )
	{
		if( threadID <  threadCtr )
		{
			arrIn[globalID]  += arrIn[globalID+threadCtr];
		}
		__syncthreads();
	}

	//Thread 0 of each block writes that block's sum
	if( threadID == 0 )
	{
		arrOut[blockIdx.x] = arrIn[globalID];
	}
}

int main()
{
	float* h_arrIn = new float[ MAX_ARRAY_SIZE ];
	float* h_arrOut= new float[ MAX_ARRAY_SIZE ];

	//Initialize the input array
	int i = 0;
	for( ; i < MAX_ARRAY_SIZE; i++ )
	{
		h_arrIn[i] = i;
	}

	//Device memory
	int iByteSize =  sizeof(float) * MAX_ARRAY_SIZE;
	float *d_arrIn,*d_arrOut;
	cudaMalloc((void**)&d_arrIn, iByteSize );
	cudaMalloc((void**)&d_arrOut, iByteSize );

	//Copy the contents from host memory to device memory
	cudaMemcpy(d_arrIn, h_arrIn, iByteSize , cudaMemcpyHostToDevice );

	int nbBlocks = 16;
	int iMaxThreadsPerBlock = 256;

	//Invoke the kernel
	Parallel_Sum<<<nbBlocks,iMaxThreadsPerBlock>>>(  d_arrIn, d_arrOut );

	//Copy back the results from device to host memory
	cudaMemcpy(h_arrOut, d_arrOut, iByteSize , cudaMemcpyDeviceToHost);

	//Print the results
	i = 0;
	for( ; i < MAX_ARRAY_SIZE; i++ )
	{
		cout<<h_arrOut[i]<<endl;
	}

	return 0;
}

Robert_Crovella · May 28, 2014, 1:57pm

Your code works fine for me.

In the case where you think it’s failing (i.e. as posted, with 16 blocks of 256 threads), I suggest running the code with cuda-memcheck

I suspect it will report 0 errors. If your kernel was not launching, you would not get 0 errors from cuda-memcheck.

The output printout is confusing, because most of the h_arrOut array is not being set or initialized anywhere, so you get a lot of strange values. The only valid values in the h_arrOut are the first 16 (one per block). If you limit your printout to the first 16 values, I think you’ll get sensible results.
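To make that concrete: with the input h_arrIn[i] = i, each block of 256 threads sums 256 consecutive integers, so block b should produce 65536*b + 32640. The sketch below reuses the kernel from the post above and copies back only one value per block; `expectedBlockSum` and `checkBlockSums` are hypothetical helper names, and the block size is assumed to be a power of two:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Same in-place tree reduction as the kernel posted above.
__global__ void Parallel_Sum(float* arrIn, float* arrOut)
{
    int globalID = threadIdx.x + blockIdx.x * blockDim.x;
    int threadID = threadIdx.x;
    for (unsigned int threadCtr = blockDim.x / 2; threadCtr > 0; threadCtr >>= 1) {
        if (threadID < threadCtr) {
            arrIn[globalID] += arrIn[globalID + threadCtr];
        }
        __syncthreads();
    }
    if (threadID == 0) {
        arrOut[blockIdx.x] = arrIn[globalID];
    }
}

// Host reference: sum of the inputs owned by one block when h_arrIn[i] = i.
float expectedBlockSum(int block, int threadsPerBlock)
{
    float sum = 0.0f;
    for (int i = 0; i < threadsPerBlock; ++i) {
        sum += static_cast<float>(block * threadsPerBlock + i);
    }
    return sum;
}

// Runs the reduction and prints only the nbBlocks valid outputs.
// Returns the number of mismatching blocks, or -1 if a CUDA call fails.
int checkBlockSums(int nbBlocks, int threadsPerBlock)
{
    const int n = nbBlocks * threadsPerBlock;
    std::vector<float> h_in(n), h_out(nbBlocks);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float* d_in = NULL;
    float* d_out = NULL;
    if (cudaMalloc(&d_in, n * sizeof(float)) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, nbBlocks * sizeof(float)) != cudaSuccess) {
        cudaFree(d_in);
        return -1;
    }
    cudaError_t err = cudaMemcpy(d_in, h_in.data(), n * sizeof(float),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        Parallel_Sum<<<nbBlocks, threadsPerBlock>>>(d_in, d_out);
        err = cudaGetLastError();  // did the launch itself succeed?
    }
    if (err == cudaSuccess) {
        // One float per block is all the kernel ever writes.
        err = cudaMemcpy(h_out.data(), d_out, nbBlocks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int b = 0; b < nbBlocks; ++b) {
        float expected = expectedBlockSum(b, threadsPerBlock);
        std::printf("block %2d: got %.0f, expected %.0f\n", b, h_out[b], expected);
        if (h_out[b] != expected) ++mismatches;
    }
    return mismatches;
}
```

Calling `checkBlockSums(16, 256)` corresponds to the launch in the posted code. All partial sums here are integers small enough to be exact in float, so comparing with `!=` is safe for this particular input.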

Also, any time you’re having trouble with a cuda code, it’s a good idea to do proper cuda error checking on all cuda API calls and kernel calls. I don’t see that in your code. If you did proper cuda error checking, it would indicate for sure whether the kernel is launching or not, and give you some indication of the reason for failure if it is not launching.
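A minimal sketch of what that checking can look like around a kernel launch, assuming a placeholder kernel `emptyKernel` and a helper `launchAndCheck` (both hypothetical names, not from the code above):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical): does nothing, only used to test launches.
__global__ void emptyKernel() {}

// Launches the kernel and returns the first error seen, or cudaSuccess.
cudaError_t launchAndCheck(int blocks, int threads)
{
    emptyKernel<<<blocks, threads>>>();

    // Launch-time problems (bad configuration, too many resources requested)
    // are reported by cudaGetLastError() right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors that happen while the kernel runs are reported once it has finished.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel execution failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

With this in place, `launchAndCheck(16, 256)` should return `cudaSuccess` on a device that allows 256 threads per block, while a block size above the device limit returns an error (typically reported as an invalid configuration) instead of failing silently. The same pattern of testing the returned `cudaError_t` applies to cudaMalloc and cudaMemcpy.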
