<<<(w*h+255)/256, 256>>> is not executing the global function code. Why?


Kernal<<<(w*h+255)/256, 256>>>( char* in, char* out );

I’m running a global function with <<<(w*h+63)/64, 64>>>. It works fine (i.e., I get the output) and takes 2.8 sec.

When I test it with the config <<<(w*h+255)/256, 256>>>, I get no output and it takes 16 ms.

This means that with the second config the global function is not executed at all, right?

Any function called before or after Kernal() with the config <<<(w*h+255)/256, 256>>> works fine.

What’s the reason for this?

Check the error returned from the kernel; it should give you more information about why the kernel fails.
Probably the configuration is wrong.
What are the values for w, h and the threads/blocks you use?


w = 800;

h = 600;

blocks = (800*600+255)/256 = 1875

threadsPerBlock = 256;

Probably too many registers per thread. Look at the cubin file to find out.

The number of blocks in your kernel is 1875, which, multiplied by 256 threads (1875*256 = 480,000), exceeds the permissible grid size of 65,535.

So the kernel does not get launched. You need to restrict the grid size.

That’s probably not right. The 256, as I understood him, is the threads per block, so you don’t have to multiply the 1875 by 256.

Probably what Jamie said. Either way, make sure you call cudaThreadSynchronize() and check the error code!
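
To make the "check the error code" advice concrete, here is a minimal sketch of one way to do it. The kernel copyKernel and the helper checkCuda are hypothetical names made up for this example, not the poster's code. It uses cudaDeviceSynchronize(), which is the name in current toolkits; on toolkits from the era of this thread the equivalent call was cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel: copies in to out.
__global__ void copyKernel(const unsigned char* in, unsigned char* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx];
}

// Hypothetical helper: prints the error, if any, and returns true on success.
static bool checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    const int w = 800, h = 600, n = w * h;
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    unsigned char *in = NULL, *out = NULL;
    if (!checkCuda(cudaMalloc((void**)&in, n), "cudaMalloc(in)")) return 1;
    if (!checkCuda(cudaMalloc((void**)&out, n), "cudaMalloc(out)")) return 1;
    if (!checkCuda(cudaMemset(in, 0, n), "cudaMemset(in)")) return 1;

    copyKernel<<<blocks, threads>>>(in, out, n);

    // Launch-time errors (bad configuration, out of resources) show up here.
    bool ok = checkCuda(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs show up after synchronizing.
    ok = checkCuda(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(in);
    cudaFree(out);
    return ok ? 0 : 1;
}
```

A kernel that finishes suspiciously fast and leaves the output untouched is usually one that never launched, and a check like this right after the launch reports the reason.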


Yeah, I was just referring to Appendix A.1.1, line 3 of the Programming Guide, which says the max grid size is 65,535.

Earlier, I restricted my blocks to 128 with 512 threads, in keeping with the Programming Guide.

Just now, I ran my kernel with 512 threads and 130 blocks. The results are fine.

So is the grid size rule not applicable, or what is going on internally?

the error from

cudaError error = cudaGetLastError();

is “cudaErrorLaunchOutOfResources”.

How can I fix this?

You need to reduce the resource usage: registers, shared memory, etc.

Start by adding --ptxas-options="-v -mem" to the compile command line, so you can see the amount of resources you’re using.


Grid size limit of 65535 is number of blocks (in each dimension x and y), not number of threads. Total threads could conceivably be as large as 65535*65535*512, provided nothing else prevents it.
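
As a quick check of the arithmetic for the 800 x 600 case in this thread, here is a small host-side sketch. The helper names blocksFor and fitsGridLimit are made up for this example; the 65535 limit is the per-dimension figure quoted above.

```cuda
// Number of blocks needed to cover n elements (ceiling division),
// the same formula as (w*h + threads - 1) / threads in the launch config.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// True if a 1D grid of this many blocks respects the per-dimension limit.
static bool fitsGridLimit(int blocks)
{
    const int maxGridDimX = 65535; // limit on blocks per grid dimension
    return blocks <= maxGridDimX;
}
```

For 800*600 = 480,000 elements this gives 7500 blocks at 64 threads per block, 1875 at 256, and 938 at 512, all far below 65535, so the grid size is not what stops the launch here.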

Manjunath, you are very likely using too many registers. To fix this, run fewer threads per block.

From the 2.2 Programming Guide, pages 71/72:

Thanks, Jamie. That was right. For a long time, I was mistaken. Manjunath, can you post your kernel code (if that is possible)?

cpu side code looks like …

void Kernal( unsigned char* p1, …unsigned char* p6, long srcw, long srch, long dstw, long dsth )
{
    __shared__ long inw, inh, limit;
    inw = srcw-10;
    inh = srch-10;
    limit = inw * inh;

    for ( int y = 0; y < inh; ++y )
    {
        for(int i = 0; i < inw ; ++i )
        {
            // here again 4 more for loops.
        }
    }
}


I have written cuda side code like this…


__global__ void Kernal( unsigned char* p1, …unsigned char* p6, long srcw, long srch, long dstw, long dsth )
{
    __shared__ long inw, inh, limit;
    inw = srcw-10;
    inh = srch-10;
    limit = inw * inh;

    // one thread per (x, y) position instead of the two outer loops
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    //for ( int y = 0; y < inh; ++y )
    if ( idx < limit )
    {
        int y = idx/(inw+1);
        int x = idx%(inw+1);

        //for(int i = 0; i < inw ; ++i )
        if ( x < inw )
        {
            // here again 4 more for loops.
        }
    }
}


Kernal<<<(w*h+511)/512, 512>>>( w,h, other params… );

I’m not passing the third param (size of shared memory) in the config, but I’m using shared memory in Kernal.

Is this the reason for “cudaErrorLaunchOutOfResources”?

Instead of guessing, you should compile with the ptxas options (as specified above) and you’ll get exactly
how many registers and how much shared memory you use in each kernel.
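
On the third launch parameter itself: variables declared with a fixed size via __shared__ are sized at compile time, and the third parameter is only needed for extern __shared__ arrays, so leaving it out is not by itself an error. A minimal sketch of the two forms follows; the kernels reverseStatic and reverseDynamic and the helper sharedVariantsAgree are hypothetical names for illustration only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Static shared memory: the size is fixed at compile time, so the launch
// needs no third configuration parameter.
__global__ void reverseStatic(int* out)
{
    __shared__ int tile[64];
    tile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = tile[blockDim.x - 1 - threadIdx.x];
}

// Dynamic shared memory: the size comes from the third launch parameter.
__global__ void reverseDynamic(int* out)
{
    extern __shared__ int dynTile[];
    dynTile[threadIdx.x] = threadIdx.x;
    __syncthreads();
    out[threadIdx.x] = dynTile[blockDim.x - 1 - threadIdx.x];
}

// Runs both kernels and reports whether they produced the same output.
static bool sharedVariantsAgree()
{
    const int n = 64;
    int a[n], b[n];
    int* d_out = NULL;
    if (cudaMalloc((void**)&d_out, n * sizeof(int)) != cudaSuccess) return false;

    reverseStatic<<<1, n>>>(d_out);                    // no third parameter
    cudaError_t e1 = cudaMemcpy(a, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    reverseDynamic<<<1, n, n * sizeof(int)>>>(d_out);  // third parameter = bytes
    cudaError_t e2 = cudaMemcpy(b, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
    if (e1 != cudaSuccess || e2 != cudaSuccess) return false;
    for (int i = 0; i < n; ++i)
        if (a[i] != b[i] || a[i] != n - 1 - i) return false;
    return true;
}

int main()
{
    printf("variants agree: %s\n", sharedVariantsAgree() ? "yes" : "no");
    return 0;
}
```

The kernel posted above only declares a few fixed-size __shared__ variables, which corresponds to the first form.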


Yeah, run the code as Eyal pointed out; that will give the exact register count and smem usage.

Done, this is the resource list for my kernel function:

1>ptxas info : Compiling entry function ‘MyKernalPhS_lllPtS0_PsS1

1>ptxas info : Used 28 registers, 96+64 bytes lmem, 68+64 bytes smem, 20 bytes cmem[1]

1>Memory space statistics for ‘OCG memory pool for function Z29MyKernalPhS_lllPtS0_PsS1



Page size : 0x1000 bytes

Total allocated : 0x31d368 bytes

Total available : 0x1e438 bytes

Nrof small block pages : 463

Nrof large block pages : 107

Longest free list size : 1

Average free list size : 0

On a 1.1 device, there are a total of 8192 registers available. With 512 threads per block, each thread may use no more than 16 registers. You are using 28.

Launch the kernel with fewer threads per block.
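
The register arithmetic above can be sketched as a small host-side helper. The name maxThreadsForRegisters is made up for this example, and the calculation is a simplification: it ignores the hardware's register allocation granularity, so treat the result as an upper bound rather than a guaranteed working block size.

```cuda
// Largest block size, rounded down to a whole number of warps, whose
// threads fit in the register file at the given per-thread register count.
// Simplification: ignores register allocation granularity.
static int maxThreadsForRegisters(int totalRegisters, int regsPerThread)
{
    const int warpSize = 32;
    int threads = totalRegisters / regsPerThread;
    return (threads / warpSize) * warpSize;
}
```

With 8192 registers, a kernel using 16 registers per thread can run 512 threads per block, while one using 28 is capped at 288 by this estimate, which is why the 512-thread launch fails.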