<<<(w*h+255)/256, 256>>> is not executing the __global__ function code. Why?

Manjunath_Gudisi · May 18, 2009, 9:59am

Hello,

Kernal<<<(w*h+255)/256, 256>>>( char* in, char* out );

I'm running a __global__ function with <<<(w*h+63)/64, 64>>>. It works fine (i.e., I get the output) and takes 2.8 sec.

When I test it with the config <<<(w*h+255)/256, 256>>>, I get no output and it takes only 16 ms.

This means that with the second config the __global__ function is not executed at all, right?

Any function called before or after Kernal() with the config <<<(w*h+255)/256, 256>>> works fine.

What's the reason for this?

eyalhir74 · May 18, 2009, 10:56am

Check the error returned from the kernel, it should give you more information as to why the kernel fails.
Probably the configuration is wrong.
What are the values for w, h and the threads/blocks you use?

eyal

Manjunath_Gudisi · May 18, 2009, 12:58pm

w = 800;

h = 600;

blocks = (800*600+255)/256 = 1875

threadsPerBlock = 256;
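The block count above is the usual round-up division of the pixel count by the block size. A minimal host-side sketch of that arithmetic, in plain C++ that also compiles inside a .cu file; the helper name `blocks_for` is made up for illustration and is not part of the original code:

```cpp
#include <cassert>

// Hypothetical helper: number of blocks needed so that
// blocks * threads_per_block covers at least `elements` items.
// Same arithmetic as (w*h + 255) / 256 in the launch configuration.
inline int blocks_for(int elements, int threads_per_block) {
    assert(elements > 0 && threads_per_block > 0);
    return (elements + threads_per_block - 1) / threads_per_block;
}
```

For w = 800 and h = 600, `blocks_for(w * h, 256)` gives 1875 and `blocks_for(w * h, 64)` gives 7500.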

Jamie_K · May 19, 2009, 1:41am

Probably too many registers per thread. Look at the cubin file to find out.

dlmeetei · May 19, 2009, 8:36am

The number of blocks in your launch is 1875, which multiplied by 256 threads (1875*256 = 480,000) exceeds the permissible grid size of 65,535.

So the kernel does not get launched. You need to restrict the grid size.

eyalhir74 · May 19, 2009, 9:21am

That's probably not right. The 256, as I understood him, is the number of threads per block, so you don't have to multiply the 1875 by 256.

It's probably what Jamie said. Either way, make sure you call cudaThreadSynchronize() and check the error code!

eyal

dlmeetei · May 19, 2009, 10:32am

Yeah, I was just referring to Appendix A.1.1, line 3, of the Programming Guide, which says the max grid size is 65,535.

Earlier, I restricted my launches to 128 blocks of 512 threads, in keeping with the Programming Guide.

Just now, I ran my kernel with 512 threads and 130 blocks. The results are fine.

So is the grid size rule not applicable, or what is going on internally?

Manjunath_Gudisi · May 19, 2009, 10:49am

The error from

cudaError error = cudaGetLastError();

is “cudaErrorLaunchOutOfResources”.

How can I fix this?

dlmeetei · May 19, 2009, 11:04am

You need to reduce your resource usage: registers, shared memory (smem), etc.

eyalhir74 · May 19, 2009, 12:44pm

Start by adding --ptxas-options="-v -mem" to the compile command line, so you can see the amount of resources you're using.

eyal

Jamie_K · May 19, 2009, 7:25pm

The grid size limit of 65535 is the number of blocks (in each dimension x and y), not the number of threads. Total threads could conceivably be as large as 65535*65535*512, provided nothing else prevents it.
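To make the distinction concrete, here is a small host-side sketch that checks a 1D launch configuration against the two limits mentioned in this thread: 65535 blocks per grid dimension and 512 threads per block. The struct and function names are invented for illustration, and the limits are hard-coded from the figures quoted here; in a real program they would be read from cudaGetDeviceProperties.

```cpp
#include <cassert>

// Limits as quoted in the thread (assumed here as fixed constants).
const int kMaxBlocksPerGridDim = 65535;
const int kMaxThreadsPerBlock  = 512;

// Hypothetical description of a 1D launch: <<<blocks, threads_per_block>>>.
struct LaunchConfig {
    int blocks;
    int threads_per_block;
};

// True if the configuration respects both limits. Note that the
// product blocks * threads_per_block is NOT compared against 65535.
inline bool within_launch_limits(const LaunchConfig& cfg) {
    assert(cfg.blocks > 0 && cfg.threads_per_block > 0);
    return cfg.blocks <= kMaxBlocksPerGridDim &&
           cfg.threads_per_block <= kMaxThreadsPerBlock;
}

// Total threads in the launch, computed in 64 bits to avoid overflow.
inline long long total_threads(const LaunchConfig& cfg) {
    return static_cast<long long>(cfg.blocks) * cfg.threads_per_block;
}
```

Both 1875 blocks of 256 threads (480,000 threads in total) and 130 blocks of 512 threads (66,560 threads) pass this check, so neither launch is rejected because of the grid size.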

Manjunath, you are very likely using too many registers. To fix this, run fewer threads per block.

From the 2.2 Programming Guide, pages 71/72:

dlmeetei · May 20, 2009, 4:19am

Thanks, Jamie. That was right. I was mistaken about this for a long time. Manjunath, can you post your kernel code (if that is possible)?

Manjunath_Gudisi · May 20, 2009, 8:39am

The CPU-side code looks like this:

void Kernal( unsigned char* p1, …unsigned char* p6, long srcw, long srch, long dstw, long dsth )
{
    __shared__ long inw, inh, limit;
    inw = srcw-10;
    inh = srch-10;
    limit = inw * inh;

    for ( int y = 0; y < inh; ++y )
    {
        for(int i = 0; i < inw ; ++i )
        {
            // here again 4 more for loops.
        }
    }
}

I have written the CUDA-side code like this:

__global__
void Kernal( unsigned char* p1, …unsigned char* p6, long srcw, long srch, long dstw, long dsth )
{
    __shared__ long inw, inh, limit;
    inw = srcw-10;
    inh = srch-10;
    limit = inw * inh;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    //for ( int y = 0; y < inh; ++y )
    if ( idx < limit )
    {
        int y = idx/(inw+1);
        int x = idx%(inw+1);

        //for(int i = 0; i < inw ; ++i )
        if ( x < inw )
        {
            // here again 4 more for loops.
        }
    }
}

Calling:

w=640;
h=600;
Kernal<<<(w*h+511)/512, 512>>>( w,h, other params… );

I'm not passing the third parameter (the size of shared memory) in the config, but I'm using shared memory in Kernal.

Is this the reason for “cudaErrorLaunchOutOfResources”?

eyalhir74 · May 20, 2009, 9:12am

Hi,
Instead of guessing, you should compile with the --ptxas-options flag (as specified above) and you'll see exactly
how many registers and how much shared memory each kernel uses.

eyal

dlmeetei · May 20, 2009, 10:49am

Yeah, compile the code as Eyal pointed out; that will give the exact register count and smem usage …

Manjunath_Gudisi · May 20, 2009, 11:41am

Done. This is the resource list for my kernel function:

1>ptxas info : Compiling entry function ‘MyKernalPhS_lllPtS0_PsS1’

1>ptxas info : Used 28 registers, 96+64 bytes lmem, 68+64 bytes smem, 20 bytes cmem[1]

1>Memory space statistics for ‘OCG memory pool for function Z29MyKernalPhS_lllPtS0_PsS1’

1>=============================================================================================================

Page size : 0x1000 bytes

Total allocated : 0x31d368 bytes

Total available : 0x1e438 bytes

Nrof small block pages : 463

Nrof large block pages : 107

Longest free list size : 1

Average free list size : 0

Jamie_K · May 20, 2009, 12:28pm

On a compute capability 1.1 device, there are a total of 8192 registers available per multiprocessor. With 512 threads per block, each thread may use no more than 16 registers (8192 / 512). You are using 28.

Launch the kernel with fewer threads per block.
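The register arithmetic can be sketched on the host as follows. This is a simplified upper bound: it assumes the 8192-register figure quoted above and ignores the hardware's register allocation granularity, so a block size that passes this check is not guaranteed to launch. The function names are made up for illustration, and rounding down to a whole number of 32-thread warps is a choice made for this sketch.

```cpp
#include <cassert>

// Register budget quoted above for a compute capability 1.1 device.
const int kRegistersAvailable = 8192;
const int kWarpSize = 32;

// A block can only launch if the registers of all its threads fit at once.
inline bool block_fits_registers(int threads_per_block, int regs_per_thread) {
    assert(threads_per_block > 0 && regs_per_thread > 0);
    return threads_per_block * regs_per_thread <= kRegistersAvailable;
}

// Largest block size, rounded down to a whole number of warps,
// whose total register demand stays within the budget.
inline int max_threads_per_block(int regs_per_thread) {
    assert(regs_per_thread > 0);
    int raw = kRegistersAvailable / regs_per_thread;
    return (raw / kWarpSize) * kWarpSize;
}
```

With 28 registers per thread, 512 threads would need 14,336 registers, well over 8192; by this bound the block size has to be at most 288 threads.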