Kernel works with a smaller number of threads but fails with a CUDA error as I increase the total threads spawned

Hi all,

I have a ‘persistent threads’ type implementation to handle a worklist (producer-consumer) type algorithm.
It is a heavy kernel: it has many local variables and double-precision computation.
To sum it up, I have a kernel that uses lots of registers and local memory. It fails with an unspecified launch error when I spawn more threads (it works for 128 × 30 but fails for 192 × 30).

More description -
I am using a GTX280 with CUDA compilation tools, release 4.1, V0.2.1221, and NVIDIA UNIX x86_64 Kernel Module 285.05.32.

My kernel gives perfect results when compiled with up to 128 threads per block and 30 blocks.
By default (without using maxrregcount) the ptxas info obtained for sm_13 is:
ptxas info : Used 110 registers, 2184+0 bytes lmem, 112+16 bytes smem, 156 bytes cmem[1]

As you can see, the code is currently not very well written, in that it has heavy register and local memory usage.

In order to increase the number of active blocks (warps), I used maxrregcount = 75 (to reduce register usage and thus increase the block size from 128 to 192). I get the following ptxas info for this build:
ptxas info : Used 75 registers, 2296+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]
And the code fails with: ERROR: cudaGetLastError() returned unspecified launch failure (err#4)

Is this failure due to the local/constant memory usage, which goes out of bounds when the total number of threads spawned increases?

Thanks
Sid.

Also, when I run with block size 160 and grid size 30:
It works for maxrregcount = 75 with ptxas info : Used 75 registers, 2296+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]
But fails for maxrregcount = 50 with ptxas info : Used 50 registers, 3032+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]

I feel the problem is with the local memory usage, but I do not know what numbers I am missing. Is there a limit on local memory usage or something? I know the per-thread limit for sm_13 is 16 KB.
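To put the reported lmem figures in perspective, here is a back-of-the-envelope sketch of the local-memory footprint, using the ptxas numbers quoted above. Treating every launched thread as needing its own local-memory backing is an upper bound (the driver actually sizes the allocation by resident threads):

```python
# Rough upper bound on local-memory footprint for the failing
# configuration, using figures from the ptxas output quoted above.
lmem_per_thread = 3032     # bytes, from "3032+0 bytes lmem"
threads_per_block = 160
blocks = 30

total_lmem = lmem_per_thread * threads_per_block * blocks
print(total_lmem)          # 14553600 bytes, i.e. roughly 14 MB

# The per-thread local memory limit on sm_13 is 16 KB.
print(lmem_per_thread < 16 * 1024)   # True: well under the limit
```

Roughly 14 MB is small compared to the device memory of a GTX 280, and 3032 bytes per thread is far below the 16 KB per-thread limit, so a raw local-memory capacity limit looks unlikely to be the cause here.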

Thanks

There are 16384 registers available per multiprocessor on sm_13. If the total number of registers required per block is larger than that, the kernel will not run, because there are not enough resources. I am a little confused about this, because the excess should just spill to local memory, but it has happened to me that some kernels did not run because I used too many registers per thread.
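A quick sanity check of the register budget against the configurations mentioned in this thread (a sketch, assuming the 16384 registers per multiprocessor of compute capability 1.3, and ignoring the hardware's allocation granularity, which only rounds these numbers up slightly):

```python
# Registers-per-block check for an sm_13 device (GTX 280):
# a block can only be scheduled if regs_per_thread * threads_per_block
# fits within the multiprocessor's register file.
REGS_PER_SM = 16 * 1024

def fits(regs_per_thread, threads_per_block):
    return regs_per_thread * threads_per_block <= REGS_PER_SM

print(fits(110, 128))   # True:  110 * 128 = 14080 registers
print(fits(110, 192))   # False: 110 * 192 = 21120 registers
print(fits(75, 192))    # True:   75 * 192 = 14400 registers
```

So lowering to maxrregcount = 75 really was necessary for 192-thread blocks to launch at all. Note, though, that exceeding the register budget normally reports "too many resources requested for launch" rather than an unspecified launch failure, so the error seen here likely has a different cause.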

Have you run your program under [font=“Courier New”]cuda-memcheck[/font] to make sure there are no out-of-bounds memory accesses in the 192 threads per block version?

When using cuda-memcheck, the code runs correctly without any errors, while the same code fails (with an “unspecified launch failure”) when I run it directly.

My test was - Block size 160, Grid 30

     Case 1 - maxrregcount = 75 - runs normally as well as under cuda-memcheck

     Case 2 - maxrregcount = 50 - fails normally but runs under cuda-memcheck

Interesting.

Have I understood you correctly that you use the same block size in both cases and only vary the [font=“Courier New”]--maxrregcount[/font] compiler argument?

Yes that is correct.

Also, I cannot get my code to run at all with the sm_20 flag. It compiles, but upon running it gives: invalid device function (err#8). This happens both when running it in the terminal and through cuda-memcheck.

Your card is sm_13; you can’t run sm_20 code on it.

Oh, I am extremely sorry - I misread that the GTX280 had compute capability 2.0.

I will focus on solving the issues with sm_13 compiled code.

Thanks for pointing that out.