Kernel works with a smaller number of threads but fails with a CUDA error as I increase the total threads spawned

Hi all,

I have a ‘persistent threads’ type implementation to handle a worklist (producer-consumer) type algorithm.
It is a heavy kernel: it has many local variables and double-precision computation.
To sum it up, I have a kernel that uses lots of registers and local memory. It fails with an unspecified launch error when I spawn more threads (it works for 128 × 30 but fails for 192 × 30).

More description -
I am using a GTX280 with CUDA compilation tools, release 4.1, V0.2.1221, and NVIDIA UNIX x86_64 Kernel Module 285.05.32.

My kernel gives perfect results when compiled with up to 128 threads per block and 30 blocks.
By default (without using maxrregcount) the ptxas info obtained for sm_13 is:
ptxas info : Used 110 registers, 2184+0 bytes lmem, 112+16 bytes smem, 156 bytes cmem[1]

As you can see, the code is currently not very well written, in that it has heavy register and local memory usage.

In order to increase the number of active blocks (warps), I used maxrregcount = 75 (to reduce register usage and thus increase the block size from 128 to 192). I get the following ptxas info for this build:
ptxas info : Used 75 registers, 2296+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]
And the code fails with: ERROR: cudaGetLastError() returned unspecified launch failure (err#4)

Is this failure due to the local/constant memory usage, which goes out of bounds when the total number of threads spawned increases?

Thanks
Sid.

Also, when I run with block size 160 and grid size 30:
It works for maxrregcount = 75 with ptxas info : Used 75 registers, 2296+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]
But fails for maxrregcount = 50 with ptxas info : Used 50 registers, 3032+0 bytes lmem, 112+16 bytes smem, 140 bytes cmem[1]

I feel the problem is with the local memory usage, but I do not know what numbers I am missing. Is there a limit on local memory usage or something? I know the per-thread limit for sm_13 is 16 KB.
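To put the reported lmem figures in perspective, here is a back-of-the-envelope sketch of the local-memory footprint, using the ptxas numbers quoted above. Treating every launched thread as needing its own local-memory backing is an upper bound (the driver actually sizes the allocation by resident threads):

```python
# Rough upper bound on local-memory footprint for the failing
# configuration, using figures from the ptxas output quoted above.
lmem_per_thread = 3032     # bytes, from "3032+0 bytes lmem"
threads_per_block = 160
blocks = 30

total_lmem = lmem_per_thread * threads_per_block * blocks
print(total_lmem)          # 14553600 bytes, i.e. roughly 14 MB

# The per-thread local memory limit on sm_13 is 16 KB.
print(lmem_per_thread < 16 * 1024)   # True: well under the limit
```

Roughly 14 MB is small compared to the device memory of a GTX 280, and 3032 bytes per thread is far below the 16 KB per-thread limit, so a raw local-memory capacity limit looks unlikely to be the cause here.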

Thanks

There are 16384 registers available per multiprocessor on sm_13. If the total number of registers required per block is larger than that, the kernel will not run, because there are not enough resources. I am a little confused about this, because the excess should just spill to local memory, but it has happened to me that some kernels did not run because I used too many registers per thread.
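A quick sanity check of the register budget against the configurations mentioned in this thread (a sketch, assuming the 16384 registers per multiprocessor of compute capability 1.3, and ignoring the hardware's allocation granularity, which only rounds these numbers up slightly):

```python
# Registers-per-block check for an sm_13 device (GTX 280):
# a block can only be scheduled if regs_per_thread * threads_per_block
# fits within the multiprocessor's register file.
REGS_PER_SM = 16 * 1024

def fits(regs_per_thread, threads_per_block):
    return regs_per_thread * threads_per_block <= REGS_PER_SM

print(fits(110, 128))   # True:  110 * 128 = 14080 registers
print(fits(110, 192))   # False: 110 * 192 = 21120 registers
print(fits(75, 192))    # True:   75 * 192 = 14400 registers
```

So lowering to maxrregcount = 75 really was necessary for 192-thread blocks to launch at all. Note, though, that exceeding the register budget normally reports "too many resources requested for launch" rather than an unspecified launch failure, so the error seen here likely has a different cause.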

Have you run your program under [font=“Courier New”]cuda-memcheck[/font] to make sure there are no out-of-bounds memory accesses in the 192 threads per block version?

When using cuda-memcheck, the code runs correctly without any errors, while the same code fails (with an “unspecified launch failure”) when I run it directly.

My test was - Block size 160, Grid 30

     Case 1 - maxrregcount = 75 - runs normally as well as under cuda-memcheck

     Case 2 - maxrregcount = 50 - fails normally but runs under cuda-memcheck

Interesting.

Have I understood you correctly that you use the same block size in both cases and only vary the [font=“Courier New”]--maxrregcount[/font] compiler argument?

Yes that is correct.

Also, I cannot get my code to run at all with the sm_20 flag. It compiles, but upon running it gives: invalid device function (err#8). This happens both when running it in the terminal and through cuda-memcheck.

Your card is sm_13; you can’t run sm_20 code on it.

Oh, I am extremely sorry - I misread that the GTX280 had compute capability 2.0.

I will focus on solving the issues with sm_13 compiled code.

Thanks for pointing that out.