<500 threads and out of resources? 9600GT should support 512 threads/block


I have the following question: I’m launching slightly fewer than 500 threads (493 to be precise), all in a single block.

However, when I run it in debug mode I get the error “cudaErrorLaunchOutOfResources” (message: too many resources requested for launch).

I’m using a GeForce 9600GT, which should support 512 threads per block.

So my questions are:

  • how come I’m already running out of resources?

  • does it maybe have something to do with the registers allocated per thread? If so, how can I find out how many registers are being used right now?
    (note: my kernel code is fairly simple; it uses only global memory and performs some computations in a loop. I could post it if necessary for your analysis)

  • in emudebug mode everything’s fine; does this mode not check the accepted launch configurations at all? (if it’s an issue with the register allocation however, I suppose it cannot figure out how many registers will be used on the actual device?)

Any help is appreciated.

That looks like you’ve run out of registers.

Are you using the NVIDIA SDK Makefile thing? Try “make show=1”

and add the option --ptxas-options=-v to your compilation line, and you’ll see how many registers are being used.

Thanks so far for your replies.

No, I’m not using the makefile; I’m building with VS 2005 on Windows XP for now.
I’ll give it a shot tomorrow with this command-line option.

Also, would the profiler give me some helpful information on that, too? (I haven’t used the profiler yet.)

  • michael

The compiler reports on the execution of the kernel, so as it’s not starting, it won’t do much good.
The information in the cubin file is what you want, since you are most likely running out of registers…
Instead of using one block, use 2 of 256, or 4 of 128. I bet it’ll launch.
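To make the suggestion above concrete, here is a minimal host-side sketch of how the block count follows from the thread count. The function name `blocks_needed` is made up for illustration and does not come from the CUDA API; the numbers are the 493 threads mentioned in the question.

```cpp
#include <cstddef>

// Hypothetical helper: number of blocks needed to cover `total_threads`
// work items when each block holds `threads_per_block` threads.
// This is plain integer ceiling division.
inline std::size_t blocks_needed(std::size_t total_threads,
                                 std::size_t threads_per_block) {
    return (total_threads + threads_per_block - 1) / threads_per_block;
}
```

With 493 threads this gives 2 blocks of 256 or 4 blocks of 128. Either way the grid contains more than 493 threads in total, so the kernel would typically need a bounds check on the global thread index.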

Didn’t you mean profiler?

probably…but I got the picture anyway. will try it soon and let you know…

Doh! yes, of course.

Must not post when waking up…

well thanks guys - changing to a different execution configuration did the trick…my kernel really used too many registers

however, out of curiosity, I have the following follow-up question:

  • provided that I would only like to use a single block (despite the fact that this is of course not a good idea for overall performance) - shouldn’t it be possible to compute the maximum number of allowed threads within that block then (given the # of used registers, shared mem, and my GPU capabilities)?

  • I tried to compute it like this and compared it to the actual results from practice - could somebody verify this?:

  • my GPU is a GeForce 9600 GT, with the following relevant resources:

Total amount of shared memory per block:       16,384 bytes
Total number of registers available per block: 8,192

  • the output of the nvcc compiler shows the following resource usage for my kernel:

1>ptxas info    : Compiling entry function '__globfunc__Z19SomeDeviceFctP6float2S0_S0_i'
1>ptxas info    : Used 17 registers, 32+28 bytes smem, 16 bytes cmem[1]

  • therefore, I expected the maximum number of allowed threads (if only a single block is utilized) to be:
- either 8,192 total registers / (17 registers/thread) = 481.8 threads

- or 16,384 bytes smem / (32 bytes smem/kernel) = 512 threads

- take the minimum number of these two, and round it down to the next multiple of 32 (which reflects the warp size) - in this case, this is roundDown32(481.8) = 448
  • I verified this in practice, and indeed, 448 threads in a single block was the last successful execution configuration

comments welcome!
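The register part of the estimate above can be sketched as a small host-side function. The name `max_threads_per_block` and its `granularity` parameter are made up for illustration. Note that rounding 481 down to a plain multiple of 32 gives 480, not 448. The 448 seen in practice would match rounding down to a multiple of 64 threads (two warps), which is only an assumption about how this GPU allocates registers and is not confirmed anywhere in this thread. The shared memory figure is left out because ptxas reports it per block, so it does not grow with the thread count.

```cpp
#include <algorithm>

// Hypothetical helper: largest thread count for a single block that fits
// the register budget, rounded down to a multiple of `granularity` threads
// and capped at the device limit `max_threads`.
// `granularity` is an assumption: 32 is one warp, 64 is two warps.
inline int max_threads_per_block(int regs_per_block, int regs_per_thread,
                                 int granularity, int max_threads) {
    int by_registers = regs_per_block / regs_per_thread;       // 8192 / 17 = 481
    int rounded = (by_registers / granularity) * granularity;  // round down
    return std::min(rounded, max_threads);
}
```

Calling it with 8,192 registers per block, 17 registers per thread and a limit of 512 yields 480 for a granularity of 32 and 448 for a granularity of 64.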

only 2 comments:

  • the CUDA Occupancy Calculator does these kinds of calculations for you, together with some nice graphs
  • you can launch as many blocks as you want (up to 65535x65535), since it is the resource usage per block that determines whether your kernel will launch.