I’ve got the following question: I’m launching a little fewer than 500 threads (493, to be precise), and only a single block.
However, when executing in debug mode I get the error “cudaErrorLaunchOutOfResources” (msg: “too many resources requested for launch”).
I’m using a GeForce 9600GT, which should support 512 threads per block.
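For reference, this is roughly how the failure shows up at runtime (a minimal sketch - the kernel below is only an empty placeholder with the same signature shape as mine; with my real kernel this launch is exactly what fails):

#include <cstdio>
#include <cuda_runtime.h>

// placeholder standing in for my real kernel (three float2 pointers and an
// int; all data lives in global memory)
__global__ void SomeDeviceFct(float2 *a, float2 *b, float2 *c, int n)
{
    // ... some computations in a loop ...
}

int main()
{
    const int threads = 493;
    float2 *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, threads * sizeof(float2));
    cudaMalloc((void**)&d_b, threads * sizeof(float2));
    cudaMalloc((void**)&d_c, threads * sizeof(float2));

    SomeDeviceFct<<<1, threads>>>(d_a, d_b, d_c, threads);   // one block, 493 threads

    cudaError_t err = cudaGetLastError();   // launch-configuration errors land here
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}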
So my questions are:
how come I’m already running out of resources?
does it maybe have something to do with the number of registers allocated per thread? If so, how can I find out how many registers are currently being used?
(note: my kernel code is fairly simple… it only uses global memory and performs some computations in a loop… I could post it if necessary for your analysis)
in emudebug mode everything works fine; does this mode not check the launch configuration at all? (if it is a register-allocation issue, though, I suppose emulation cannot figure out how many registers will be used on the actual device?)
The profiler reports on the execution of the kernel, so as it’s not starting, it won’t do much good.
The information in the cubin file is what you want, since you are most likely running out of registers…
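e.g., compiling with

nvcc --ptxas-options=-v -c mykernel.cu

makes ptxas print the registers and smem used for each kernel (the file name above is just an example); alternatively, build with -cubin and look at the reg/smem entries in the generated .cubin file.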
Instead of using one block, use 2 of 256, or 4 of 128. I bet it’ll launch.
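i.e. something along these lines (the kernel name and arguments are just placeholders, not your actual code):

int threadsPerBlock = 256;
int numBlocks = (493 + threadsPerBlock - 1) / threadsPerBlock;   // = 2 blocks of 256 threads
SomeDeviceFct<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, 493);

with a bounds check in the kernel so the padded threads (2*256 = 512 > 493) do nothing:

int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i >= n) return;   // n = 493 here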
well, thanks guys - changing to a different execution configuration did the trick… my kernel really was using too many registers
however, out of curiosity, I have the following follow-up question:
provided that I only wanted to use a single block (despite the fact that this is of course not a good idea for overall performance) - shouldn’t it then be possible to compute the maximum number of threads allowed within that block (given the number of registers used, the shared memory, and my GPU’s capabilities)?
I tried to compute it as follows and compared it to the actual results in practice - could somebody verify this?:
my GPU is a GeForce 9600 GT, with the following relevant resources:
Total amount of shared memory per block: 16,384 bytes
Total number of registers available per block: 8,192
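(for reference, these figures can also be read out at runtime; a minimal sketch, assuming device 0 is the 9600 GT:)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 = the 9600 GT in my case
    printf("regs per block:        %d\n", prop.regsPerBlock);                             // 8192
    printf("smem per block:        %lu bytes\n", (unsigned long)prop.sharedMemPerBlock);  // 16384
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);                       // 512
    return 0;
}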
the output of the nvcc compiler shows the following resource usage for my kernel:
1>ptxas info : Compiling entry function '__globfunc__Z19SomeDeviceFctP6float2S0_S0_i'
1>ptxas info : Used 17 registers, 32+28 bytes smem, 16 bytes cmem[1]
therefore, I expected the maximum number of allowed threads (if only a single block is utilized) to be:
- either 8,192 total registers / (17 registers/thread) ≈ 481.9 threads
- or 16,384 bytes smem / (32 bytes smem per thread) = 512 threads (though, strictly, the 32+28 bytes reported by ptxas are per block rather than per thread, so shared memory isn’t really the limiting factor here; 512 also happens to be the hardware maximum number of threads per block)
- take the minimum of these two and round it down to the next multiple of 32 (the warp size); strictly that would be roundDown32(481.9) = 480, but since registers on compute-capability 1.0/1.1 devices appear to be allocated at a granularity of two warps per block, the thread count effectively has to round down to a multiple of 64, which gives 448
I verified this in practice, and indeed, 448 threads in a single block was the last successful execution configuration
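For anyone who wants to replay the numbers, here is a small sketch of that calculation (the two-warp register-allocation granularity is an assumption taken from the occupancy-calculator figures for compute capability 1.0/1.1; the other constants are the ones quoted above):

#include <stdio.h>

int main(void)
{
    const int regsPerBlock    = 8192;  /* GeForce 9600 GT: registers per block         */
    const int maxThreadsHW    = 512;   /* hardware limit on threads per block          */
    const int regsPerThread   = 17;    /* from "ptxas info : Used 17 registers"        */
    const int warpSize        = 32;
    const int warpGranularity = 2;     /* registers allocated per 2 warps (assumption) */

    /* largest thread count (multiple of the warp size) whose register
       footprint, rounded up to the allocation granularity, still fits */
    int best = 0;
    for (int threads = warpSize; threads <= maxThreadsHW; threads += warpSize) {
        int warps      = threads / warpSize;
        int allocWarps = ((warps + warpGranularity - 1) / warpGranularity) * warpGranularity;
        int regsNeeded = allocWarps * warpSize * regsPerThread;
        if (regsNeeded <= regsPerBlock)
            best = threads;
    }

    printf("max threads in a single block: %d\n", best);   /* prints 448 */
    return 0;
}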