As far as I know, the G80/G92 architecture has 8192 registers per multiprocessor. The Visual Profiler tells me my kernel uses 81 registers.
81 * 96 = 7776 < 8192
Yet I get the dreaded “too many resources for launch” error when launching 96 threads per block instead of 64. Even when limiting the kernel to 72 registers, I get the same error. I don’t understand why.
Shared memory is not an issue (the kernel uses only 24 bytes of static shared memory).
Grid dim is on the order of (1000, 1, 1)
Block dim is (96, 1, 1)
This is bizarre. I want to achieve better occupancy, but can’t have it.
Repro attached. I cut all the comments (except the copyright notice) and moved all the code into a single .cu file.
A Visual C++ project for SDK 2.2.1 is included, tested with Toolkit 2.3 on 32-bit Vista. If you don’t use the project file, be sure to compile with --maxrregcount=256 to prevent the compiler from spilling registers to local memory.
To reproduce the problem, increase the STACKHEIGHT define from 64 to 96 and watch it fail. Shoot me if I missed something really obvious.
Part of the .cubin is reproduced to verify that the kernel really uses 81 registers.
The SDK is a misnomer: what we call the SDK is a bunch of code samples, and cutil is not production-quality software. You really should not use it. The toolkit is what people would generally consider the software development kit, and everything contained therein is production-quality software. (Don’t ask me why the naming works this way.)
Yeah, the occupancy calculator gets this right (I assumed you had looked at that first and that there was a discrepancy). Basically, it’s a lot more involved than just (regs per thread * threads per block) < registers per multiprocessor.
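For anyone who wants to see the arithmetic, here is a rough host-side sketch of the kind of per-block register calculation the occupancy calculator performs. It is not from the original posts. It assumes the allocation rule for compute capability 1.0/1.1 parts: the warp count of a block is rounded up to a multiple of 2, and the resulting register total is rounded up to a multiple of 256. The granularity constants and the names `roundUp`, `regsPerBlock`, `blockFits` and `report` are illustrative assumptions, so check them against the occupancy calculator spreadsheet for your toolkit version.

```cpp
#include <cstdio>

// ASSUMED allocation parameters for compute capability 1.0/1.1 (G80/G92).
// These mirror what the occupancy calculator is believed to use; verify
// them against the spreadsheet shipped with your toolkit.
constexpr int kRegsPerSM       = 8192; // registers per multiprocessor
constexpr int kWarpSize        = 32;   // threads per warp
constexpr int kWarpGranularity = 2;    // assumed: warps are allocated in pairs
constexpr int kRegGranularity  = 256;  // assumed: registers come in units of 256

// Round x up to the nearest multiple of m.
constexpr int roundUp(int x, int m) { return ((x + m - 1) / m) * m; }

// Estimated registers reserved for one block under the assumed rule.
constexpr int regsPerBlock(int regsPerThread, int threadsPerBlock) {
    return roundUp(
        roundUp((threadsPerBlock + kWarpSize - 1) / kWarpSize, kWarpGranularity)
            * kWarpSize * regsPerThread,
        kRegGranularity);
}

// True if a single block fits into the register file of one multiprocessor.
constexpr bool blockFits(int regsPerThread, int threadsPerBlock) {
    return regsPerBlock(regsPerThread, threadsPerBlock) <= kRegsPerSM;
}

// Hypothetical helper: print the estimate for one launch configuration.
inline void report(int regsPerThread, int threadsPerBlock) {
    std::printf("%d regs/thread, %d threads/block: %d registers per block (%s)\n",
                regsPerThread, threadsPerBlock,
                regsPerBlock(regsPerThread, threadsPerBlock),
                blockFits(regsPerThread, threadsPerBlock) ? "fits" : "too many");
}
```

Under these assumptions, 96 threads are 3 warps, which get rounded up to 4, so the block is charged as if it had 128 threads. At 81 registers per thread that is 10368 registers, rounded up to 10496, which is more than 8192. At 72 registers it is still 9216. Only at 64 registers per thread or fewer (exactly 8192) would a 96-thread block fit, while a 64-thread block at 81 registers needs just 5376.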