Problem creating 96 threads with 81 registers each

As far as I know, the G80/G92 architecture has 8192 registers per multiprocessor. The Visual Profiler tells me my kernel uses 81 registers per thread.

81 * 96 = 7776 < 8192

Yet I get the dreaded “too many resources for launch” error when running 96 threads instead of 64. Even when limiting the registers to 72, I get the same error. I don’t understand why.

Shared memory is not an issue (using 24 bytes static shared memory only).

Grid dim is on the order of (1000, 1, 1)
Block dim is (96, 1, 1)

This is bizarre. I want to achieve better occupancy, but can’t have it.

Post a repro?

It would be a couple hundred lines long. 4x4 matrix code… Extends on the code in “fun with complex numbers” thread.

I’ll see if I can condense it a little, but I am not too positive about that.

Repro attached. I cut all the comments (except the copyright notice) and moved all code into a single .cu file.

Visual C++ project for SDK 2.2.1 included, tested with Toolkit 2.3 on Vista 32 bit. If you don’t use the project file, be sure to compile with --maxrregcount=256 to prevent it from spilling registers to local memory.

To reproduce the problem, increase STACKHEIGHT #define from 64 to 96 and see it fail. Shoot me if I missed something really obvious.

Part of the .cubin is reproduced below to verify that the kernel really uses 81 registers.

architecture {sm_10}
abiversion   {1}
modname	  {cubin}
code {
	name = _Z10testKernelP14matrixstack4_4ILi64EES1_
	lmem = 0
	smem = 24
	reg  = 81
	bar  = 0
	const {
			segname = const
			segnum  = 1
			offset  = 0
			bytes   = 12
		mem {
			0x7e800000 0x3e800000 0x00000200
	bincode {

arghhhhh why do people still give me repro cases that rely on cutil…

You give us an SDK to develop and test code with, we develop with it ;) It’s that simple.

The SDK is a misnomer: what we call the SDK is a bunch of code samples, and cutil is not production-quality software. You really should not use it. The toolkit is what people would generally consider the software development kit, and everything contained therein is production-quality software. (Don’t ask me why the naming works this way.)

Removed cutil dependency at special request of tmurray ;)

Reiterating how to reproduce the problem:

Compile with --maxrregcount=256; the .cubin will then report 81 registers per thread.

Use #define STACKHEIGHT 96 to trigger a failure that, going by the expected register use, shouldn’t occur.

Christian (6.36 KB)

Yes I know, it’s like the 11th commandment.

But I was using cutil only in test code for several matrix classes that have no dependency on cutil at all. So I feel safe ;)

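The warp allocation granularity is 2 for all compute capability 1.x devices, so a 96-thread block (3 warps) is allocated as 4 warps, i.e. 128 threads’ worth of registers. You actually need 10752 registers, which exceeds 8192, hence the failure.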
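To make that arithmetic concrete, here is a minimal host-side sketch of the rounding, as I understand the occupancy calculator to do it. The names (AllocParams, registersPerBlock, maxThreadsThatFit, kAssumedG80) are made up for illustration. The register allocation unit of 512 is an assumption, chosen because it reproduces the 10752 figure above. The unit listed for a particular device may differ. A unit of 256 would give 10496, which still exceeds 8192.

```cpp
#include <cassert>

// Hypothetical parameter set; only warpGranularity (2) and regsPerSM (8192)
// come from the thread. regAllocUnit is an assumption (see text above).
struct AllocParams {
    int warpSize;         // threads per warp
    int warpGranularity;  // warps are allocated in multiples of this
    int regAllocUnit;     // registers are allocated in chunks of this size (assumed)
    int regsPerSM;        // register file size per multiprocessor
};

const AllocParams kAssumedG80 = {32, 2, 512, 8192};

// Round value up to the next multiple of 'multiple'.
static int roundUp(int value, int multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Registers reserved for one block after both rounding steps.
int registersPerBlock(int threads, int regsPerThread, const AllocParams& p) {
    int warps = (threads + p.warpSize - 1) / p.warpSize;  // 96 threads -> 3 warps
    int allocWarps = roundUp(warps, p.warpGranularity);   // 3 warps -> 4 warps
    return roundUp(allocWarps * p.warpSize * regsPerThread, p.regAllocUnit);
}

// Largest block size (multiple of the warp size, up to 512 threads) whose
// register allocation still fits on one multiprocessor; 0 if none fits.
int maxThreadsThatFit(int regsPerThread, const AllocParams& p) {
    int best = 0;
    for (int threads = p.warpSize; threads <= 512; threads += p.warpSize) {
        if (registersPerBlock(threads, regsPerThread, p) <= p.regsPerSM) {
            best = threads;
        }
    }
    return best;
}
```

Under these assumptions, registersPerBlock(96, 81, kAssumedG80) gives 10752 and registersPerBlock(96, 72, kAssumedG80) gives 9216, so both launches fail, while 64 threads at 81 registers come to 5632 and fit. A 96-thread block would only fit at 64 registers per thread or fewer.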

Yeah, the occupancy calculator gets this right (I assumed you had looked at that first and that there was a discrepancy). Basically, it’s a lot more involved than just (regs per thread * threads per block) < registers.

I am not friends with Microsoft Excel, in fact we have something like a hate relationship.

cudaFuncGetAttributes returns the maximum number of threads per block for a given kernel, and it shouldn’t lie to you.

Thanks, I will check it out.

Thanks for the info. I am looking into getting a compute 1.2 device (GT 220 for now).

Register pressure seems to be pretty contagious on the forum these days. I am seeing multiple threads about such issues.