Problem creating 96 threads with 81 registers each

cbuchner1 · October 14, 2009, 8:55pm

As far as I know, the G80/G92 architecture has 8192 registers per multiprocessor. The Visual Profiler tells me my kernel uses 81 registers per thread.

81*96 < 8192

Yet I get the dreaded “too many resources for launch” error when running 96 threads instead of 64. Even when limiting the registers to 72, I get the same error. I don’t understand why.

Shared memory is not an issue (using 24 bytes static shared memory only).

Grid dim is on the order of (1000, 1, 1)
Block dim is (96, 1, 1)

This is bizarre. I want to achieve better occupancy, but can’t have it.

tmurray · October 14, 2009, 9:02pm

Post a repro?

cbuchner1 · October 14, 2009, 9:06pm

It would be a couple hundred lines long. 4x4 matrix code… Extends on the code in “fun with complex numbers” thread.

I’ll see if I can condense it a little, but I am not too positive about that.

cbuchner1 · October 14, 2009, 9:32pm

Repro attached, I cut all the comments (except the copyright notice) and moved all code into a single .cu file.

Visual C++ project for SDK 2.2.1 included, tested with Toolkit 2.3 on Vista 32 bit. If you don’t use the project file, be sure to compile with --maxrregcount=256 to prevent it from spilling registers to local memory.

To reproduce the problem, increase STACKHEIGHT define from 64 to 96 and see it fail. Shoot me if I missed something really obvious.

Here is part of the .cubin, to verify that it really uses 81 registers:

architecture {sm_10}
abiversion   {1}
modname	  {cubin}
code {
	name = _Z10testKernelP14matrixstack4_4ILi64EES1_
	lmem = 0
	smem = 24
	reg  = 81
	bar  = 0
	const {
			segname = const
			segnum  = 1
			offset  = 0
			bytes   = 12
		mem {
			0x7e800000 0x3e800000 0x00000200
		}
	}
	bincode {
	...
	}
}

Christian

tmurray · October 14, 2009, 9:54pm

arghhhhh why do people still give me repro cases that rely on cutil…

cbuchner1 · October 14, 2009, 9:59pm

You give us an SDK to develop and test code with, we develop with it ;) It’s that simple.

tmurray · October 14, 2009, 10:01pm

the SDK is a misnomer: what we call the SDK is a bunch of code samples, and cutil is not production-quality software. you really should not use it. the toolkit is what people would generally consider the software development kit, and everything contained therein is production-quality software. (don’t ask me why the naming works this way)

cbuchner1 · October 14, 2009, 10:11pm

Removed cutil dependency at special request of tmurray ;)

Reiterating how to reproduce the problem:

compile with --maxrregcount=256, .cubin will require 81 registers per thread

use #define STACKHEIGHT 96 to trigger a problem that shouldn’t be there according to expected register use.

Christian
complextest.zip (6.36 KB)

cbuchner1 · October 14, 2009, 10:13pm

Yes I know, it’s like the 11th commandment.

But I was using cutil only in test code for several matrix classes that have no dependency on cutil at all. So I feel safe ;)

plegresley · October 14, 2009, 10:55pm

The warp allocation granularity is 2 for all compute capability 1.x devices, so you actually need 10752 registers and hence the failure.
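To make the rounding concrete, here is a minimal host-side sketch of the arithmetic. It assumes a warp size of 32 and uses the 8192-register figure from the first post and the granularity of 2 from the reply above. The function names are made up for illustration. It only models the warp rounding, so treat the result as a lower bound rather than the allocator's exact number.

```cpp
// Assumed parameters for compute capability 1.0/1.1 (G80/G92).
const int kWarpSize = 32;
const int kWarpAllocGranularity = 2;  // from the reply above
const int kRegistersPerSM = 8192;     // from the first post

// Round value up to the next multiple of `multiple`.
inline int round_up(int value, int multiple) {
    return ((value + multiple - 1) / multiple) * multiple;
}

// Lower bound on the registers one block occupies once its warp count
// has been rounded up to the allocation granularity.
// (Hypothetical helper; it does not model any further rounding the
// hardware allocator may apply.)
inline int min_registers_for_block(int threads_per_block, int regs_per_thread) {
    int warps = round_up(threads_per_block, kWarpSize) / kWarpSize;
    int allocated_warps = round_up(warps, kWarpAllocGranularity);
    return allocated_warps * kWarpSize * regs_per_thread;
}

// True if a single block passes this (necessary, not sufficient) check.
inline bool block_fits(int threads_per_block, int regs_per_thread) {
    return min_registers_for_block(threads_per_block, regs_per_thread)
           <= kRegistersPerSM;
}
```

With these assumptions, 64 threads at 81 registers come to 2 warps and 5184 registers, which fits. 96 threads are 3 warps, allocated as 4, so at least 128 * 81 = 10368 registers, which does not fit. Limiting to 72 registers still needs 128 * 72 = 9216, so that fails too. 96 threads would only fit at 64 registers per thread or fewer. The 10752 figure quoted above equals 128 * 84, which would be consistent with the per-thread count also being rounded up, but that part is a guess and the sketch does not model it.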

tmurray · October 14, 2009, 10:56pm

yeah, occupancy calculator gets this right (I assumed you had looked at that first and that there was a discrepancy). basically, it’s a lot more involved than just (regs per thread * threads per block) < registers.

cbuchner1 · October 14, 2009, 10:58pm

I am not friends with Microsoft Excel, in fact we have something like a hate relationship.

tmurray · October 14, 2009, 11:08pm

cudaFuncGetAttributes returns max number of threads per block, and it shouldn’t lie to you.
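A minimal sketch of that query, for anyone who wants to avoid the spreadsheet. exampleKernel is a hypothetical stand-in for the real kernel, and the call is written against a current CUDA runtime; the exact signature in the 2.3-era toolkit may differ slightly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel you actually launch.
__global__ void exampleKernel(float *data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}

int main() {
    // Ask the runtime what the compiled kernel actually needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, exampleKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    printf("registers per thread : %d\n", attr.numRegs);
    printf("static shared memory : %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);

    // Compare the intended block size against the reported limit
    // before launching, instead of waiting for the launch to fail.
    const int requestedThreads = 96;  // block size from this thread
    if (requestedThreads > attr.maxThreadsPerBlock) {
        printf("a block of %d threads would not launch\n", requestedThreads);
    }
    return 0;
}
```

Checking maxThreadsPerBlock up front means the limit comes from the runtime itself rather than from hand arithmetic on the register count.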

cbuchner1 · October 14, 2009, 11:14pm

Thanks, I will check it out.

cbuchner1 · October 15, 2009, 8:27am

Thanks for the info. I am looking into getting a compute 1.2 device (GT 220 for now).

Register pressure seems to be pretty contagious in the forum these days. I'm seeing multiple threads about such issues.

Christian