BUG: Broken register allocation, toolkit 2.3

Toolkit: 2.3
OS: Kubuntu 9.10 64bit
Compiler: gcc-4.3
Card: GeForce 9650M GT
Problem: unacceptable! register pressure

The kernel in attachment compiles to:
ptxas info : Used 60 registers, 44+0 bytes lmem, 2064+16 bytes smem, 4096 bytes cmem[0], 44 bytes cmem[1]

After struggling for a week or so with 60+ register pressure I got pissed and completely rewrote my algorithm. I expanded all macros, and after some sed|awk|perl magic I created a one-line-one-instruction version of my algorithm. (Honestly, I don’t even need a compiler now, an assembler would do.). Doing so I used only 5 variables (plain C, no structs, simply 5 uin32_t variables). Even counting in a few pointer registers (I use shared memory) it could sure as hell fit below 10 registers. But it still compiles to bloody 60 registers+spilling in nvcc. It can’t even cover the register read-after-write latency (not to mention the gmem latency nvcc CAUSED by spilling), which means nvcc completely screwed the job optimizing it.
I tried declaring variables volatile, restructuring code, offloading stuff to smem and even gmem…
Questions are:

  1. What can I do about this?
  2. Could someone compile it in toolkit 3.0 and post the statistics?
    I’m in a middle of a project, I can’t switch toolkit right now, but it would be nice to know if 3.0 does the job as it supposed to…
    cudes.cu (27.5 KB)

Impossible to say. That code of yours is completely incomprehensible, so it is pretty hard to know what to suggest.

Prepare yourself:

avidday@cuda:~$ nvcc -c -arch=sm_13 -Xptxas="-v" cudes.cu 

ptxas info	: Compiling entry function '_Z3DESPjS_' for 'sm_13'

ptxas info	: Used 96 registers, 2064+16 bytes smem, 4096 bytes cmem[0], 72 bytes cmem[1]

As a tip: I am not sure what OS you are using, but if it is POSIX like, then you might want to investigate this. I have 4 different toolkit versions installed simultaneously without issue and can do stuff like this:

avidday@cuda:~$ module list

Currently Loaded Modulefiles:

  1) mpich2/r1.1.1p1

avidday@cuda:~$ module load cuda

cuda	   cuda/2.3   cuda/3.0   cuda/3.0b  

avidday@cuda:~$ module load cuda/2.3

avidday@cuda:~$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

avidday@cuda:~$ module switch cuda/2.3 cuda/3.0

avidday@cuda:~$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Fri_Feb_19_19:12:59_PST_2010

Cuda compilation tools, release 3.0, V0.2.1221

avidday@cuda:~$ module list

Currently Loaded Modulefiles:

  1) mpich2/r1.1.1p1   2) cuda/3.0

which is great for regression testing and “tasting” beta versions and new releases without hurting anything.

Can you post source code?

I played with this kernel a bit, but all my known volatile tricks failed.

The kernel “needs” 96 registers in order to not spill to local memory. When I specify --maxrregcount=124 (the highest supported number) 96 regs is the resulting number.

Okay, when you run out of known tricks, invent unknown ones.

buchner@athlonx2:~/cudes> nvcc --keep --ptxas-options=-v --maxrregcount=124 cudes.cu  -I ~/NVIDIA_CUDA_SDK/C/common/inc/

ptxas info	: Compiling entry function '_Z3DESPjS_'

ptxas info	: Used 40 registers, 2100+16 bytes smem, 4096 bytes cmem[0]
__shared__ uint32_t one, two, four, eight, nine, sixteen,

							thirtyone, thirtytwo, thirtysix, sixtytwo, sixtythree;

		if (threadIdx.x == 0)

		{

			one = 0x1;

			two = 0x2;

			four = 0x4;

			eight = 0x8;

			nine = 0x9;

			sixteen = 0x10;

			thirtyone = 0x1f;

			thirtytwo = 0x20;

			thirtysix = 0x24;

			sixtytwo = 0x3e;

			sixtythree = 0x3f;

		}

		__syncthreads();

Replace any occurence of &=0xXY; with &=variablename; from above constants. Shared memory broadcast mechanism applies - it’s fast.

I now hold the record in kernel register reduction. That’s 41.67% of the original registers. Beat that. ;)

EDIT: seems like we need to go below 32 to get somewhere on a Compute capability >= 1.2 device.

To go <= Compute 1.1 we need to get below 16 (forget about that).

Should I invent some more? What’s the pay…

EDIT Nr.2: replacing the constants 164 through 764 with shared memory constants did not help at all. I am running out of ideas.

True, while you can’t see from the source what this code really does, it is written like that to show that whatever it does, it can be done in small amount of registers. Let’s see a small part for example:

[codebox]

t=key0>>14;

t&=0x2;

tmp=t;

t=key0>>12;

t&=0x1;

tmp|=t;

t=key0>>23;

[/codebox]

if we map variables to registers in a 1-1 fashion, we could easily translate this to assembly:

[codebox]

shr t,key0,14

and t,t,0x2

mov tmp,t

shr t,key0,12

and t,t,0x1

or tmp,tmp,t

shr t,key0,23

[/codebox]

since I use no more than 5 variables total it could be compiled to about 5 registers+few pointer registers, but nvcc can’t compile it below 90ish, which I believe is a bug. Whatever the bug really is about, let NVIDIA people figure out, WE can’t. (actually an official assembler is all I really need, I can code a compiler myself.)

I guess no point switching then…

That’s a nice piece of an ugly hack ;)

I mean, as a coder/hacker I love it, but I can’t pull stuff like that in production code, and change tricks every time hardware/driver/toolkit gets an upgrade. CUDA was about portablility, ease of maintanence and all that big-corp-project stuff, wasn’t it? Marketing BS I guess. The technology isn’t ready. With all those compiler bugs and no sane way around them, it really breaks projects, and you don’t know when, why, and how to fix them.

buchner@athlonx2:~/cudes> ptxas --version

ptxas: NVIDIA (R) Ptx optimizing assembler

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul__2_10:56:25_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

buchner@athlonx2:~/cudes>

Run nvcc with the --keep option to get an idea about what the .ptx file looks like. Then modify as needed.

There are ways to insert manually compiled or modified ptx files back into the final binary, just don’t ask me about details (I never did stuff that advanced).

Be aware that the .ptx uses single static register assignment (so seeing 5000 registers used is not unusual). It’s ptxas which then reduces the final register count more or less successfully.

NVCC itself is not a compiler, but rather a compiler driver. In verbose mode you see what commands it executes and in what order. These commands could be placed in a Makefile, so you have more control over what happens and when. For example you could skip the step that generates PTX code and instead you use your own PTX file generated by your script and pass that to PTXAS. In the end, you link everything together to a binary.

What is about performance of this code? You may have so many registers, put block size128.

I know how this stuff works :)

I used driver API+decuda/cudasm before, even wrote complete kernels from scratch in assembly alone (but that were small projects). decuda is not updated anymore, and doesn’t support Fermi unfortunately (I saw a thread somewhere that it is possible to disassemble Fermi kernels using objdump+nv50dis. Never tried it.)

Problem is PTX is an intermediate code, ptxas is a compiler (actually a part of the compiler toolchain), regardless of whatever NVIDIA calls it. Bigger problem is it sometimes can’t do it’s job right. And looking at the forums “sometimes” doesn’t seem the right word.

What is missing is an assembler that does not change your code at all. It won’t improve your code, but it won’t break it either. I can handle optimizations myself. Current toolkit obviously can’t.

Thank you for the tries and ideas :).

It is unfortunate that you can’t get full performance out of a decent hardware because of a bugged software. I’m suspending my CUDA projects until NVIDIA decides to release true machine-code assembler (my guess would be: never), or at least provides a way to turn optimizations off.

Or I’ll give Stream a try.

But in this case it might actually work if you did the register assignment manually and just handed the result off to PTXAS.

It’s such a small step from your script that generated the .CU kernel code in the first place.

Learning Stream from scratch might be equally or even more painful.

I share your complaints about the immaturity of the CUDA toolkit. It seems the feature set evolves more quickly than the code can stabilize.

Christian

At 40 registers/thread and 128 threads/block on sm_11 (my testing/notebook card) you get 17% occupancy, which isn’t even enough to cover register read-after-write latency.

At original 96 registers/thread you get 13% occupancy on sm_13, same problem.

if you limit it to use less register and spill to lmem then you have to cover gmem latency as well.

point is you obviously can do better then ptxas.

Also, this is NOT a complete project. And how am I supposed to believe the complete project will perform well, if reduced one sucks?

I mean in compare with cpu.

Actually, I agree with you, I want register keyword. And I want to control register spilling manualy. I do not get this idea about compiler control, compiler could not know much. And cuda is about optimization of small functions.

It is doing about 6x faster then CPU at similar optimization level. In comparison to hand-crafted CPU assembly (core2 duo) it loses though.

I’m getting 15.4x speedup with other, similar kernels (without this issue).

Actually it might work, just a few sed/awk/perl passes will do :)

When I’ll calm down a bit I’ll try it, but I have a bad feeling ptxas will optimize out my changes.

I’ll post when done.

This is one of most dissapointing things in cuda, you do not know, where do your registers go.

Maybe compiling this with various optimization modes, from -G0 through G4 might give
some ideas about what is going on with the register allocation.

In a different thread someone said that -G0 brought register count down, but increased
local memory use a lot.