BUG: Broken register allocation, toolkit 2.3

wwa · May 6, 2010, 3:49pm

Toolkit: 2.3
OS: Kubuntu 9.10 64bit
Compiler: gcc-4.3
Card: GeForce 9650M GT
Problem: unacceptable! register pressure

The kernel in attachment compiles to:
ptxas info : Used 60 registers, 44+0 bytes lmem, 2064+16 bytes smem, 4096 bytes cmem[0], 44 bytes cmem[1]

After struggling for a week or so with 60+ register pressure I got pissed and completely rewrote my algorithm. I expanded all macros, and after some sed|awk|perl magic I created a one-line-one-instruction version of my algorithm. (Honestly, I don’t even need a compiler now, an assembler would do.). Doing so I used only 5 variables (plain C, no structs, simply 5 uin32_t variables). Even counting in a few pointer registers (I use shared memory) it could sure as hell fit below 10 registers. But it still compiles to bloody 60 registers+spilling in nvcc. It can’t even cover the register read-after-write latency (not to mention the gmem latency nvcc CAUSED by spilling), which means nvcc completely screwed the job optimizing it.
I tried declaring variables volatile, restructuring code, offloading stuff to smem and even gmem…
Questions are:

What can I do about this?
Could someone compile it in toolkit 3.0 and post the statistics?
I’m in a middle of a project, I can’t switch toolkit right now, but it would be nice to know if 3.0 does the job as it supposed to…
cudes.cu (27.5 KB)

avidday · May 6, 2010, 4:01pm

Impossible to say. That code of yours is completely incomprehensible, so it is pretty hard to know what to suggest.

Prepare yourself:

avidday@cuda:~$ nvcc -c -arch=sm_13 -Xptxas="-v" cudes.cu 

ptxas info	: Compiling entry function '_Z3DESPjS_' for 'sm_13'

ptxas info	: Used 96 registers, 2064+16 bytes smem, 4096 bytes cmem[0], 72 bytes cmem[1]

As a tip: I am not sure what OS you are using, but if it is POSIX like, then you might want to investigate this. I have 4 different toolkit versions installed simultaneously without issue and can do stuff like this:

avidday@cuda:~$ module list

Currently Loaded Modulefiles:

  1) mpich2/r1.1.1p1

avidday@cuda:~$ module load cuda

cuda	   cuda/2.3   cuda/3.0   cuda/3.0b  

avidday@cuda:~$ module load cuda/2.3

avidday@cuda:~$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul_30_09:24:36_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

avidday@cuda:~$ module switch cuda/2.3 cuda/3.0

avidday@cuda:~$ nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Fri_Feb_19_19:12:59_PST_2010

Cuda compilation tools, release 3.0, V0.2.1221

avidday@cuda:~$ module list

Currently Loaded Modulefiles:

  1) mpich2/r1.1.1p1   2) cuda/3.0

which is great for regression testing and “tasting” beta versions and new releases without hurting anything.

Lev · May 6, 2010, 4:03pm

Can you post source code?

cbuchner1 · May 6, 2010, 4:22pm

I played with this kernel a bit, but all my known volatile tricks failed.

The kernel “needs” 96 registers in order to not spill to local memory. When I specify --maxrregcount=124 (the highest supported number) 96 regs is the resulting number.

cbuchner1 · May 6, 2010, 4:26pm

Okay, when you run out of known tricks, invent unknown ones.

buchner@athlonx2:~/cudes> nvcc --keep --ptxas-options=-v --maxrregcount=124 cudes.cu  -I ~/NVIDIA_CUDA_SDK/C/common/inc/

ptxas info	: Compiling entry function '_Z3DESPjS_'

ptxas info	: Used 40 registers, 2100+16 bytes smem, 4096 bytes cmem[0]

__shared__ uint32_t one, two, four, eight, nine, sixteen,

							thirtyone, thirtytwo, thirtysix, sixtytwo, sixtythree;

		if (threadIdx.x == 0)

		{

			one = 0x1;

			two = 0x2;

			four = 0x4;

			eight = 0x8;

			nine = 0x9;

			sixteen = 0x10;

			thirtyone = 0x1f;

			thirtytwo = 0x20;

			thirtysix = 0x24;

			sixtytwo = 0x3e;

			sixtythree = 0x3f;

		}

		__syncthreads();

Replace any occurence of &=0xXY; with &=variablename; from above constants. Shared memory broadcast mechanism applies - it’s fast.

I now hold the record in kernel register reduction. That’s 41.67% of the original registers. Beat that. ;)

EDIT: seems like we need to go below 32 to get somewhere on a Compute capability >= 1.2 device.

To go <= Compute 1.1 we need to get below 16 (forget about that).

Should I invent some more? What’s the pay…

EDIT Nr.2: replacing the constants 164 through 764 with shared memory constants did not help at all. I am running out of ideas.

wwa · May 6, 2010, 4:51pm

True, while you can’t see from the source what this code really does, it is written like that to show that whatever it does, it can be done in small amount of registers. Let’s see a small part for example:

[codebox]

…

t=key0>>14;

t&=0x2;

tmp=t;

t=key0>>12;

t&=0x1;

tmp|=t;

t=key0>>23;

…

[/codebox]

if we map variables to registers in a 1-1 fashion, we could easily translate this to assembly:

[codebox]

…

shr t,key0,14

and t,t,0x2

mov tmp,t

shr t,key0,12

and t,t,0x1

or tmp,tmp,t

shr t,key0,23

…

[/codebox]

since I use no more than 5 variables total it could be compiled to about 5 registers+few pointer registers, but nvcc can’t compile it below 90ish, which I believe is a bug. Whatever the bug really is about, let NVIDIA people figure out, WE can’t. (actually an official assembler is all I really need, I can code a compiler myself.)

Prepare yourself:

avidday@cuda:~$ nvcc -c -arch=sm_13 -Xptxas="-v" cudes.cu 

ptxas info	: Compiling entry function '_Z3DESPjS_' for 'sm_13'

ptxas info	: Used 96 registers, 2064+16 bytes smem, 4096 bytes cmem[0], 72 bytes cmem[1]

I guess no point switching then…

That’s a nice piece of an ugly hack ;)

I mean, as a coder/hacker I love it, but I can’t pull stuff like that in production code, and change tricks every time hardware/driver/toolkit gets an upgrade. CUDA was about portablility, ease of maintanence and all that big-corp-project stuff, wasn’t it? Marketing BS I guess. The technology isn’t ready. With all those compiler bugs and no sane way around them, it really breaks projects, and you don’t know when, why, and how to fix them.

cbuchner1 · May 6, 2010, 4:58pm

buchner@athlonx2:~/cudes> ptxas --version

ptxas: NVIDIA (R) Ptx optimizing assembler

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul__2_10:56:25_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

buchner@athlonx2:~/cudes>

Run nvcc with the --keep option to get an idea about what the .ptx file looks like. Then modify as needed.

There are ways to insert manually compiled or modified ptx files back into the final binary, just don’t ask me about details (I never did stuff that advanced).

Be aware that the .ptx uses single static register assignment (so seeing 5000 registers used is not unusual). It’s ptxas which then reduces the final register count more or less successfully.

NVCC itself is not a compiler, but rather a compiler driver. In verbose mode you see what commands it executes and in what order. These commands could be placed in a Makefile, so you have more control over what happens and when. For example you could skip the step that generates PTX code and instead you use your own PTX file generated by your script and pass that to PTXAS. In the end, you link everything together to a binary.

Lev · May 6, 2010, 5:53pm

What is about performance of this code? You may have so many registers, put block size128.

wwa · May 6, 2010, 5:56pm

buchner@athlonx2:~/cudes> ptxas --version

ptxas: NVIDIA (R) Ptx optimizing assembler

Copyright (c) 2005-2009 NVIDIA Corporation

Built on Thu_Jul__2_10:56:25_PDT_2009

Cuda compilation tools, release 2.3, V0.2.1221

buchner@athlonx2:~/cudes>
Run nvcc with the --keep option to get an idea about how the .ptx file looks like. Then modify as needed.

There are ways to insert manually compiled or modified ptx files back into the final binary, just don’t ask me (I never did stuff that advanced).

Be aware that the .ptx uses single static register assignment (so seeing 5000 registers used is not unusual). It’s ptxas which then reduces the final register count more or less successfully.

NVCC itself is not a compiler, but rather a compiler driver. In verbose mode you see what commands it executes and in what order. These commands could be placed in a Makefile, so you have more control over what happens and when. For example you could skip the step that generates PTX code and instead you use your own PTX file generated by your script and pass that to PTXAS. In the end, you link everything together to a binary.

I know how this stuff works :)

I used driver API+decuda/cudasm before, even wrote complete kernels from scratch in assembly alone (but that were small projects). decuda is not updated anymore, and doesn’t support Fermi unfortunately (I saw a thread somewhere that it is possible to disassemble Fermi kernels using objdump+nv50dis. Never tried it.)

Problem is PTX is an intermediate code, ptxas is a compiler (actually a part of the compiler toolchain), regardless of whatever NVIDIA calls it. Bigger problem is it sometimes can’t do it’s job right. And looking at the forums “sometimes” doesn’t seem the right word.

What is missing is an assembler that does not change your code at all. It won’t improve your code, but it won’t break it either. I can handle optimizations myself. Current toolkit obviously can’t.

Thank you for the tries and ideas :).

It is unfortunate that you can’t get full performance out of a decent hardware because of a bugged software. I’m suspending my CUDA projects until NVIDIA decides to release true machine-code assembler (my guess would be: never), or at least provides a way to turn optimizations off.

Or I’ll give Stream a try.

cbuchner1 · May 6, 2010, 6:03pm

But in this case it might actually work if you did the register assignment manually and just handed the result off to PTXAS.

It’s such a small step from your script that generated the .CU kernel code in the first place.

Learning Stream from scratch might be equally or even more painful.

I share your complaints about the immaturity of the CUDA toolkit. It seems the feature set evolves more quickly than the code can stabilize.

Christian

wwa · May 6, 2010, 6:07pm

At 40 registers/thread and 128 threads/block on sm_11 (my testing/notebook card) you get 17% occupancy, which isn’t even enough to cover register read-after-write latency.

At original 96 registers/thread you get 13% occupancy on sm_13, same problem.

if you limit it to use less register and spill to lmem then you have to cover gmem latency as well.

point is you obviously can do better then ptxas.

Also, this is NOT a complete project. And how am I supposed to believe the complete project will perform well, if reduced one sucks?

Lev · May 6, 2010, 6:11pm

I mean in compare with cpu.

Lev · May 6, 2010, 6:11pm

Actually, I agree with you, I want register keyword. And I want to control register spilling manualy. I do not get this idea about compiler control, compiler could not know much. And cuda is about optimization of small functions.

wwa · May 6, 2010, 6:26pm

It is doing about 6x faster then CPU at similar optimization level. In comparison to hand-crafted CPU assembly (core2 duo) it loses though.

I’m getting 15.4x speedup with other, similar kernels (without this issue).

Actually it might work, just a few sed/awk/perl passes will do :)

When I’ll calm down a bit I’ll try it, but I have a bad feeling ptxas will optimize out my changes.

I’ll post when done.

Lev · May 6, 2010, 6:35pm

This is one of most dissapointing things in cuda, you do not know, where do your registers go.

cbuchner1 · May 10, 2010, 10:57am

Maybe compiling this with various optimization modes, from -G0 through G4 might give
some ideas about what is going on with the register allocation.

In a different thread someone said that -G0 brought register count down, but increased
local memory use a lot.

Topic		Replies	Views
Cuda compiler will optimize code to use more registers than available by attempting to cache parameters CUDA Programming and Performance	12	2286	November 14, 2017
Cuda compiler loop unroll bug? CUDA Programming and Performance	14	2428	October 25, 2017
On the register allocation optimization of cuda compiler CUDA Programming and Performance	12	3253	January 20, 2019
Ptxas compiler speed. CUDA Programming and Performance	23	12152	December 20, 2012
A strange phenomenon on register allocation. How to reduce register pressure? CUDA Programming and Performance cuda	14	1379	March 25, 2022
How to optimize my cuda code? CUDA Programming and Performance	14	1938	June 28, 2023
nvcc/ptxas under-utilizing registers for arrays CUDA Programming and Performance	13	3180	June 3, 2015
Getting nvcc to consolidate registers CUDA Programming and Performance	19	19515	November 19, 2012
running code from cudatoolkit 3.2 to 4.0 -- ptxas error CUDA Programming and Performance	3	3964	August 17, 2011
Tricks to fight register pressure or how I got down from 29 to 15 registers. CUDA Programming and Performance	14	16341	March 14, 2022

BUG: Broken register allocation, toolkit 2.3

Related topics