The kernel in the attachment compiles to:
ptxas info : Used 60 registers, 44+0 bytes lmem, 2064+16 bytes smem, 4096 bytes cmem[0], 44 bytes cmem[1]
After struggling for a week or so with 60+ register pressure I got pissed and completely rewrote my algorithm. I expanded all macros, and after some sed|awk|perl magic I created a one-line-one-instruction version of it. (Honestly, I don’t even need a compiler now; an assembler would do.) Doing so, I used only 5 variables (plain C, no structs, simply 5 uint32_t variables). Even counting in a few pointer registers (I use shared memory), it could sure as hell fit below 10 registers. But it still compiles to a bloody 60 registers plus spilling in nvcc. It can’t even cover the register read-after-write latency (not to mention the gmem latency nvcc CAUSED by spilling), which means nvcc completely screwed up the optimization.
I tried declaring variables volatile, restructuring code, offloading stuff to smem and even gmem…
My questions are:
What can I do about this?
Could someone compile it in toolkit 3.0 and post the statistics?
I’m in the middle of a project and can’t switch toolkits right now, but it would be nice to know whether 3.0 does the job as it’s supposed to… cudes.cu (27.5 KB)
Impossible to say. That code of yours is completely incomprehensible, so it is pretty hard to know what to suggest.
Prepare yourself:
avidday@cuda:~$ nvcc -c -arch=sm_13 -Xptxas="-v" cudes.cu
ptxas info : Compiling entry function '_Z3DESPjS_' for 'sm_13'
ptxas info : Used 96 registers, 2064+16 bytes smem, 4096 bytes cmem[0], 72 bytes cmem[1]
As a tip: I am not sure what OS you are using, but if it is POSIX-like, you might want to investigate environment modules. I have 4 different toolkit versions installed simultaneously without issue and can do stuff like this:
avidday@cuda:~$ module list
Currently Loaded Modulefiles:
1) mpich2/r1.1.1p1
avidday@cuda:~$ module load cuda
cuda cuda/2.3 cuda/3.0 cuda/3.0b
avidday@cuda:~$ module load cuda/2.3
avidday@cuda:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Jul_30_09:24:36_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
avidday@cuda:~$ module switch cuda/2.3 cuda/3.0
avidday@cuda:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Fri_Feb_19_19:12:59_PST_2010
Cuda compilation tools, release 3.0, V0.2.1221
avidday@cuda:~$ module list
Currently Loaded Modulefiles:
1) mpich2/r1.1.1p1 2) cuda/3.0
which is great for regression testing and “tasting” beta versions and new releases without hurting anything.
I played with this kernel a bit, but all my known volatile tricks failed.
The kernel “needs” 96 registers in order not to spill to local memory. Even when I specify --maxrregcount=124 (the highest supported value), 96 registers is the resulting number.
True, you can’t see from the source what this code really does, but it is written like that to show that whatever it does, it can be done in a small number of registers. Let’s look at a small part, for example:
[codebox]
…
t=key0>>14;
t&=0x2;
tmp=t;
t=key0>>12;
t&=0x1;
tmp|=t;
t=key0>>23;
…
[/codebox]
If we map variables to registers in a 1-1 fashion, we can easily translate this to assembly:
[codebox]
…
shr t,key0,14
and t,t,0x2
mov tmp,t
shr t,key0,12
and t,t,0x1
or tmp,tmp,t
shr t,key0,23
…
[/codebox]
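To sanity-check the claim, the C fragment above can be wrapped as an ordinary host-side function (the name `extract_bits` is mine, just for illustration). It pulls bit 15 of key0 into position 1 and bit 12 into position 0, and uses exactly two temporaries besides the input — three registers total, matching the 1-1 mapping shown in the assembly:

```c
#include <stdint.h>

/* Illustrative wrapper around the fragment above: merges bit 15 of
 * key0 (shifted to position 1) and bit 12 (shifted to position 0)
 * into tmp, using only the two temporaries t and tmp. */
static uint32_t extract_bits(uint32_t key0)
{
    uint32_t t, tmp;
    t = key0 >> 14;
    t &= 0x2;       /* bit 15 of key0, now at position 1 */
    tmp = t;
    t = key0 >> 12;
    t &= 0x1;       /* bit 12 of key0, at position 0 */
    tmp |= t;
    return tmp;
}
```

For example, a key0 with both bit 15 and bit 12 set (0x9000) yields tmp = 3.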
Since I use no more than 5 variables total, it could be compiled to about 5 registers plus a few pointer registers, but nvcc can’t get it below 90-ish, which I believe is a bug. Whatever the bug really is about, let the NVIDIA people figure it out; WE can’t. (Actually, an official assembler is all I really need; I can code a compiler myself.)
I guess no point switching then…
That’s a nice piece of an ugly hack ;)
I mean, as a coder/hacker I love it, but I can’t pull stuff like that in production code and change tricks every time the hardware/driver/toolkit gets an upgrade. CUDA was about portability, ease of maintenance and all that big-corp-project stuff, wasn’t it? Marketing BS, I guess. The technology isn’t ready. With all those compiler bugs and no sane way around them, it really breaks projects, and you don’t know when, why, or how to fix them.
buchner@athlonx2:~/cudes> ptxas --version
ptxas: NVIDIA (R) Ptx optimizing assembler
Copyright (c) 2005-2009 NVIDIA Corporation
Built on Thu_Jul__2_10:56:25_PDT_2009
Cuda compilation tools, release 2.3, V0.2.1221
buchner@athlonx2:~/cudes>
Run nvcc with the --keep option to get an idea about what the .ptx file looks like. Then modify as needed.
There are ways to insert manually compiled or modified .ptx files back into the final binary; just don’t ask me about the details (I never did anything that advanced).
Be aware that the .ptx uses static single assignment (SSA) for registers, so seeing 5000 registers used there is not unusual. It’s ptxas which then reduces the final register count, more or less successfully.
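To make the SSA point concrete, here is a rough sketch of what the earlier 5-variable C fragment tends to look like in the .ptx (made-up pseudo-PTX, not taken from this kernel): every assignment gets a fresh virtual register, so two C temporaries fan out into many %r numbers, and it is ptxas’s job to collapse them back down.

[codebox]
shr.b32  %r101, %r1, 14;       // t   = key0 >> 14
and.b32  %r102, %r101, 2;      // t  &= 0x2
mov.b32  %r103, %r102;         // tmp = t
shr.b32  %r104, %r1, 12;       // t   = key0 >> 12
and.b32  %r105, %r104, 1;      // t  &= 0x1
or.b32   %r106, %r103, %r105;  // tmp |= t
[/codebox]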
NVCC itself is not a compiler but a compiler driver. In verbose mode you can see which commands it executes and in what order. These commands could be placed in a Makefile, giving you more control over what happens and when. For example, you could skip the step that generates the PTX code and instead pass your own script-generated PTX file to ptxas. In the end, you link everything together into a binary.
I used the driver API plus decuda/cudasm before, and even wrote complete kernels from scratch in assembly alone (but those were small projects). Unfortunately, decuda is no longer updated and doesn’t support Fermi. (I saw a thread somewhere saying it is possible to disassemble Fermi kernels using objdump+nv50dis; never tried it.)
The problem is that PTX is intermediate code and ptxas is a compiler (actually part of the compiler toolchain), regardless of what NVIDIA calls it. The bigger problem is that it sometimes can’t do its job right. And looking at the forums, “sometimes” doesn’t seem like the right word.
What is missing is an assembler that does not change your code at all. It won’t improve your code, but it won’t break it either. I can handle the optimizations myself; the current toolkit obviously can’t.
Thank you for the tries and ideas :).
It is unfortunate that you can’t get full performance out of decent hardware because of buggy software. I’m suspending my CUDA projects until NVIDIA decides to release a true machine-code assembler (my guess: never), or at least provides a way to turn optimizations off.
At 40 registers/thread and 128 threads/block on sm_11 (my testing/notebook card) you get 17% occupancy, which isn’t even enough to cover register read-after-write latency.
At the original 96 registers/thread you get 13% occupancy on sm_13; same problem.
If you limit it to use fewer registers and it spills to lmem, then you have to cover gmem latency as well.
The point is, you obviously can do better than ptxas.
Also, this is NOT the complete project. And how am I supposed to believe the complete project will perform well if the reduced one sucks?
Actually, I agree with you: I want the register keyword, and I want to control register spilling manually. I don’t get this idea of handing everything to the compiler; the compiler can’t know that much. And CUDA is about optimizing small functions.