Register usage: understanding -ptx and -cubin

Hi,

I have a simple kernel. When I use nvcc -cubin, I see that the register usage is 20.

But if I use nvcc -ptx, it outputs this:

       .entry device_compute_distancearray
        {
        .reg .u32 $r1,$r2,$r3,$r4,$r5,$r6,$r7,$r8,$r9,
                $r10,$r11,$r12,$r13,$r14,$r15,$r16,$r17,$r18,$r19,
                $r20,$r21,$r22,$r23,$r24,$r25,$r26,$r27,$r28,$r29,
                $r30,$r31,$r32,$r33,$r34,$r35,$r36,$r37,$r38,$r39,
                $r40;
        .reg .u64 $rd1,$rd2,$rd3,$rd4,$rd5,$rd6,$rd7,$rd8,$rd9,
                $rd10,$rd11,$rd12,$rd13,$rd14,$rd15,$rd16,$rd17,$rd18,$rd19,
                $rd20,$rd21,$rd22,$rd23,$rd24,$rd25,$rd26,$rd27,$rd28,$rd29,
                $rd30,$rd31,$rd32,$rd33,$rd34,$rd35;
        .reg .f32 $f1,$f2,$f3,$f4,$f5,$f6,$f7,$f8,$f9,
                $f10,$f11,$f12,$f13,$f14,$f15,$f16,$f17,$f18;
        .reg .pred $p0,$p1,$p2,$p3,$p4,$p5,$p6,$p7,$p8,$p9,
                $p10,$p11,$p12,$p13,$p14,$p15,$p16,$p17,$p18;
       ....

Looks like a lot more than 20 to me :blink:

How does register allocation work? Is the PTX re-optimized later, or… what?

Some of these registers seem completely wasted, btw. For instance, $rd24 appears exactly once, in:

mov.u64         $rd24, $rd11;

So the code writes to $rd24 but never reads it. Weird.

At a glance, the PTX appears to be in static single assignment (SSA) form: each value is written exactly once, into its own register. The nice thing about SSA is that it exposes exactly which values are reused, so later compilation stages can take advantage of that and optimize register usage further. Heck, if you read the guide, it says that register usage may even change between the cubin and the device.
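To illustrate, here is a hand-written sketch of what SSA form means (not actual compiler output; the PTX-style register names in the comments are made up):

    // Hand-written illustration of SSA form, not actual compiler output.
    // In SSA, every intermediate value is written exactly once, into a
    // fresh register; reuse shows up as multiple reads of that register.
    __device__ float example(float a, float b)
    {
        float t = a * b;   // PTX-style: mul.f32 $f1, $fa, $fb
        float u = t + a;   //            add.f32 $f2, $f1, $fa
        return u * t;      //            mul.f32 $f3, $f2, $f1
        // A later register allocator can see that $f1 dies at the final
        // mul, so $f1..$f3 fit into just two hardware registers.
    }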

Yes.

Paulius

Thanks.
I suppose there’s no way of finding out or controlling what these late steps do?

Not really. One reason is that the optimizations are constantly being worked on. Have you tried using the -maxrregcount flag to nvcc? It will reduce the number of registers in the .cubin file, though spilling into local memory will occur if the limit is set too low. Nevertheless, I’ve seen applications where performance went up by 50% when registers were reduced from 18 to 8, even though 40 bytes of (slow) local memory were used to achieve that.
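For example, a hypothetical invocation (source file name made up) looks like:

    # Caps registers per thread at 16; anything beyond that is spilled
    # into local memory by the compiler:
    nvcc -cubin --maxrregcount 16 kernel.cu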

Paulius

Well, yes I did.

with default nvcc I get:
lmem = 0
smem = 13892
reg = 20
And my application works fine.

with --maxrregcount 16:
lmem = 8
smem = 13892
reg = 15
… but in that case my application does not work :-/
CUDA gives me no error, but the application returns almost instantly and the results are all zeroes.

Hmm. Sounds like a launch failure or a run-time kernel crash. Try running with the profiler turned on, then check the log file for kernel times (if none are shown, it’s a launch failure or a crash).
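In case it helps, the profiler is enabled through environment variables, something like this (application name hypothetical):

    export CUDA_PROFILE=1      # turn the profiler on for this shell
    ./myapp                    # run the application as usual
    cat cuda_profile.log       # the default log file, one line per memcopy/kernel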

Paulius

OK.

This is the log without the --maxreg option:

method=[ memcopy ] gputime=[ 2047.840 ]
method=[ memcopy ] gputime=[ 214.976 ]
method=[ memcopy ] gputime=[ 214.912 ]
method=[ memcopy ] gputime=[ 1204.864 ]
method=[ device_compute_distancearray ] gputime=[ 1574686.375 ] cputime=[ 5869767.000 ] occupancy=[ 0.500 ]
method=[ memcopy ] gputime=[ 794.720 ]
method=[ memcopy ] gputime=[ 760.384 ]

And this is the log with it:

method=[ memcopy ] gputime=[ 2043.104 ]
method=[ memcopy ] gputime=[ 215.872 ]
method=[ memcopy ] gputime=[ 214.688 ]
method=[ memcopy ] gputime=[ 1207.168 ]
method=[ device_compute_distancearray ] gputime=[ 588.032 ] cputime=[ 612.000 ] occupancy=[ 0.500 ]
method=[ memcopy ] gputime=[ 750.176 ]
method=[ memcopy ] gputime=[ 753.312 ]

If I had to guess, I’d say that your kernel is crashing. If you define the _DEBUG flag in your code before compiling, CUDA_SAFE_CALL might tell you which CUDA call is causing the problem (most likely it’s going to be the kernel invocation). If it’s a memory or sync problem, it can often be found by running your code in emulation mode (i.e., on the CPU), since you can then use all the traditional debugging tools. You can also try commenting out parts of your kernel code to narrow down where the crash occurs. This might be painstaking, but you should be able to find the culprit.

Paulius

OK, I tried all of this:

- The problem never happens unless I use the -maxreg option of nvcc.
- It does not happen when running in host (emulation) mode.

When using the -maxreg option, I get an ‘unspecified launch failure’.

It is also interesting to note that ‘CUT_CHECK_ERROR()’ is absolutely useless in my case, unless CUDA_PROFILE is set.

With CUDA_PROFILE set, the launch failure is caught by CUT_CHECK_ERROR().

Without CUDA_PROFILE, it really looks like everything happens asynchronously:

  • the output of a printf statement placed just after the kernel call appears even though the kernel has not completed its execution (well, this is what I can see without --maxreg; in that case the printf appears almost immediately, even though the kernel takes about 6 seconds to complete);
  • CUT_CHECK_ERROR() reports success, even in case of failure.

The synchronization and proper error checking happen at the next call to cudaMemcpy(). Hence, without CUDA_PROFILE, it is actually CUDA_SAFE_CALL(cudaMemcpy(…)) that reports the launch failure.

It is also interesting that cudaThreadSynchronize() seems to have no effect.
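For reference, the checking pattern I would expect to work looks like this. This is a self-contained sketch; stand_in_kernel and its launch configuration are made-up placeholders, since my real kernel’s arguments aren’t shown here:

    // Self-contained sketch of explicit post-launch error checking.
    // stand_in_kernel is a made-up placeholder for the real kernel.
    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void stand_in_kernel(float *out) { out[threadIdx.x] = 1.0f; }

    int main(void)
    {
        float *d_out;
        cudaMalloc((void **)&d_out, 32 * sizeof(float));

        stand_in_kernel<<<1, 32>>>(d_out);       // the launch returns immediately

        cudaError_t err = cudaGetLastError();    // catches launch failures
        if (err == cudaSuccess)
            err = cudaThreadSynchronize();       // blocks until the kernel is done
                                                 // and also returns run-time errors
        if (err != cudaSuccess)
            printf("kernel error: %s\n", cudaGetErrorString(err));

        cudaFree(d_out);
        return 0;
    }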

A couple of things:

  1. CUDA_SAFE_CALL(…) and the other macros compile to nothing unless _DEBUG is defined (you can see this by checking the cutil.h file; a paraphrased sketch follows this list). That’s to ensure that maximum performance is achieved in non-debug builds.

  2. I think I have an idea of what the problem is. You mentioned that your kernel takes about 6 s after register optimizations. Currently, kernels are restricted to under 5 s by the watchdog mechanism; if that time is exceeded, the kernel is killed by the system, which would explain why you get invalid results.
    Now, the time with register optimizations is probably longer because your kernel uses 8 bytes of local memory (which is slow and uncached, just like global memory). If possible, I’d suggest verifying this by running the code on a smaller problem and measuring the time with and without -maxrregcount.
    Unfortunately, register spilling into local memory occurs when the compiler cannot reduce the number of registers any further (every program has its limit; it’s a matter of how close the compiler can get to it).
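To make point 1 concrete, the macro behaves roughly like this (a paraphrase of the cutil.h idea, not the verbatim SDK source):

    /* Paraphrase of the cutil.h idea, not the verbatim SDK source.
       Assumes <stdio.h> and <stdlib.h> are included. */
    #ifdef _DEBUG
    #  define CUDA_SAFE_CALL(call) do {                                  \
           cudaError_t err = (call);                                     \
           if (err != cudaSuccess) {                                     \
               fprintf(stderr, "CUDA error in file %s, line %d: %s\n",   \
                       __FILE__, __LINE__, cudaGetErrorString(err));     \
               exit(EXIT_FAILURE);                                       \
           }                                                             \
       } while (0)
    #else
    #  define CUDA_SAFE_CALL(call) (call)   /* checks compile away in release */
    #endif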

Paulius

Yes, of course; I already have _DEBUG defined.

Actually, I run under Linux without X, and I have been able to run kernels flawlessly for up to ~8.5 seconds.

And as I wrote before, without -maxreg my kernel runs for ~6 seconds and the results are OK.

But WITH -maxreg, it aborts almost instantly, so this is not a time-limit issue.

I’ve tried anyway with a smaller case that takes ~0.8 seconds without -maxreg. It still fails with an ‘unspecified launch failure’ when I add -maxreg to nvcc.

Simon