register count frustration

nvcc -g -G -Xptxas=-v --maxrregcount=64 -arch=sm_20 common_libs.o timing.o dataIO.o gpu_testing.cu -o gpu_testing

ptxas info : Compiling entry function ‘_Z14UT_k_rmsd_calcjjjP6float3Pf’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z14UT_k_rmsd_calcjjjP6float3Pf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _ZSt4fabsf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z9calc_rmsdjP6float3S0_

40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads

ptxas info : Function properties for _ZSt4sqrtf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 44 registers, 8048+0 bytes smem, 64 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z22UT_k_center_conformersjjP6float3’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z22UT_k_center_conformersjjP6float3

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _ZmIR6float3RKS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z17center_conformersjjP6float3

40 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads

ptxas info : Function properties for _ZdVR6float3RKf

16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads

ptxas info : Function properties for _ZpLR6float3RKS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 20 registers, 48 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z20k_update_point_rmsdsjjjPjP6float3PfS_S2_S_S2_S_jS2_’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z20k_update_point_rmsdsjjjPjP6float3PfS_S2_S_S2_S_jS2_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z21parallel_ExcPrefixSumjPj

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _ZSt4fabsf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z12parallel_MaxILj512EfjEvjPT0_PT1_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z9calc_rmsdjP6float3S0_

40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads

ptxas info : Function properties for _ZSt4sqrtf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z28parallel_binary_scatter_sortjPj

16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads

ptxas info : Used 44 registers, 20336+0 bytes smem, 128 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z19k_calc_c_to_c_distsjjPjP6float3PfS_S2_S_jS2_’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z19k_calc_c_to_c_distsjjPjP6float3PfS_S2_S_jS2_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _ZSt4fabsf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z9calc_rmsdjP6float3S0_

40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads

ptxas info : Function properties for _ZSt4sqrtf

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 44 registers, 6000+0 bytes smem, 104 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z14k_parallel_maxjPfPjS_S0_’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z14k_parallel_maxjPfPjS_S0_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z12parallel_MaxILj512EfjEvjPT0_PT1_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 12 registers, 4096+0 bytes smem, 72 bytes cmem[0]

ptxas info : Compiling entry function ‘_Z19k_center_conformersjjP6float3’ for ‘sm_20’

ptxas warning : Too big maxrregcount value specified 64, will be ignored

ptxas info : Function properties for _Z19k_center_conformersjjP6float3

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _ZmIR6float3RKS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Function properties for _Z17center_conformersjjP6float3

40 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads

ptxas info : Function properties for _ZdVR6float3RKf

16 bytes stack frame, 16 bytes spill stores, 16 bytes spill loads

ptxas info : Function properties for _ZpLR6float3RKS_

0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

ptxas info : Used 20 registers, 48 bytes cmem[0]

Basically, I’ve told the compiler via maxrregcount that up to 64 registers are available per thread (I’m using 512 threads per block, so 32768/512 = 64).

Yet the ptxas output above shows that no kernel uses more than 44 registers, with the rest spilling into local memory even though I haven’t saturated the 64 registers available. What is going on?

Please take note of the following warning which occurs multiple times in the build log:

ptxas warning : Too big maxrregcount value specified 64, will be ignored

The maximum number of registers per thread on sm_20 is 63, not 64. How do things look if you specify -maxrregcount=63? You may also want to look into using the __launch_bounds__() mechanism to control register pressure, as it offers finer-grained, per-kernel control.
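The arithmetic from the question and the per-thread cap can be put together in a small host-side sketch. It assumes the figures quoted in this thread (32768 registers per multiprocessor, at most 63 per thread on sm_20) and ignores register allocation granularity, so it is only an estimate. The function and constant names are made up for illustration.

```cpp
#include <algorithm>

// Figures quoted in this thread for sm_20 (assumptions of this sketch).
const int kRegistersPerSM = 32768;      // 32-bit registers per multiprocessor
const int kMaxRegistersPerThread = 63;  // per-thread limit on sm_20

// Hypothetical helper: upper bound on registers per thread for a given
// launch configuration. Allocation granularity is ignored.
inline int register_budget(int threads_per_block, int blocks_per_sm)
{
    int resident_threads = threads_per_block * blocks_per_sm;
    int by_register_file = kRegistersPerSM / resident_threads;
    return std::min(by_register_file, kMaxRegistersPerThread);
}
```

With 512 threads per block and one resident block, the register file alone would allow 64 registers per thread, but the per-thread cap brings the budget down to 63, which matches the ptxas warning about 64 being too big.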

I tried it with 63 as well, also to no effect. I’ve also tried __launch_bounds__(512,1), again to no effect.

Hm, this does look a bit weird. Are you seeing this with CUDA 4.0? If so, would it be possible to post a self-contained repro case? A single kernel which exhibits the issue would be sufficient. What platform are you on? I could try the code on WinXP64 and Linux64, those are the only platforms at my disposal. One thing I would like to look at is whether the statistics reported are accurate.

4.0 / linux64

I can’t do a self-contained repro case, but I found the loop that causes a large chunk of the spill (this is a Newton-Raphson iteration).

With this loop left in, I get 40 bytes stack frame, 36 bytes spill stores, 36 bytes spill loads

301     for (int i = 0; i < 50; i++)
302     {
303         lambda_old = lambda;
304         lambda2 = lambda_old * lambda_old; // lambda_old^2
305         b = (lambda2 + C_2) * lambda_old;  // b = lambda_old^3 + C_2*lambda_old
306         a = b + C_1;                       // a = lambda_old^3 + C_2*lambda_old + C_1
307         lambda = lambda_old - (a * lambda_old + detK) / (2.0 * lambda2 * lambda_old + b + a);
308         //                    lambda_old^4 + C_2*lambda_old^2 + C_1*lambda_old + C_0   (C_0 = detK)
309         //     = lambda_old - ------------------------------------------------------
310         //                         4*lambda_old^3 + 2*C_2*lambda_old + C_1
311         if (fabs(lambda - lambda_old) < fabs(1.0e-6 * lambda)) break;
312     }

With this loop commented out, I get 16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads

If I keep the loop, but I comment out only line 307 in the loop:

307 lambda = lambda_old - (a * lambda_old + detK) / (2.0 * lambda2 * lambda_old + b + a);

I get 24 bytes stack frame, 20 bytes spill stores, 20 bytes spill loads

Instead, if I change line 307 to simply:

307 lambda = (a * lambda_old + detK);

I still get 32 bytes stack frame, 32 bytes spill stores, 32 bytes spill loads. Compared with commenting line 307 out entirely (20 bytes of spill), the a*lambda_old + detK expression alone accounts for 12 more bytes of spill, i.e. 3 extra 32-bit registers!

If I comment out line 311 as well:

311 if (fabs(lambda - lambda_old) < fabs(1.0e-6 * lambda)) break;

I get 16 bytes stack frame, 12 bytes spill stores, 12 bytes spill loads

What I still don’t understand is why the compiler chooses to spill into local memory rather than use the 19 registers still available to each thread.
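One thing that may be worth checking in the loop above, assuming lambda and the coefficients are declared as float (the declarations are not shown in the thread): the literals 2.0 and 1.0e-6 are double constants, so the expressions containing them are promoted to double precision, and each double value occupies two 32-bit registers. The sketch below is a single-precision variant of the same iteration. The function name, its parameter list, and the HD macro are hypothetical, and whether this change reduces the spills reported here has not been verified.

```cpp
#include <math.h>

// Compiles as plain C++ on the host; under nvcc the macro also makes the
// function callable from device code.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Newton-Raphson iteration for a root of
//   lambda^4 + C_2*lambda^2 + C_1*lambda + C_0 = 0,
// starting from lambda0. Hypothetical signature; C_0 plays the role of detK.
HD inline float newton_lambda(float C_2, float C_1, float C_0, float lambda0)
{
    float lambda = lambda0;
    for (int i = 0; i < 50; i++)
    {
        float lambda_old = lambda;
        float lambda2 = lambda_old * lambda_old;   // lambda_old^2
        float b = (lambda2 + C_2) * lambda_old;    // lambda_old^3 + C_2*lambda_old
        float a = b + C_1;                         // b + C_1
        // The float literals 2.0f and 1.0e-6f keep every operation in single precision.
        lambda = lambda_old - (a * lambda_old + C_0) / (2.0f * lambda2 * lambda_old + b + a);
        if (fabsf(lambda - lambda_old) < fabsf(1.0e-6f * lambda)) break;
    }
    return lambda;
}
```

As a quick sanity check, the polynomial lambda^4 - 5*lambda^2 + 4 has roots at ±1 and ±2, so newton_lambda(-5.0f, 0.0f, 4.0f, 3.0f) should settle near 2.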