Register allocator overload


I have a large kernel of the form:


foreach k
  foreach y
    unsigned tex_base_y = 0;
    float sum = 0;

    unsigned data = tex1Dfetch(tex_ref_1d, row_offs + (x0 + 20) + 11 * 768);
    tex_x = (data >> 25) & 0x7;
    tex_y = (data >> 29) & 0x7;
    sum += tex2D(tex_ref_2d, tex_x, tex_base_y + tex_y);
    tex_base_y += 8;

    <NUMREP repeats of code block above, with different constants>
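For readers who want to check the decode step on its own, here is a minimal host-side sketch in plain C++ of the two 3-bit field extractions from the snippet. The struct TexCoord and the function unpack_coord are hypothetical names used only for illustration, and passing the shifts as parameters assumes that the "different constants" in other repetitions include the bit offsets, which the post does not confirm.

```cpp
#include <cstdint>

// Hypothetical helper type: the two unpacked 3-bit coordinates.
struct TexCoord {
    std::uint32_t x;
    std::uint32_t y;
};

// Extract two 3-bit fields from a packed 32-bit word.
// The snippet above uses shift_x = 25 and shift_y = 29.
TexCoord unpack_coord(std::uint32_t data, unsigned shift_x, unsigned shift_y) {
    TexCoord c;
    c.x = (data >> shift_x) & 0x7u;  // 3 bits starting at bit shift_x
    c.y = (data >> shift_y) & 0x7u;  // 3 bits starting at bit shift_y
    return c;
}
```

Each result is in the range 0..7, which matches tex_base_y advancing by 8 per repetition in the snippet.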




In one particular case, if NUMREP is 105 or less, the kernel uses 31 registers or fewer and runs at 114 Mpix/s. If I set NUMREP=106, register usage jumps to 35. That alone is not so bad, but performance immediately drops to 92 Mpix/s, just from adding 4 lines of code. From what I can tell, adding another REP should (in a perfect world) not increase register usage at all, since no new variables are introduced. Shared memory usage is 40 bytes in both cases, while constant memory usage increases from 24 to 32 bytes.

Does anyone have a guess about what happens? Am I running into some nvcc compiler limit that makes it choose another code generation strategy?


It might be that the optimizer is moving some statements forward to ‘pre-load’ values from memory. In that case you do get ‘new’ variables, and extra registers are needed to hold them.
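To make the idea concrete, here is a rough host-side C++ sketch of what such a reordering looks like at source level. It is only an illustration of the hypothesis, not what nvcc actually generates for this kernel. The function names and the array t, which stands in for the texture, are hypothetical.

```cpp
// Interleaved form: each loaded value is consumed right away,
// so only one temporary needs to be live at any time.
float sum_interleaved(const float* t, const unsigned* idx) {
    float sum = 0.0f;
    sum += t[idx[0]];
    sum += t[idx[1]];
    sum += t[idx[2]];
    sum += t[idx[3]];
    return sum;
}

// Hoisted form: all loads are issued first to hide latency,
// so four loaded values are live at once before the adds begin.
float sum_hoisted(const float* t, const unsigned* idx) {
    const float v0 = t[idx[0]];
    const float v1 = t[idx[1]];
    const float v2 = t[idx[2]];
    const float v3 = t[idx[3]];
    float sum = 0.0f;
    sum += v0;
    sum += v1;
    sum += v2;
    sum += v3;
    return sum;
}
```

Both functions compute the same result in the same order. The only difference is how many loaded values must be held at the same time, which is what would drive register usage up if the compiler chose the second shape.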

You can pass -maxrregcount=32 to nvcc to keep the number of registers down.
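As a rough illustration of why a register cap can matter, the sketch below estimates how many blocks fit on one multiprocessor from the register budget alone. The numbers are assumptions, not facts from this thread: 8192 registers per multiprocessor (as on compute capability 1.0/1.1 devices) and 128 threads per block, since the post gives neither the GPU nor the launch configuration. The function name resident_blocks is hypothetical, and the model ignores register allocation granularity and all other per-multiprocessor limits.

```cpp
// Simplified estimate: blocks per multiprocessor limited by registers only.
// Ignores allocation granularity, shared memory, and thread/block caps.
int resident_blocks(int regs_per_sm, int regs_per_thread, int threads_per_block) {
    return regs_per_sm / (regs_per_thread * threads_per_block);
}
```

Under these assumed numbers, 31 or 32 registers per thread allow 2 resident blocks, while 35 registers allow only 1. Whether that applies here depends on the actual block size and hardware.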

Thanks E.D. I just tried using -maxrregcount=31. It ends up using a bit of local memory instead, but I still get the same sharp performance drop when going from 105 to 106 REPs. I had a quick look at the generated code using decuda, but couldn’t see any major differences.

I guess it could be some cache effect, where going from 105 to 106 REPs makes the last REP evict texture cache data needed by the previous REPs. The CUDA profiler doesn’t show any major difference between the 105 and 106 versions. It would be useful to be able to get the texture cache hit rate from the profiler. Still, the performance decrease is surprisingly large. Going from 10 to 105 REPs, the performance slowly drops from 151 to 112 Mpix/s, but at 106 it immediately drops to 92 Mpix/s and stays there up to 120+ REPs.