Nested Loops - register usage


I have a question regarding register usage. My code includes the nested loop shown below.
While register usage is 24 for the outer loop, it rises to 71 for the inner “sum += …” statement, which is close to the 72 registers allocated per thread. All values are of type “int”. Am I doing something wrong? As I understand it, there should be no need for this many registers, since the registers used to store the summands’ values can be reused in each loop iteration.

Furthermore, the profiler shows that the number of executed instructions for the “sum += …” line is about 450,000, while the number of predicated-on thread instructions is 13,000,000. Is that expected behavior? For testing purposes, all threads are executed with exactly the same input data and should behave exactly the same. demoObject1 is a pointer to a class object located in global memory. demoObject2 is a thread-specific pointer to an object also located in global memory.

for (int k = 0; k < demoObject1->someValue1; k++) {
    for (int i = 0; i < demoObject1->someValue2; i++) {
        sum += demoObject1->someArray[i][demoObject2->someValue[i]][k];
    }
}

Just to be clear: I’m aware that it is common to parallelize loops across different GPU threads. However, this is not an option for the kind of problem I’m trying to solve.


The compiler may be unrolling your loop. This can dramatically increase register usage. It’s not possible to confirm that based on what you have shown, however.
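
One way to test the unrolling hypothesis is to forbid unrolling with `#pragma unroll 1` and compare the per-kernel register count that `nvcc -Xptxas -v` reports with and without the pragma. The sketch below is only an illustration under stated assumptions: the `Demo1` and `Demo2` structs, the fixed array sizes, the kernel name `sumKernel`, and the helper `runDemo` are hypothetical stand-ins, since the original class definitions were not posted. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical sizes; the real dimensions were not shown in the question.
constexpr int N_I = 4;  // plays the role of someValue2
constexpr int N_J = 3;  // range of the indirect index someValue[i]
constexpr int N_K = 5;  // plays the role of someValue1

// Hypothetical stand-ins for the classes behind demoObject1 and demoObject2.
struct Demo1 {
    int someValue1;
    int someValue2;
    int someArray[N_I][N_J][N_K];
};

struct Demo2 {
    int someValue[N_I];
};

__global__ void sumKernel(const Demo1* d1, const Demo2* d2, int* out)
{
    // Read the loop bounds once into locals.
    const int nk = d1->someValue1;
    const int ni = d1->someValue2;
    int sum = 0;

    // "#pragma unroll 1" asks the compiler not to unroll the loop that follows.
    #pragma unroll 1
    for (int k = 0; k < nk; k++) {
        #pragma unroll 1
        for (int i = 0; i < ni; i++) {
            sum += d1->someArray[i][d2->someValue[i]][k];
        }
    }
    out[threadIdx.x] = sum;
}

// Fills the hypothetical objects, runs one warp, and returns thread 0's sum.
int runDemo()
{
    Demo1 h1{};
    h1.someValue1 = N_K;
    h1.someValue2 = N_I;
    for (int i = 0; i < N_I; i++)
        for (int j = 0; j < N_J; j++)
            for (int k = 0; k < N_K; k++)
                h1.someArray[i][j][k] = j;  // value equals the middle index

    Demo2 h2{};
    for (int i = 0; i < N_I; i++)
        h2.someValue[i] = i % N_J;

    Demo1* d1 = nullptr;
    Demo2* d2 = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&d1, sizeof(Demo1));
    cudaMalloc(&d2, sizeof(Demo2));
    cudaMalloc(&dOut, 32 * sizeof(int));
    cudaMemcpy(d1, &h1, sizeof(Demo1), cudaMemcpyHostToDevice);
    cudaMemcpy(d2, &h2, sizeof(Demo2), cudaMemcpyHostToDevice);

    // All 32 threads use the same inputs, as in the poster's test setup.
    sumKernel<<<1, 32>>>(d1, d2, dOut);

    int hOut[32] = {0};
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dOut);
    return hOut[0];
}

int main()
{
    printf("sum = %d\n", runDemo());
    return 0;
}
```

Building this once as written and once with the two pragmas removed, each time with `nvcc -Xptxas -v`, would show whether the register count changes for this toy kernel. The same comparison on the real kernel is what would actually confirm or rule out unrolling as the cause.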