I’m trying to understand what I am seeing in the Visual Profiler regarding the number of registers per thread. For my kernel, it says 32 under “registers per thread”. Does that mean that each thread uses 32 registers? What contributes to this? I certainly did not write “int x = …” 32 times or anything like that, so I don’t understand where this number is coming from. I have a setup kernel that uses (as I see it) 4 registers, but the profiler says 15.
I will just show the setup kernel code:
[codebox]__global__ void setup(double* d_values, double** d_valuesPtr,
                      double* d_ntimes, double** d_ntimesPtr,
                      TimeStamp* d_times, TimeStamp** d_timesPtr,
                      ulong* d_tiles, ulong** d_tilesPtr,
                      ulong* d_offsetArray, ulong* d_valuesLengthRow,
                      double* d_c, double** d_cPtr,
                      double* d_p, double** d_pPtr,
                      double* d_exp_deltas, double** d_exp_deltasPtr)
{
    int i = blockIdx.x;   // one block per row
    ulong lenRow = d_valuesLengthRow[i];

    // Elements handled per thread: lenRow / blockDim.x, rounded up.
    // (This was written "slice = 0 ? ...", which always takes the second branch.)
    ulong slice = lenRow % (unsigned)blockDim.x;
    slice = (slice == 0) ? (lenRow / (unsigned)blockDim.x)
                         : ((lenRow / (unsigned)blockDim.x) + 1);

    // Point each row pointer at its spot in the flat 1D arrays.
    d_valuesPtr[i]     = &d_values[d_offsetArray[i]];
    d_ntimesPtr[i]     = &d_ntimes[d_offsetArray[i]];
    d_timesPtr[i]      = &d_times[d_offsetArray[i]];
    d_cPtr[i]          = &d_c[i * gridDim.x];
    d_pPtr[i]          = &d_p[i * gridDim.x];
    d_exp_deltasPtr[i] = &d_exp_deltas[d_offsetArray[i]];
    d_tilesPtr[i]      = &d_tiles[i * (NUM_TILES + 1)];

    // Each thread fills its own slice of exp deltas, starting at index 1.
    int n;
    int boundary = (1 + ((threadIdx.x + 1) * slice));
    if (lenRow > 0)
        for (n = 1 + threadIdx.x * slice; n < boundary && n < lenRow; ++n)
            d_exp_deltasPtr[i][n] = exp(d_ntimesPtr[i][n-1] - d_ntimesPtr[i][n]);
}[/codebox]
I am in the process of changing the way I allocate my multidimensional arrays. At the moment I allocate a 1D array with 2D pointers to the correct spots, but this is generally not a good idea (even though this change speeds up my kernel hardly at all). I call the kernel like so:
In my kernel methods, I have cut out nearly half of the registers I allocate, but the profiler still reports “32 registers per thread”.
For my kernel, I see 12 registers used (profiler says 32). For the setup, I count 5 (profiler says 15). I replaced all the “i” with “blockIdx.x”, cutting down the register usage by one, but the number reported by the profiler still did not change.
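For cross-checking the profiler's number, here is a minimal sketch of querying the register count the compiler actually assigned, using `cudaFuncGetAttributes` from the CUDA runtime. The kernel `demoKernel` is a hypothetical stand-in, not my real kernel. The count is decided by the compiler rather than by the number of variables declared in the source: 64-bit values such as pointers, `double` and `ulong` each take two 32-bit registers, and address arithmetic and intermediate results of calls like `exp` need registers too, so hand counts will probably come out low.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel (assumption for illustration only);
// substitute the kernel whose register usage is being checked.
__global__ void demoKernel(double* out, const double* in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = exp(in[i]);
}

int main()
{
    // Ask the runtime for the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, demoKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // numRegs is the per-thread register count chosen by the compiler.
    std::printf("registers per thread:    %d\n", attr.numRegs);
    // Non-zero local memory can indicate register spilling.
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

Compiling with `nvcc -Xptxas -v` should print the same per-kernel register count at build time, which gives a second number to compare against the profiler.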