Hello everyone, here are two kernel functions that are identical except that one uses an array ff[10] while the other uses individual variables ff0, ff1, ff2, ff3, ff4, ff5, ff6, ff7, ff8, ff9. The variables in the two kernels correspond to each other one to one, and the computation steps are exactly the same. Why does the former use fewer registers than the latter?
Can you show the code in question? How big is the difference in the number of registers used? You are looking at a release build with full optimization, correct?
Generally, small differences in high-level source code can lead to small differences in the generated machine code with accompanying changes in register allocation (there are interdependencies between instruction selection and scheduling and register pressure, for example). So the observation is not unusual as such. My expectation would be that the differences in generated SASS (machine code) and register use are small. Note also that the number of registers used could change in favor of either variant depending on target architecture.
If the kernel code is small enough, it might be possible to trace differences from the source code through the PTX intermediate representation to the final machine code.
Thank you for your response.
1. However, I may not be able to share the complete code.
2. There is indeed a significant difference in the number of registers. If one kernel uses f[10] while the other uses f0, f1, …, f9, the two kernel functions differ by more than ten registers.
3. It’s important to note that this is not a fully optimized version. The code implements a newer parallel algorithm for a specific numerical method.
When I mentioned “with full optimization” I was referring to compiler flags. By default, nvcc compiles with full optimization in release builds, but it is possible for programmers to override that with -Xptxas -O{0|1|2} for the backend, which is the part that does the register allocation.
Cutting down the affected code as much as possible while still being able to reproduce the register usage issue is actually highly desirable, creating a so-called “minimal reproducer”. In order to be suitable for analysis in this forum, it would have to be at least compilable without additional third-party header files and libraries, though.
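For illustration only (this is a hypothetical sketch, not your actual kernel), a self-contained reproducer comparing an array variant against a scalar variant might look like the following; it needs nothing beyond the CUDA toolkit and can be compiled with, e.g., nvcc -arch=sm_70 -cubin -Xptxas -v repro.cu:

```
// Hypothetical minimal reproducer sketch -- not the original kernel.
// Both kernels perform the same arithmetic; only array vs. scalars differs.
__global__ void kernel_array(const float* __restrict__ in, float* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float ff[10];
    #pragma unroll
    for (int k = 0; k < 10; k++) ff[k] = in[i * 10 + k] * (float)(k + 1);  // constant indices after unrolling
    float s = 0.0f;
    #pragma unroll
    for (int k = 0; k < 10; k++) s += ff[k];
    out[i] = s;
}

__global__ void kernel_scalars(const float* __restrict__ in, float* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Manually scalarized equivalent of the array version above
    float ff0 = in[i * 10 + 0] *  1.0f, ff1 = in[i * 10 + 1] *  2.0f,
          ff2 = in[i * 10 + 2] *  3.0f, ff3 = in[i * 10 + 3] *  4.0f,
          ff4 = in[i * 10 + 4] *  5.0f, ff5 = in[i * 10 + 5] *  6.0f,
          ff6 = in[i * 10 + 6] *  7.0f, ff7 = in[i * 10 + 7] *  8.0f,
          ff8 = in[i * 10 + 8] *  9.0f, ff9 = in[i * 10 + 9] * 10.0f;
    out[i] = ff0 + ff1 + ff2 + ff3 + ff4 + ff5 + ff6 + ff7 + ff8 + ff9;
}
```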
At the moment, I cannot think of anything that would cause a register usage difference of ten registers based on the conditions described.
I will note that register usage by itself is not strongly correlated to performance. At present, which variant is the faster one (array or scalars), according to the CUDA profiler? The CUDA compiler tries to deliver maximum performance, and it may have applied an optimization that allows it to squeeze more performance out of the variant using scalars that is inhibited by the use of an array (although arrays of size 10 with compile-time constant indexing are typically scalarized by the compiler “under the hood”, which makes the present case somewhat puzzling).
Your insights have been very enlightening to me. Here is a portion of my code: when I convert the defined array ffeq[10] into individual variables ffeq1, ffeq2, ffeq3, and so on, the number of registers per thread increases significantly. However, at the same occupancy, the array version is indeed slower than the scalar version.
```
float ffeq[10], ff[10], dd, uu, vv, tao;
ffeq[1] = 1.0/9.0  * dd*(1.0 + 3.0*uu + 3.0*uu*uu - 1.5*vv*vv);
ffeq[2] = 1.0/9.0  * dd*(1.0 - 3.0*uu + 3.0*uu*uu - 1.5*vv*vv);
ffeq[3] = 1.0/9.0  * dd*(1.0 + 3.0*vv + 3.0*vv*vv - 1.5*uu*uu);
ffeq[4] = 1.0/9.0  * dd*(1.0 - 3.0*vv + 3.0*vv*vv - 1.5*uu*uu);
ffeq[5] = 1.0/36.0 * dd*(1.0 + 3.0*uu + 3.0*vv + (uu*uu + vv*vv)*3.0 + uu*vv*9.0);
ffeq[6] = 1.0/36.0 * dd*(1.0 - 3.0*uu + 3.0*vv + (uu*uu + vv*vv)*3.0 - uu*vv*9.0);
ffeq[7] = 1.0/36.0 * dd*(1.0 - 3.0*uu - 3.0*vv + (uu*uu + vv*vv)*3.0 + uu*vv*9.0);
ffeq[8] = 1.0/36.0 * dd*(1.0 + 3.0*uu - 3.0*vv + (uu*uu + vv*vv)*3.0 - uu*vv*9.0);
ffeq[9] = 4.0/9.0  * dd*(1.0 - 0.5*(uu*uu + vv*vv)*3.0);
ff[1] = ffeq[1] - (ffeq[1] - ffeqi[1])*tao;
ff[2] = ffeq[2] - (ffeq[2] - ffeqi[2])*tao;
ff[3] = ffeq[3] - (ffeq[3] - ffeqi[3])*tao;
ff[4] = ffeq[4] - (ffeq[4] - ffeqi[4])*tao;
ff[5] = ffeq[5] - (ffeq[5] - ffeqi[5])*tao;
ff[6] = ffeq[6] - (ffeq[6] - ffeqi[6])*tao;
ff[7] = ffeq[7] - (ffeq[7] - ffeqi[7])*tao;
ff[8] = ffeq[8] - (ffeq[8] - ffeqi[8])*tao;
ff[9] = ffeq[9] - (ffeq[9] - ffeqi[9])*tao;
```
Does the compiler place the array in local memory? You can check this by compiling with -Xptxas "-v". If this shows a stack frame of 0 bytes, it is not the case. If the stack frame is > 0, check whether array accesses in the SASS code involve STL or LDL instructions.
If in the array version the array is placed in local memory by the compiler instead of registers, it would be no surprise that the non-array version is faster and uses more registers.
[Please put source code blocks between two lines consisting of ``` to ensure proper markup].
It is impossible to make a diagnosis from code snippets like the one shown. The code must be compilable in order to inspect the generated code. From what is shown in the code, I would expect the compiler to scalarize the local arrays, as all indexing is compile-time constant and the arrays are small. Without scalarization of the arrays (which may be impeded by usage not shown in the snippet), the array version will require fewer registers than the manually scalarized version. It will also be slower.
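To illustrate the kind of usage that can impede scalarization (a hypothetical pattern, not taken from your code): indexing the array with a value that is only known at run time generally forces the compiler to keep the array addressable, which typically means local memory, since registers cannot be indexed.

```
// Hypothetical example of usage that typically defeats scalarization.
__global__ void dynamic_index(const int* __restrict__ idx, float* __restrict__ out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float ff[10];
    #pragma unroll
    for (int k = 0; k < 10; k++) ff[k] = (float)(k + 1);  // constant indices: fine
    int j = idx[i] % 10;   // index only known at run time
    out[i] = ff[j];        // typically forces ff[] into local memory (STL/LDL in SASS)
}
```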
Note: The floating-point literals in this code are all of type double, although the data being processed is of type float. This will cause (by prescribed C++ type promotion) the entire computation to be performed in double precision and add conversions between float and double, with a resulting negative performance impact. Unless this was done on purpose, you would want to use floating-point literal constants of type float, that is, with an f suffix. Examples: 1.0f, 9.0f, 3.0f, 1.5f.
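Applied to the first line of the snippet shown above, that would look as follows (a sketch based on your posted code):

```
// Same computation, but with single-precision literals, so the whole
// expression is evaluated in float instead of double:
ffeq[1] = 1.0f/9.0f * dd*(1.0f + 3.0f*uu + 3.0f*uu*uu - 1.5f*vv*vv);
```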
Thank you for your continued attention to this issue. I’m glad to report that I was able to successfully resolve the problem.