The code is quite extensive, so I cut most of it. I know that may be too much, but I can hardly imagine the removed parts are relevant.
__global__ void kernel(uint64_t in, uint64_t *out) {
    uint32_t tid = threadIdx.x + blockDim.x * blockIdx.x;
    uint32_t a = 0;
    uint32_t b[32] = { 0 };
    uint32_t c[32] = { 0 };
    // lots of stuff_1 done with a, c, in
    // from now on, c isn't used anymore
    // lots of stuff_2 done with a, b, in
    if (a == 0x01234567) {
        printf("---");                              // option 1
        // printf("result: %08x", calc_result(b));  // option 2
        // printf("result: %08x", calc_result(c));  // option 3
    }
}
Now, if I use option 1 to print "---" in order to show that there is a result, the kernel runs 20% slower than with option 2, where I print the result itself after additional calculation.
calc_result() does some bit shifting and logical operations. Disregarding kernel details, how can a simple printf("---") slow the kernel down by 20% compared to printing the result of an additional calculation? I don't see what could cause this opposite-of-expected performance behavior.
Nsight Compute confirms the performance difference, but I can't find the reason behind it.
I tried a lot!
Translation: I have all this information at my disposal but cannot figure out what is going on here. If I discard most of that information and provide what little is left to a third party, surely they can figure it out.
Yes, it's all about performance. printf() will fire very rarely (about once an hour). Question: does the simple fact that printf() exists in the kernel code, even though it is not executed 99.99999% of the time, slow down the rest of the kernel as well? I noticed that the body of the if() statement, though virtually never run, causes huge performance differences, even when ptxas register/memory usage remains identical or very similar.
I tried omitting printf() entirely and writing the result to a kernel argument instead. The performance didn't change at all (same as with printf()).
I have also tried in the past to find bottlenecks by reading SASS/PTX, but I never managed to pin down the reason for my problem.
I'm fully aware that it is very difficult to help with the little code I can show. There are non-disclosure agreements involved, and I have to take great care about what I post. I would prefer to show the complete code, but …
I was hoping to hear about general possibilities/hints, like:
nvcc compiles such that it frees variables mid-execution if they are not used for the remainder of the code, and this freeing process can cost performance. In my case, variable b would be such a candidate, which would explain why option 2 is faster, since b isn't freed.
Adding code can possibly increase performance if … (??? when would this happen)
…
General hints would certainly help. I need some pointers on which parts (in general) to look at when rarely executed code slows down an entire kernel.
From observation across forty years of software optimization:
Generally speaking, the injection of printf() into performance-sensitive code tends to be quite intrusive, even if it never actually prints. This is true for both host and device code. Typical reasons are the (high) resource usage associated with it and the fact that a call to a function with side effects can act as a de facto code-motion barrier.
A lower impact solution is typically to capture a bare minimum of raw data (maybe just a single flag) inside the performance-sensitive code that later triggers printing outside of it.
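The flag-capture idea might look like this (a minimal sketch only; the parameter name hit_flag and the kernel shape are made up for illustration, not taken from the original code):

```
__global__ void kernel(uint64_t in, uint64_t *out, uint32_t *hit_flag) {
    uint32_t a = 0;
    // ... heavy computation ...
    if (a == 0x01234567) {
        *hit_flag = 1;   // cheap: one global store instead of a printf() call
    }
}

// Host side, after the kernel finishes and hit_flag is copied back
// with cudaMemcpy:
//     if (h_flag) printf("---\n");
```

This keeps the format string, the printf machinery, and the associated resource usage entirely out of the kernel; the hot path only pays for a rarely taken store.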
As for the performance difference observed between the two different format strings used with printf(): my primary hypothesis is that the measurement methodology is at fault; my secondary hypothesis is that this is an example of the butterfly effect applied to code generation.
Note that your performance measurements for “option 1” vs “option 2” are counterintuitive, causing me to question the validity of the data. Was the timing data accidentally swapped? Do the measurements consist largely of noise?
For host code, various compilers can now convert printf() calls with "pure" format strings like "---" into lighter-weight puts() calls. Of course, that tells us nothing about what happens with a device-side printf(), and I have never studied the impact on resource usage (e.g. registers, instructions) of various kinds of printf() format strings. A plausible assumption would be that resource usage for a simple, "pure" format string should be lower than for a somewhat more complicated format string with conversion specifiers.
IMHO, printf() isn't the culprit. As you suggested, I left printf() out and replaced it with atomicAdd() (in order to surface a flag to the host after kernel return), but I still see the 20% performance loss.
I measure code execution over 1 minute with samples every 5 seconds on a reduced grid; this looks correct to me. I used the CUDA timing functions (cudaEventRecord(), …).
Timings were (unfortunately) not swapped. They are indeed very counterintuitive. The measurements are quite stable, with <1% variation across multiple kernel calls.
Some more testing …
I noticed that my issue might be related to where variables are stored. I have added uint32_t c[32] together with option 3 to the snippet in my original post.
To summarize:
Option 1 - slow (printf of a literal string)
Option 2 - fast (printf of calc_result(b))
Option 3 - slow (printf of calc_result(c))
b[] and c[] are both located in registers (to the best of my knowledge; how can I confirm this?). Option 3 (accesses c[]) is slower than option 2 (accesses b[]). Notice that c[] is no longer used once stuff_2 is processed. Could it be that the compiler moved c[] from registers to local memory after stuff_1 completed? That would be my only explanation for why calc_result(b) (registers) is faster than calc_result(c) (local memory).
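Regarding "how can I confirm this?": two standard checks, assuming a typical nvcc toolchain (kernel.cu and kernel.o are placeholder names here):

```
# Report per-kernel register count and local-memory usage at compile time.
# Look for lines like "N bytes stack frame, N bytes spill stores,
# N bytes spill loads" in the ptxas output.
nvcc -Xptxas -v -c kernel.cu

# Inspect the generated SASS for local-memory traffic:
# LDL/STL instructions are loads/stores to thread-local memory.
cuobjdump -sass kernel.o | grep -E 'LDL|STL'
```

If LDL/STL instructions appear in the code paths that touch c[] but not in those that touch b[], that would support the spill-to-local-memory hypothesis.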
Besides this, it still doesn’t explain why a simple printf("---") is so slow.
Note: b[] and c[] (arrays) are always in local memory. However, I copied their elements into kernel variables, which reside in registers (e.g. uint32_t b0 = b[0]).
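That copy-to-registers pattern can be sketched as follows (illustrative only; variable names are made up):

```
// Arrays indexed with runtime values typically end up in thread-local
// memory, because registers are not addressable. Scalar copies, by
// contrast, can be register-allocated by the compiler.
uint32_t b0 = b[0];
uint32_t b1 = b[1];
// ... use b0/b1 in the hot path instead of b[i] ...
```

Note that if every index into an array is a compile-time constant and the array is fully unrolled, the compiler may keep the whole array in registers on its own; the explicit scalar copies just make that outcome more likely.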
The if() at the end is keeping the compiler from dead-stripping a lot of code. It isn't clear whether there are any side effects in the removed code. You can run all three options in Nsight Compute and use the baseline feature to compare the three variants. If you cannot easily spot the major difference on the Details page, I would recommend opening each report in a separate copy of Nsight Compute and manually comparing the rolled-up counters in the Source View. Once you find a major difference there, you can inspect sub-sections of the SASS.