Different output of code when not unrolling loop

Hi all,

I have run into an odd issue, where it appears the addition of “#pragma unroll 1” changes the output of my code, producing a wrong result.

The code is:

#include <stdio.h>

__global__ void reference(float *out, int page_size, int N_pages) {
	for(int i = 0; i < N_pages; i++) {
		out[i * page_size] = out[i * page_size];
	}
}

__global__ void pipelined(float *out, int page_size, int N_pages) {
	const int P_depth = 2;
	float ld, val;
	#pragma unroll 1
	for(int i = -P_depth; i < N_pages; i++) {
		// For the current iteration, use pipeline-loaded value.
		if(0 <= i)
			out[i * page_size] = val;
		
		// Grab loaded result to pass to next iteration.
		val = ld;
		
		// Load two iterations ahead.
		if(i+P_depth < N_pages)
			ld = out[(i+P_depth) * page_size];
	}
}

int main() {
    int N_pages = 8;
    int page_size = 1534 * 168;
    int N = page_size * N_pages;
    float * h_res = (float*) malloc(sizeof(float)*N);
    float * h_res_ref = (float*) malloc(sizeof(float)*N);
    float * d_res_ref, * d_res;
    cudaMalloc((void**)&d_res, sizeof(float)*N);
    cudaMalloc((void**)&d_res_ref, sizeof(float)*N);
    
    // Now write 0...N-1 into h_res
    for(int i = 0; i < N; i++) {
    	h_res[i] = i;
    	h_res_ref[i] = i;
    }
    cudaMemcpy(d_res, h_res, sizeof(float)*N, cudaMemcpyHostToDevice);
    cudaMemcpy(d_res_ref, h_res_ref, sizeof(float)*N, cudaMemcpyHostToDevice);
    
    // COMPUTE
    reference<<<1, 1>>>(d_res_ref, page_size, N_pages);
    pipelined<<<1, 1>>>(d_res, page_size, N_pages);
    
    // Check result
    cudaMemcpy(h_res, d_res, sizeof(float)*N, cudaMemcpyDeviceToHost);
	cudaMemcpy(h_res_ref, d_res_ref, sizeof(float)*N, cudaMemcpyDeviceToHost);
	cudaDeviceSynchronize();
	
    for(int i = 0; i < N; i++) {
		if(h_res[i] != h_res_ref[i]) {
    		printf("ERROR: Different value on page %d!\t%f != %f\n", i/page_size, h_res[i], h_res_ref[i]);
    		// break;
		}
	}
	
	cudaFree(d_res);    cudaFree(d_res_ref);
	free(h_res);    free(h_res_ref);
}

Compiled with NVCC 11.7 using arguments “-arch=sm_86”.

The error appears to only happen when unrolling is disabled.

I noticed in some other code I get the opposite problem, where the addition of e.g. “#pragma unroll 2” introduces error.

I see a use of the variable ld prior to initialization in the following line of code:

// Grab loaded result to pass to next iteration.
val = ld;

Hypothesis: The observations regarding unrolling will disappear when all variables are initialized prior to their first RHS use.

You could file a bug.

I confirm the observation by njuffa, in my test case, if I change this:

float ld, val;

to this:

float ld = 0, val = 0;

then the observation is no longer reproducible. I agree that such an assignment:

	val = ld;

on the first pass of the loop is UB, however I’m not sure if the scope of that UB is for the contents of the variable itself or the entire program. I’m not enough of a C++ “lawyer” to know what the compiler is allowed to do in that case. With respect to the the first pass of the loop, that assignment, and indeed the actual values of val and ld have no bearing on the output of the program (other than what an exception due to UB may “allow” the compiler to do.)

At least as a work-around, then I suggest (as njuffa did) that you initialize these variables.

In colloquial parlance UB equates to “nasal daemons”: Anything can happen. In the C++ standard this reads as follows (here the 2011 version):

Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message)

1 Like

then I guess this is not a compiler issue. Filing a bug may be unproductive. I defer to the wisdom/knowledge of njuffa.

Hehe, I guess it is indeed a nasal demon.

Thank you so much for the quick responses.

I am not a C++ language lawyer, and I find it notoriously difficult to recall what constructs specifically constitute undefined behavior, unspecified behavior, and implementation-defined behavior.

Use of uninitialized data is something that is relatively easy to find and fix. More insidious are typically various instances of UB that can arise in the use of signed integers.

1 Like

I just avoid unsigned integers whenever possible for that reason; there is too much to think about, with very little benefit.

Compilers usually cannot detect the use of uninitialized data with certainty. That is why corresponding error messages typically read something like “potentially uninitialized variable ‘name’”, where “potentially” is the operative term. Additional tools like valgrind (or code reviews :-) are needed to find uses of uninitialized data. You might want to experiment with CUDA’s tool compute-sanitizer, specifically --tool initcheck.

From what I understand, compilers have even less of a chance of reliably diagnosing various kinds of UB, and since reporting a bunch of false positives just serves to make programmers upset no such attempts are typically made. The current CUDA toolchain is based on LLVM. Here is a three-part blog post from the LLVM folks on the topic of undefined behavior that is certainly worth reading. The title refers to C programmers, but from what I can see all the examples also apply to C++ (which is derived from C, after all):

In terms of UB, it is signed integers that are often problematic. Overflow of signed integer expressions triggers UB, while the overflow behavior is unambiguously defined for unsigned integers (wrapping modulo 2n). The flip side is that operations on signed integers can often be optimized more aggressively by the compiler, as it can simply assume that no situation triggering UB ever occurs. That is possible because in the presence of UB, anything is allowed to happen, so if the compiler’s assumption does not hold, no specific behavior on its part is required.

For various reasons, the conventional wisdom for languages in the C family is “Every integer wants to be an int, unless there is a darn good reason for it to be some other type”. In my experience, this heuristic hols in the vast majority of cases.

Exactly! int is the way to go. I’m mostly writing research/experimental code, so I don’t want to waste time thinking “hmm, should I use size_t here or no”, or the like, and then have to worry about overflow semantics.
Overall, I prefer a higher-level language just because it is faster to try out new ideas, and then CUDA to accelerate hotspots. The article was a good read.

In my code example I’m trying to pipeline a loop with a two-stage software pipeline (the line “val=ld;” was originally “val=sqrt(ld);”) to hide scoreboard latencies. I find this is always a better optimization than unrolling (given there are registers to spare), but what’s really frustrating for me is that suboptimal code is produced if the pipelined loop is unrolled, as iterations are then not independent anymore.

Ideally, the loop must be unrolled, and first THEN pipelined, which requires me to manually implement both of these error-prone optimizations. This gave me a quite a boost in performance (over 2.5x faster than the reference), but with this manual edit, the code was also quite sensitive to the ordering of lines; with more-register-using orderings being faster, up to a limit… I really wish NVCC could do more to help.

Maybe it is possible to implement unrolling and pipelining with preprocessor macros, I’m not sure. Even then, it appears NVCC does a poor job of reordering my instructions. The heuristic should simply be to maximize the flight-time of the registers involved in scoreboard operations.

I’ll have to try the clang compiler (e.g. write an optimizer pass) or maybe write my own code generation, I think.

Does this software-pipelining approach lead to robust performance improvements on GPUs? It sounds like the experience shows that the answers to that is mostly “no”. Which isn’t surprising because trying to manipulate the compiler through source-level transformations is often brittle: with the next CUDA release you get to rework the whole thing all over again. Been there, done that, built a t-shirt collection.

FWIW: When I read software-pipelining earlier in this thread I had nightmare flashbacks to MIPS platforms from thirty years ago: No, not that again!

I worked closely with the CUDA compiler team for some 4.5 years back in about the 2008 to 2012 timeframe, and from looking at the generated code these days, I find little to complain about. If you find demonstrable shortcomings in code generation or instruction scheduling or anything else, consider filing bugs or enhancement requests with NVIDIA. NVIDIA is generally responsive to enhancement requests from customers, but of course the numerous requests must all be collated an prioritized. Proposal with wider applicability to CUDA-accelerated applications have a higher the chance to be worked on.

In judging the quality of the generated code you might also consider the possibility that NVIDIA’s compiler team has access to more detailed information about the hardware than the general public, and what might look like a better instruction schedule to you may actually not provide benefits.

An example. The original function

__global__ void reference(float *out, int page_size, int N_pages) {
	for(int i = 0; i < N_pages; i++)
		out[i * page_size] = sqrt(out[i * page_size]);
}

Runs in 553 usec, uses 22 registers. Then assuming N_pages>3 and is a multiple of two:

__global__ void pipelined(float *out, int page_size, int N_pages) {
	float val=0, val1=0, val3=0, ld=0, ld1=0;
	
	ld = out[(0+2) * page_size];
	ld1 = out[(1+2) * page_size];
	val = sqrt(out[0 * page_size]);
	val1 = sqrt(out[1 * page_size]);
	
	int i = 0;
	#pragma unroll 2
	for(i = 0; i+1+2*2 < N_pages; i+=2) {
		// asm ("prefetch.global.L1 [%0];" :: "l"(&out[(i+2*2+12) * page_size]));
		// asm ("prefetch.global.L1 [%0];" :: "l"(&out[(i+1+2*2+12) * page_size]));
		out[i * page_size] = val;
		out[(i+1) * page_size] = val1;
		
		val = sqrt(ld);
		ld = out[(i+2*2) * page_size];  // <- Two iterations ahead, times two to account for manual unrolling
		val1 = sqrt(ld1);
		ld1 = out[(i+1+2*2) * page_size];
	}
	
	for(; i < N_pages; i++)
		out[i * page_size] = sqrt(out[i * page_size]);
}

Which runs in 178 usec and uses 30 registers: 3.1x speed-up.
Additionally using prefetch brings the time down to 105 usec: 5.25x speed-up, and uses 40 registers!
But it is quite sensitive to ordering of the lines. The above ordering is good as loads are put in-flight as soon as they have been consumed - and then it is just a question of putting as much stuff in between this cycle as possible to maximize flight-time for the loads.

My use case is that I am beamforming ultrasound images, and what we get is essentially a (variable, but) strided access (each transducer records a line/page of audio).
Then this computation eventually ends up being latency-bound.

In my experience NVCC never does pipelining, but it aggressively unrolls loops, which is odd.
Pipelining can be superior in terms of less register use in many cases.
See, e.g., the attached example pipelined.pdf (85.0 KB). The reason is essentially that pipelining is a way to extend the life-time of a register beyond the one loop iteration, increasing the potential for instruction-level parallelism.

Many loops start with a LDG, and unrolling will just put more LDGs in-flight, but this just amortizes the stall as there is very little compute to hide the latencies. With pipelining, the LDG could be pipelined from a previous iteration, allowing all the computation in the previous iteration of the loop to hide its latency.

Of course the combination of both optimizations can lead to more optimal code than only one. Thread-level parallelism + unrolling gives many instructions to help in hiding stall, but it is not sufficient for a high-latency LDG as in my use case - and pipelining was essential for me to reach a real-time processing of the data.

Anecdotally, I seem to be seeing much more aggressive unrolling from nvcc in CUDA 11 than in previous CUDA versions. That may be due to observational bias on my part, or it may be due to actual changes to compiler heuristics, possibly (speculation!) driven by changes in GPU architecture that make that an attractive strategy (e.g. if ICache size increased in recent GPUs compared to older models).

Generally speaking, loop unrolling, and especially full unrolling of loops, enables further optimizations. For example, it can allow the scalarization of small local arrays, and it can improve the mobility of loads (especially when pointer arguments use the __restrict__ attribute) which can improve latency tolerance. After function inlining loop unrolling is probably the most effective simple high-level transformation a compiler can apply to device code. That is why the CUDA compiler loves to apply both more aggressively than is typically seen with CPU toolchains (although if I look at code produced by the latest Clang, it seems to be moving in that direction as well).

Keep in mind that the GPU’s main mechanism for covering latency is to have tens of thousands of threads in flight. It is the programmer’s responsibility to ensure there is sufficient parallelism for the hardware to exploit. A task well suited to GPUs should be throughput constrained (that could be memory or computational throughput, or both), not latency constrained. Latency constrained task are often a better fit for CPUs, especially those with high single-thread performance.

I retired from NVIDIA eight years ago and have not talked to their compiler guys since. I do not recall software pipelining being part of the game plan “back in the days”, and I see no evidence of it in code generated by CUDA 11. Maybe there are other people beside me who have bad memories of going down that path?

As I stated, the best way for regular CUDA programmers (that do not work for organizations that have a closer relationship with NVIDIA) to influence future product development is to submit well-reasoned enhancement requests supported by data, or well-documented bug reports. Both can be filed through the online bug reporting form. It is usually a good idea to prefix the synopsis of enhancement requests with RFE: so they get sorted into the correct bin right away.

You’re right, using const * __restrict__ (and not modifying in-place) results in code much less sensitive to ordering - good catch!
I’ll try making a request.