CUDA compiler bug or user error?

I’m reluctant to point the finger at the compiler, but this behaviour seems suspicious, especially since a trivial and seemingly equivalent change resolves the problem for no clear reason.

For now I’ll just post the kernel, but I can give an example of host execution too if necessary.

As the code comments mention, the kernel takes an m x n matrix A and an m-length vector v and, for every row of A, adds the corresponding element of v to each element of that row in place. I’ve tried to optimize a bit using some loop unrolling.

// takes A(m x n) and v(m); adds an element from v in-place to each row of A.
// note, it is assumed that:
// int block_width = (n + blocks_per_row - 1) / blocks_per_row; // number of elements within a single row a block calculates
// int unroll_stride = (block_width + unroll_factor - 1) / unroll_factor;
// blockDim.x = (block_width + unroll_factor -1) / unroll_factor;
template <unsigned int unroll_factor>
__global__ void add_by_rows_kernel(const int m, const int n, Real * A, const Real * v, const int blocks_per_row, const int block_width, const int unroll_stride)
{
	int this_row = blockIdx.x / blocks_per_row;
	int curr_block_this_row = blockIdx.x % blocks_per_row;

	// calculate pointer to first element applicable to this thread
	Real * starting_point = A + n * this_row // beginning of row
		+ block_width * curr_block_this_row // beginning of this block
		+ threadIdx.x; // first element within sub-block for this thread

	// choose largest max_i s.t.
	// block_width * curr_block_this_row + threadIdx.x + (max_i - 1)* unroll_stride < n
	int max_i = (n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1) / unroll_stride;

	// get the element we're adding and store it in shared memory
	__shared__ Real ele;
	ele = v[this_row];

	// register array for intermediary calculations
	Real arr[unroll_factor];

	// get data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			arr[i] = starting_point[i * unroll_stride];
		}
	}

	// process data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			arr[i] += ele;
		}
	}

	// store data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			starting_point[i * unroll_stride] = arr[i];
		}
	}
}

When I debug it in Nsight with the memory checker on, it flags a bunch of memory violations. These occur inside the for loops because the bounds-checking logic

if (i < max_i)

is not working correctly. Specifically, a watch on max_i reveals a ridiculously large positive number in cases where the expression

int max_i = (n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1) / unroll_stride;

should actually be negative, and therefore stop memory access completely for those threads. (Yes, I am currently running with more threads than the matrix size requires.)

What’s especially strange is that the problem can be trivially resolved in one of two ways:

  1. Breaking up the calculation of max_i into
int max_i_num = n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1;
int max_i_den = unroll_stride;
int max_i = max_i_num / max_i_den;

This stops the memory access errors, and when I break I can see that max_i is correctly negative in the cases where it should be.

  2. Getting rid of max_i completely and replacing all the if statements with the more verbose (and computationally more expensive)
if (block_width * curr_block_this_row + threadIdx.x + i * unroll_stride < n)

Again this stops the memory access errors.

I’m hoping that this can either be confirmed as some kind of compiler optimization bug, or some explanation given for the problem (and why either of these simple changes resolves it).

I would suggest providing a complete repro case with a test (e.g. if…printf) that indicates when the unexpected large positive value of max_i occurs (i.e., make it easy for others to help you). Also, please indicate which CUDA version and driver version you are using.

Here’s the complete example. I’ve modified the kernel to calculate max_i both ways and print a message (via printf) whenever they don’t match, along with the values used to calculate them. The results make it clear that max_i is not calculated correctly the original way, but for some reason is correct the second way.

I’m using CUDA 8.0, compute capability 5.0, and driver version 21.21.13.7651, running on a GTX 1060.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>

// takes A(m x n) and v(m); adds an element from v in-place to each row of A.
// note, it is assumed that:
// int block_width = (n + blocks_per_row - 1) / blocks_per_row; // number of elements within a single row a block calculates
// int unroll_stride = (block_width + unroll_factor - 1) / unroll_factor;
// blockDim.x = (block_width + unroll_factor -1) / unroll_factor;
template <unsigned int unroll_factor>
__global__ void add_by_rows_kernel(const int m, const int n, float * A, const float * v, const int blocks_per_row, const int block_width, const int unroll_stride)
{
	int this_row = blockIdx.x / blocks_per_row;
	int curr_block_this_row = blockIdx.x % blocks_per_row;

	// calculate pointer to first element applicable to this thread
	float * starting_point = A + n * this_row // beginning of row
		+ block_width * curr_block_this_row // beginning of this block
		+ threadIdx.x; // first element within sub-block for this thread

	// choose largest max_i s.t.
	// block_width * curr_block_this_row + threadIdx.x + (max_i - 1) * unroll_stride < n
	int max_i_num = n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1;
	int max_i_den = unroll_stride;
	int max_i_check = max_i_num / max_i_den;
	int max_i = (n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1) / unroll_stride;

	// get the element we're adding and store it in shared memory
	__shared__ float ele;
	ele = v[this_row];

	// register array for intermediary calculations
	float arr[unroll_factor];

	// get data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		// if values don't match print details
		if (max_i != max_i_check)
			printf("n=%d, block_width=%d, curr_block_this_row=%d, threadIdx.x=%d, unroll_stride=%d, max_i=%d, max_i_check=%d\n",
				n, block_width, curr_block_this_row, threadIdx.x, unroll_stride, max_i, max_i_check);

		if (i < max_i)
		{
			arr[i] = starting_point[i * unroll_stride];
		}
	}

	// process data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			arr[i] += ele;
		}
	}

	// store data
#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			starting_point[i * unroll_stride] = arr[i];
		}
	}
}

// helper function to simplify calling
#define ADD_UNROLL_FACTOR 8
#define ADD_THREADS_PER_BLOCK 256
void device_add_by_rows(const size_t m, const size_t n, float * A, const float * v)
{
	const size_t total_threads_per_row = (n + ADD_UNROLL_FACTOR - 1) / ADD_UNROLL_FACTOR;
	const size_t blocks_per_row = (total_threads_per_row + ADD_THREADS_PER_BLOCK - 1) / ADD_THREADS_PER_BLOCK;
	const size_t block_width = (n + blocks_per_row - 1) / blocks_per_row;
	const size_t unroll_stride = (block_width + ADD_UNROLL_FACTOR - 1) / ADD_UNROLL_FACTOR;
	const size_t total_blocks = blocks_per_row * m;

	add_by_rows_kernel<ADD_UNROLL_FACTOR><<<(int)total_blocks, ADD_THREADS_PER_BLOCK>>>((int)m, (int)n, A, v, (int)blocks_per_row, (int)block_width, (int)unroll_stride);

	cudaDeviceSynchronize();
}

int main()
{
	int m = 13;
	int n = 15;

	float * A, *v;
	cudaMalloc((void**)&A, m*n * sizeof(float));
	cudaMalloc((void**)&v, m * sizeof(float));

	device_add_by_rows(m, n, A, v);

	cudaFree(A);
	cudaFree(v);
}

Why this:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

Beware the mixing of signed and unsigned integers, leading to unsigned integer computation in the max_i case. You want:

max_i = (n - block_width * curr_block_this_row - (int)threadIdx.x + unroll_stride - 1) / unroll_stride;

In the max_i_check case, the intermediate conversion to ‘int’ (via the assignment to max_i_num) gets you back on track.
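To make the conversion rule concrete, here is a minimal host-side sketch (the variable values are made up, not taken from your kernel; the same promotion happens on the device with threadIdx.x):

#include <stdio.h>

int main()
{
	int n = 15;
	unsigned int tid = 200;	// stands in for threadIdx.x
	int stride = 2;

	// Mixed signed/unsigned arithmetic: the signed operands are converted to
	// unsigned, "n - tid" wraps around to a huge value, and the division is
	// performed in unsigned arithmetic, so the quotient is a huge positive number.
	int bad = (n - tid + stride - 1) / stride;

	// Mirrors the max_i_num / max_i_den workaround: converting the numerator
	// back to 'int' before the division recovers the negative value (via
	// wrap-around on the usual two's-complement platforms).
	int num = n - tid + stride - 1;
	int recovered = num / stride;

	// The direct fix: cast the unsigned operand so the whole computation stays signed.
	int fixed = (n - (int)tid + stride - 1) / stride;

	printf("bad = %d, recovered = %d, fixed = %d\n", bad, recovered, fixed);
	return 0;
}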

Oh derp, thx. This is why I’m always reluctant to implicate the compiler ;)

Dunno, thought it was just what the cool kids did. When are these headers actually needed?

The CUDA compiler pulls in CUDA-specific header files when you compile code in a .cu file. The fact that these includes are pulled in “under the hood” was a conscious design decision at the birth of CUDA, to make CUDA programs as similar to comparable C/C++ programs as possible (lower perceived barrier to entry).

CUDA header files need to be included when calling CUDA APIs from C or C++ code.
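For illustration, a minimal sketch (hypothetical file, not from this thread): a plain .cpp translation unit compiled by the host compiler must include the runtime header itself, whereas a .cu file compiled by nvcc gets it pulled in automatically.

// use_cuda_from_cpp.cpp -- compiled by the host C++ compiler and linked
// against cudart, so the CUDA runtime header must be included explicitly
#include <cuda_runtime.h>	// declares cudaMalloc, cudaFree, cudaGetErrorString, ...
#include <stdio.h>

int main()
{
	float * d_buf = nullptr;
	cudaError_t err = cudaMalloc((void**)&d_buf, 256 * sizeof(float));
	if (err != cudaSuccess)
		printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
	cudaFree(d_buf);
	return 0;
}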

The definition of

uint3 __device_builtin__ __STORAGE__ threadIdx;

using an unsigned integer is an unfortunate design decision in my opinion, because it easily leads to the kind of signed/unsigned mismatch (with silent conversion to the “wider” unsigned type) in any language in the C/C++ family.

As a wise software engineering manager with extensive experience once told me: in C/C++ every integer wants to be ‘int’, unless there is a darn good reason for it to be something else. The fact that the value cannot be negative is not a “darn good reason”.

I have found that to hold true in my own experience. In the CUDA case, I immediately assign threadIdx.x etc to my own ‘int’ variables, e.g. tid, tidX, tidY.
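For illustration, a minimal sketch of that pattern (kernel name and parameters are made up):

__global__ void scale_kernel(const int n, float * data, const float factor)
{
	// make signed copies of the built-in unsigned indices once, up front,
	// and do all further index arithmetic and comparisons in 'int'
	int tidX = (int)threadIdx.x;
	int bidX = (int)blockIdx.x;
	int gid = bidX * (int)blockDim.x + tidX;

	if (gid < n)	// purely signed comparison; no surprise promotions
		data[gid] *= factor;
}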

If you use a signed int for an index that logically can never be negative, you could get the same range of values from an unsigned int of half the width (or, equivalently, twice as many useful bits at the same width). Better memory efficiency seems like a “darn good reason” to me :)

It should have occurred to me that threadIdx was unsigned, but the silent conversion certainly didn’t extend much charity. It seems like that’s perhaps where the blame should lie.

The fact is that one almost never really needs the additional range between 2^31 and 2^32 − 1, so one might as well go with ‘int’. And unless memory capacity or throughput are a limiting factor for a particular use case, it’s not a “darn good reason” to use a narrower integer type.

There is such a thing as the TCO (total cost of ownership) of code for an organization that maintains it, and it tends to go up when programmers start using unsigned types or small integer types in C/C++ without a concrete and pressing need. The manager in question had 20 years of experience in C/C++ (and 25 years of professional experience in total) when he related that rule of thumb. I have accumulated about 30 years by now, and have come to concur.

I am too lazy to check whether the rationale document for the ANSI C standard includes a justification for the signed-to-unsigned conversion. But given how long the C-family languages have been around (since 1978, I think; I first used C in 1983), it wouldn’t make sense to change basic mechanisms that have been in place since the stone age. I think the GNU toolchain can warn against signed/unsigned mismatches in particular, but generally speaking, warnings are not the strong suit of the CUDA toolchain.

Well I was born the year before you first used C, so you’ve got me beat on the experience front :) Perhaps we can’t change the mechanisms, but we can still blame them and feel smugly satisfied… then talk about the inevitable conquering of all things by Python while we play foosball with the cool kids.

JK, I’m terrible at foosball. As for Python, I fart in its general direction.

Did I mention we still carved individual bits by hand in those days :-)

Like many people back then I started out programming in BASIC, on a machine that had been upgraded from 4KB to 16KB of system memory. Yes, kilo-bytes. I also learned programming in Z80 machine language. No assembler available, so I entered the code byte-by-byte in hex via CP/M’s DDT debugger. Starting out with severe hardware and software restrictions was a great incentive for getting into software optimization work.

I bet it was! I may be younger than you, but the hipster in me wants to bring back the good ol’ days. Computers are so small, fast, and energy efficient these days. It’s terrible. Let’s bring back punch cards and vacuum tubes! Give me a hot room full of soothing glowing orange light anytime!

Honestly, I have no hankering to get back to the stone ages of computing anytime soon. I do wish students today would be exposed a bit more to how all that cool software maps to the underlying hardware, so they know what is likely efficient and what is likely inefficient use of that hardware.

In most cases software performance optimization also means energy optimization (not necessarily power, that could even go up), and that is a desirable goal, IMHO. No point in trashing the planet to support all that cool computational stuff we can do today, in no small part thanks to parallel computing technologies like CUDA: ultrasounds on a mobile phone, MRI scans with millimeter resolution in real time, multi-day weather forecasts with microclimate coverage on a square-mile grid, etc, etc.

Me neither, I’m just mocking my generation of hipster millennials and their absurd sense of nostalgia. Obviously technology is what moves us forward as a species, not getting stuck in the past. That said, I do like me some tubes when it comes to music amplification… but that’s for a different forum :)

I actually have a serious question about loop unrolling.

From my code above:

#pragma unroll
	for (int i = 0; i < unroll_factor; ++i)
	{
		if (i < max_i)
		{
			arr[i] = starting_point[i * unroll_stride];
		}
	}

Does the if statement interfere with instruction level parallelism? I.e. is the compiler smart enough to know it needs to evaluate all the if statements in all the threads first, so that it can then make coalesced/parallelized memory reads?

Instruction-level parallelism is rarely worth worrying about on GPUs. First, the hardware doesn’t actually provide much in terms of ILP mechanisms. Second, the major source of parallelism in GPUs is zero-overhead thread switching (this is a conceptual simplification!).

To check up on the compiler, I would suggest learning how to read (even if only in cursory fashion) the output from cuobjdump --dump-sass. While NVIDIA does not provide a detailed description of the machine instruction sets used in their GPUs, they do provide a short overview with one-line descriptions somewhere in the docs (sorry, don’t recall offhand where). [Later:] See section 4 of the CUDA Binary Utilities documentation in the CUDA Toolkit docs.

You can definitely spot load (LD, LDG) and store (ST) instructions, and if you see that they are wider than your native data type, you know you are getting vectorized access. I haven’t checked on the CUDA compiler’s ability to auto-vectorize loads and stores recently; I tend to use explicit vector types like ‘int4’ when I want them.
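As a minimal sketch of the explicit-vector-type approach (names made up; this assumes the buffer is 16-byte aligned and the element count is a multiple of 4):

__global__ void add_one_vec4(const int n4, float4 * data)
{
	int gid = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;

	if (gid < n4)	// n4 = number of float4 elements, i.e. n / 4
	{
		float4 v = data[gid];	// one 128-bit load (shows up as a .128 LDG in SASS)
		v.x += 1.0f;
		v.y += 1.0f;
		v.z += 1.0f;
		v.w += 1.0f;
		data[gid] = v;			// one 128-bit store
	}
}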

As for if-statements with small bodies, you will find that they are frequently converted into branchless code, e.g. by predicating the relevant instructions that make up the body of the if-statement (likewise for if-then-else statements). Again, you will be able to see in the disassembly whether branches (BRA) are used or predication (e.g. @P0, @!P1).

So what’s the deal then with loop unrolling to help hide memory latency? Ch. 3 of “Professional CUDA C Programming” has an example of a reduce algorithm that uses a template pattern similar to mine above and achieves an 8x performance improvement just from the unrolling. I had thought this was due to ILP, but is there something else going on?

Scheduling loads early can help cover memory latencies, which tend to be in the many hundreds of cycles for GPUs. This is more important if the number of threads running on an SM concurrently is small (i.e. low occupancy) meaning latency tolerance through thread switching is impaired. Various coding techniques help increase the “mobility” of loads during the compiler’s instruction scheduling process. E.g.

(1) Completely unrolling loops, resulting in straight-line code. The compiler may do this automatically based on various heuristics, or you can try to force it with an unrolling pragma.

(2) Use of restricted (__restrict__) pointer arguments. This is a promise to the compiler that there is no aliasing between the data reached through different pointers (and therefore no hidden dependencies). C/C++ allow aliasing, as opposed to Fortran, for example. Restricted pointers are an imperfect attempt to address the resulting performance issues (on both CPUs and GPUs).
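For illustration, a minimal sketch of (2) with made-up names; the __restrict__ qualifiers promise the compiler that out, a, and b never alias, so it is free to schedule both loads ahead of the store:

__global__ void axpy_restrict(const int n,
                              const float alpha,
                              const float * __restrict__ a,
                              const float * __restrict__ b,
                              float * __restrict__ out)
{
	int gid = (int)blockIdx.x * (int)blockDim.x + (int)threadIdx.x;

	if (gid < n)
		out[gid] = alpha * a[gid] + b[gid];	// no aliasing between out and a/b,
											// so the loads gain scheduling mobility
}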

Note that scheduling loads early tends to increase live ranges, driving up register pressure, which can cause lower occupancy and thus lower performance. With older GPU architectures this was a significant problem, since no tweaking of the compiler heuristics was optimal for all code, and the cliffs could be fairly steep. With newer architectures (>= sm_50), which provide a lot of registers, it is not so much of a problem anymore. I haven’t encountered it in the past three years, which doesn’t mean it could not affect someone’s code.

I think I get the idea of hiding latency by maximising active warps. My naive understanding is that this means a new warp can do something while the old warp is waiting for another operation to complete – say for its memory to finish loading.

The thing I never quite understood from that chapter is how loop unrolling helps you improve this. Wouldn’t putting more “work” into each thread reduce the number of active warps instead of increasing it?

Yes, I skipped over the warp execution mechanism when I mentioned thread-switching (as I stated, a conceptual simplification). NVIDIA’s own documentation provides sufficient information on it, I would say.

Loop unrolling typically removes work from each thread by eliminating the instructions that control the loop (comparisons, branches, counter/pointer increments). That’s why it is used on all processors as an optimization. The flip side is that excessive unrolling can lead (via larger code) to more instruction cache misses, which can be a more severe performance issue on CPUs than on GPUs. The same goes for function inlining (it saves on call/return/argument-passing overhead and improves the mobility of instructions during the scheduling phase of the compiler).
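A minimal sketch of the loop-control elimination described above (made-up device function; the trip count must be known at compile time for full unrolling):

// with a compile-time trip count, #pragma unroll produces straight-line code:
// no compare, no branch, no counter increment
__device__ float sum4(const float * __restrict__ x)
{
	float acc = 0.0f;
#pragma unroll
	for (int i = 0; i < 4; ++i)
		acc += x[i];
	// effectively compiled as:
	//   acc += x[0]; acc += x[1]; acc += x[2]; acc += x[3];
	return acc;
}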

The number of loads issued by each thread isn’t affected by the decision to unroll a loop (unless it leads to other changes to the code as well; I haven’t studied your particular example in detail). But it usually affords the compiler greater freedom in scheduling the loads early inside the dynamic instruction stream. The distance between the load and the first subsequent use of the load data is what enhances the latency tolerance (the larger, the better).

I am not familiar with the book you are using, but I have noticed that it seems to give rise to more questions in this forum than other CUDA-related publications. Maybe that’s because it is a very popular work, but I suspect it is at least partially a function of the authors not explaining concepts well, causing some confusion among the readers.

My recommendation for a CUDA book would be “The CUDA Handbook” by Nicholas Wilt. Nick was one of the first handful of engineers who created CUDA (as was I), and I find his writing to be exceptionally clear and concise. No, I am not getting any kickbacks for saying that. I also realize that different books on CUDA provide a different focus, and cater to different readers.

I might be using the wrong terminology then. What they referred to as “unrolling” was loading and processing 8 times as many values per thread and having 1/8th as many threads. In their reduce example they showed an 8x performance increase, which was counter-intuitive to me as this seems to be reducing active warps. On the other hand, if the problem is large enough that you’ll max out all your SMs either way, could doing more loads/work per thread streamline things?

That’s quite likely the case; I often feel confused by what they’re saying, even though they’re showing very material performance differences. What books/sources would you recommend for learning the nuances of CUDA performance optimisation?

Based on this description, I would not call this technique unrolling.

I learned CUDA as it was being created, so I learned much about CUDA-specific optimizations by experimenting and would encourage such exploratory work. I haven’t looked for books specifically focusing on hardcore CUDA optimization techniques, but NVIDIA’s own Best Practices Guide should make for an excellent starting point.