I’m reluctant to point the finger at the compiler, but this behaviour seems suspicious, especially since a trivial and seemingly equivalent change resolves the problem for no clear reason.

For now I’ll just post the kernel, but I can give an example of host execution to if necessary.

As the code comments mention, the kernel takes in an m x n matrix A and an m-length vector v, and to every element in a given row of A adds the corresponding element from v in-place. I’ve tried to optimize a bit using some loop unrolling.

```
// takes A(m x n) and v(m); adds an element from v in-place to each row of A.
// note, it is assumed that:
// int block_width = (n + blocks_per_row - 1) / blocks_per_row; // number of elements within a single row a block calculates
// int unroll_stride = (block_width + unroll_factor - 1) / unroll_factor;
// blockDim.x = (block_width + unroll_factor -1) / unroll_factor;
template <unsigned int unroll_factor>
__global__ void add_by_rows_kernel(const int m, const int n, Real * A, const Real * v, const int blocks_per_row, const int block_width, const int unroll_stride)
{
int this_row = blockIdx.x / blocks_per_row;
int curr_block_this_row = blockIdx.x % blocks_per_row;
// calculate pointer to first element applicable to this thread
Real * starting_point = A + n * this_row // beginning of row
+ block_width * curr_block_this_row // beginning of this block
+ threadIdx.x; // first element within sub-block for this thread
// choose largest max_i s.t.
// block_width * curr_block_this_row + threadIdx.x + (max_i - 1)* unroll_stride < n
int max_i = (n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1) / unroll_stride;
// get the element we're adding and store it in shared memory
__shared__ Real ele;
ele = v[this_row];
// register array for intermediary calculations
Real arr[unroll_factor];
// get data
#pragma unroll
for (int i = 0; i < unroll_factor; ++i)
{
if (i < max_i)
{
arr[i] = starting_point[i * unroll_stride];
}
}
// process data
#pragma unroll
for (int i = 0; i < unroll_factor; ++i)
{
if (i < max_i)
{
arr[i] += ele;
}
}
// store data
#pragma unroll
for (int i = 0; i < unroll_factor; ++i)
{
if (i < max_i)
{
starting_point[i * unroll_stride] = arr[i];
}
}
}
```

When I debug it in Nsight with the memory checker on, it flags a bunch of memory violations. This is happening within the for loops due to the fact that the bounds checking logic

```
if (i < max_i)
```

is not working correctly. Specifically, a watch on max_i reveals a ridiculously large positive

number in cases where the expression

```
int max_i = (n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1) / unroll_stride;
```

should actually be negative, and therefore stop memory access completely for those threads. (Yes, I am currently running with more threads than the matrix size requires.)

What’s especially strange is that the problem can be trivially resolved in one of 2 ways:

- Breaking up the calculation of max_i into

```
int max_i_num = n - block_width * curr_block_this_row - threadIdx.x + unroll_stride - 1;
int max_i_den = unroll_stride;
int max_i = max_i_num / max_i_den;
```

This stops the memory access errors, and when I break I can see that max_i is correctly negative in the cases where it should be.

- Get rid of max_i completely and replace all the if statements with the more verbose (and computationally expensive)

```
if (block_width * curr_block_this_row + threadIdx.x + i* unroll_stride < n)
```

Again this stops the memory access errors.

I’m hoping that this can either be confirmed as some kind of compiler optimization bug, or some explanation given for the problem (and why either of these simple changes resolves it).