Tensor core "Invalid __global__ write of size 4 bytes"

I am experiencing some strange effects for my kernel function:

/*
 *  Matrix A = M x N, B = N x K, C = M x K => OUT = M x K
 */
__global__ void wmma_matrix_mult_bColMajor(int N, int K, const half *a, const half *b, half *out) {

    // Declare the fragments
    wmma::fragment<wmma::matrix_a, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half> c_frag;

    const int outRowFragmentIdx = blockIdx.x / (K / TENSOR_SIZE);
    const int outColFragmentIdx = blockIdx.x % (K / TENSOR_SIZE);

    const int fragmentsToSum = N / TENSOR_SIZE;

    // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);

    for (int i = 0; i < fragmentsToSum; ++i) {

        const int aRowFragmentIdx = outRowFragmentIdx;
        const int aColFragmentIdx = i;

        const int bRowFragmentIdx = i;
        const int bColFragmentIdx = outColFragmentIdx;

        // Load the inputs
        wmma::load_matrix_sync(a_frag, &a[aRowFragmentIdx * TENSOR_SIZE * N + aColFragmentIdx * TENSOR_SIZE], N);
        wmma::load_matrix_sync(b_frag, &b[bColFragmentIdx * TENSOR_SIZE * N + bRowFragmentIdx * TENSOR_SIZE], N);

        // Perform the matrix multiplication
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store the output
    wmma::store_matrix_sync(&out[outRowFragmentIdx * TENSOR_SIZE * K + outColFragmentIdx * TENSOR_SIZE], c_frag, K, wmma::mem_row_major);
}

This multiplies two matrices A and B (both dimensions have to be multiple of 16).

The function works for example with sizes up to 16x80 multiplied with 3600x80.
But for 16x80 multiplied with 15376x80 and run via compute-sanitizer ./main I get 37 errors like this:

Invalid global write of size 4 bytes
========= at 0x1270 in kernels.h:110:wmma_matrix_mult_bColMajor(int,int,__half const *,__half const ,__half)
========= by thread (191,0,0) in block (1945,0,0)
========= Address 0x7f3a269259cc is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:cuLaunchKernel [0x7f3a5eb355d8]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x55ad63c962cb]

With 16x80 multiplied with 63504x80 I get 25 of those errors.

I am wondering how can this be caused by my code because I don’t write any memory - the wmma functions do. And 4 bytes sound similar to the 4x4 matrix multiplications the Tensor Cores are doing internally?

Anyone here who knows how I can get more detailed debug messages or even knows why this happens?

Any hints / tips are greatly appreciated.

“Honest gov, I wasn’t running off the road, the car did that”. Yeah, but you were at the steering wheel.

The first thing you would want to do is identify the line in the source code that gives rise to those errors. Then reduce the problem size as much as possible, that is, such that the problem still manifests itself. Now ponder carefully the data operated on in the code at fault. Are pointers, sizes, strides, dimensions specified correctly? Do they correctly correspond to the size of underlying allocation(s)?

Does looking at line 110 in your kernels.h file shed any light on it? (I’m guessing it is the store_matrix_sync line). A global write would presumably only happen there, anyway.

The memory regions that your code will access are based on the pointers you pass. It seems odd to ask for help in this way while showing nothing at all about what the allocation looks like for a, b, and out (as above, out would be the focus for the specific error). I think you’re more likely to get useful help with a short, complete example.

Do as you wish of course. Good Luck!

I guess that, coupled with this index calculation:

and then compared to whatever you actually allocated for out in host code, might yield some useful info. (you would need to take into account the storage footprint for store_matrix_sync as well).

@Robert_Crovella

Sorry but what do you mean with "storage footprint of store_matrix_sync"?

Could someone please explain to me in more detail what Invalid __global__ write of size 4 bytes actually means?

Does it mean that the kernel function (or underlying functions) tried to write exactly 4 bytes to a location that didn’t contain any memory?

Could this have something to do with bad setup of blocks & threads from my side? I don’t understand why it doesn’t happen for smaller matrices - interestingly enough the result in out is still correct!

The last error after the invalid global write I get from compute-sanitizer is

========= Program hit unspecified launch failure (error 719) on CUDA API call to cudaDeviceSynchronize.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: [0x7f4a0cb0ccc3]
=========                in /usr/lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame: [0x5580492861d2]
=========                in /main
=========     Host Frame: [0x5580492581b5]
=========                in /main
=========     Host Frame: [0x558049255c3b]
=========                in /main
=========     Host Frame:__libc_start_main [0x7f4a19244c87]
=========                in /lib/x86_64-linux-gnu/libc.so.6
=========     Host Frame: [0x558049254d9a]
=========                in /main

src/kernels.h:110 actually is this line:

for (int i = 0; i < fragmentsToSum; ++i) {

I don’t understand why this would cause an error.

Another thing, if I run

wmma_matrix_mult_bColMajor<<<19845, 1024>>> then I get 25 or 41 or 57 times the global 4 byte write error.

If I run

wmma_matrix_mult_bColMajor<<<19845, 32>>> then I get the error 505 or 657 times.

So it looks like the number of threads I determine has something to do with it which is weird because I don’t handle the threadIdx.x variable in my kernel at all - should I and if so, how?

Also, why does the number of errors change with the exact same setup?

I don’t really know how to determine the number of threads for tensor cores / WMMA. I once read that it utilises 32 threads per block so it should be enough to use 32 threads - shouldn’t it?

Yes, a write, to global memory, four bytes in size. No, there was memory there. But that memory was not part of the memory allocated for data objects belonging to the kernel. For example, if for an array float a[10] the kernel accesses a[10], that’s an out-of-bounds access. There could also be (especially when dereferencing pointers) access to a memory page belonging to some other process, in which case it is a protection fault (akin to a segfault in Linux or GPF in Windows).

If you have used valgrind to check host code before, this kind of report should look familiar. If you haven’t used valgrind before, now would be a good time to become acquainted with it.

I’ve read that valgrind doesn’t work well with CUDA code because it could report false-positives. But I will try it.

In general valgrind is prone to false positives. And it used to be true (I do not know the current status, as I no longer produce software for a living) that it would exhibit an increased propensity to report false positives when used with CUDA applications. I vaguely recall this had to do with internals of the kernel launch mechanism; that could well be a false memory, though.

Nonetheless, valgrind is an indispensable part of the software engineer’s toolbox, and cuda-memcheck, while arguably not quite as sophisticated as valgrind (which has been around longer) provides similar functionality for device code.

My point was that the format used by valgrind to report out-of-bounds accesses is quite similar to the format used by cuda-memcheck to report these.

Valgrind just notes a lot of libcuda memory leaks of size 16 bytes:

https://pastebin.com/egVwJscW

Did I maybe select the wrong architecture? I am using an RTX2080 and compile like this:

$(NVCC) -arch=sm_75 -c $(SRC)/main.cu -I$(CUDNN)/include -o $(BUILD)/main.o

Sorry to everyone.

This was definitely my mistake - I allocated too many blocks which is why the kernel “thought” the matrices are bigger than they actually were.