I am experiencing some strange effects with my kernel function:
#include <mma.h>
using namespace nvcuda;

#define TENSOR_SIZE 16  // WMMA tile size; all matrix dimensions must be multiples of 16

/*
 * Matrix A = M x N (row-major), B = N x K (col-major), OUT = M x K (row-major)
 */
__global__ void wmma_matrix_mult_bColMajor(int N, int K, const half *a, const half *b, half *out) {
    // Declare the fragments
    wmma::fragment<wmma::matrix_a, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, TENSOR_SIZE, TENSOR_SIZE, TENSOR_SIZE, half> c_frag;

    // Each block handles one 16x16 tile of OUT, indexed linearly via blockIdx.x
    const int outRowFragmentIdx = blockIdx.x / (K / TENSOR_SIZE);
    const int outColFragmentIdx = blockIdx.x % (K / TENSOR_SIZE);
    const int fragmentsToSum = N / TENSOR_SIZE;

    // Initialize the output to zero
    wmma::fill_fragment(c_frag, 0.0f);

    for (int i = 0; i < fragmentsToSum; ++i) {
        const int aRowFragmentIdx = outRowFragmentIdx;
        const int aColFragmentIdx = i;
        const int bRowFragmentIdx = i;
        const int bColFragmentIdx = outColFragmentIdx;
        // Load the inputs
        wmma::load_matrix_sync(a_frag, &a[aRowFragmentIdx * TENSOR_SIZE * N + aColFragmentIdx * TENSOR_SIZE], N);
        wmma::load_matrix_sync(b_frag, &b[bColFragmentIdx * TENSOR_SIZE * N + bRowFragmentIdx * TENSOR_SIZE], N);
        // Perform the matrix multiplication
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }

    // Store the output
    wmma::store_matrix_sync(&out[outRowFragmentIdx * TENSOR_SIZE * K + outColFragmentIdx * TENSOR_SIZE], c_frag, K, wmma::mem_row_major);
}
This multiplies two matrices A and B (all dimensions have to be multiples of 16).
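For context, the intended launch is roughly one block per 16x16 output tile with a single 32-thread warp per block. This is a simplified sketch, not my exact host code; the helper name launch_wmma_mult is just for illustration:

// Simplified host-side launch sketch (illustrative only, not my real host code)
void launch_wmma_mult(int M, int N, int K,
                      const half *d_a, const half *d_b, half *d_out) {
    // OUT is M x K, so there are (M / 16) * (K / 16) tiles, indexed linearly by blockIdx.x
    const int numTiles = (M / TENSOR_SIZE) * (K / TENSOR_SIZE);
    wmma_matrix_mult_bColMajor<<<numTiles, 32>>>(N, K, d_a, d_b, d_out);
}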
The function works, for example, with sizes up to 16x80 multiplied by 3600x80.
But for 16x80 multiplied by 15376x80, run via compute-sanitizer ./main,
I get 37 errors like this:
Invalid global write of size 4 bytes
========= at 0x1270 in kernels.h:110:wmma_matrix_mult_bColMajor(int,int,__half const *,__half const *,__half *)
========= by thread (191,0,0) in block (1945,0,0)
========= Address 0x7f3a269259cc is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:cuLaunchKernel [0x7f3a5eb355d8]
========= in /usr/lib/x86_64-linux-gnu/libcuda.so.1
========= Host Frame: [0x55ad63c962cb]
With 16x80 multiplied by 63504x80 I get 25 of those errors.
I am wondering how this can be caused by my code, because I don't write to any memory myself - the wmma functions do. And a 4-byte write sounds similar to the 4x4 matrix multiplications the Tensor Cores are doing internally?
Does anyone here know how I can get more detailed debug messages, or even why this happens?
Any hints / tips are greatly appreciated.