Debugging hints required

Hello all,

I have run into a strange problem with my CUDA code for matrix multiplication.
It's an implementation where a block of 64 1D threads computes 64x64 elements, using register tiling for one matrix and shared memory for the other.

The blocks compute correctly except for the last row of the last N-1 blocks, where N is the number of blocks in the x dimension. The difference between two consecutive blocks is always the same: 896.

If, for example, I try 256x256, then:

rreddy78@jetson-nano:~/projects/test$ ./matrixMulTiled8
C[256 x 256]. grid config = (4,4) for block config = (64x1)

Difference between two is: 896
Error greater than tolerance at 255,64: 3712 - 255,127: 3712
Error greater than tolerance at 255,128: 2816 - 255,191: 2816
Error greater than tolerance at 255,192: 1920 - 255,255: 1920

If I use 512x512, then:

rreddy78@jetson-nano:~/projects/test$ ./matrixMulTiled8
C[512 x 512]. grid config = (8,8) for block config = (64x1)

Difference between two is: 896
Error greater than tolerance at 511,64: 8320 - 511,127: 8320
Error greater than tolerance at 511,128: 7424 - 511,191: 7424

Error greater than tolerance at 511,448: 2944 - 511,511: 2944

How should we approach debugging such a problem? The same kernel code works for all other blocks and nothing is hard-coded.

__global__ void matrixMultiplicationKernel(const float *__restrict__ A, const float *__restrict__ B, float *__restrict__ C, const int M_Size, const int W_Size, const int N_Size)
{
    // blockDim.y = 1 and threadIdx.y = 0
    const int ROW = (TILE_WIDTH * blockIdx.y) * blockDim.y + threadIdx.y; // Absolute C row index
    const int COL = blockIdx.x * blockDim.x + threadIdx.x;                // Absolute C column index

    __shared__ float Ads[TILE_WIDTH][TILE_WIDTH];

    register float BRT[TILE_WIDTH];          // B Register Tile
    register float tmpSum[TILE_WIDTH] = {0}; // tmpSum in registers

    for (int tileOffset = 0; tileOffset < W_Size; tileOffset += TILE_WIDTH)
    {
        // Load the B Tile into registers and A Tile into shared memory
        //#pragma unroll
        for (int idx = 0; idx < TILE_WIDTH; idx++)
        {
            BRT[idx] = B[((tileOffset + idx) * N_Size) + COL];
        }

        //#pragma unroll
        for (int idx = 0; idx < TILE_WIDTH; idx++)
        {
            Ads[idx][threadIdx.x] = A[((idx + ROW) * W_Size) + (tileOffset + COL)];
        }

        __syncthreads();

        for (int idx = 0; idx < TILE_WIDTH; idx++)
        {
            //#pragma unroll
            for (int idx2 = 0; idx2 < TILE_WIDTH; idx2++)
                tmpSum[idx] += Ads[idx][idx2] * BRT[idx2];

            __threadfence();
        }

        __syncthreads();
    }

    for (int idx = 0; idx < TILE_WIDTH; idx++)
    {
        C[((ROW + idx) * N_Size) + COL] = tmpSum[idx];
    }
}

void matrixMultiplication(const float *__restrict__ A, const float *__restrict__ B, float *__restrict__ C, const int M_Size, const int W_Size, const int N_Size)
{
    // declare the number of blocks per grid and the number of threads per block
    dim3 threadsPerBlock(TILE_WIDTH, 1);
    dim3 blocksPerGrid(1, 1);

    blocksPerGrid.x = (N_Size + (threadsPerBlock.x - 1)) / threadsPerBlock.x;
    blocksPerGrid.y = (M_Size + (TILE_WIDTH - 1)) / TILE_WIDTH;

    printf("C[%d x %d]. grid config = (%d,%d) for block config = (%dx%d)\n", M_Size, N_Size, blocksPerGrid.x, blocksPerGrid.y, threadsPerBlock.x, threadsPerBlock.y);
    matrixMultiplicationKernel<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, M_Size, W_Size, N_Size);
    checkCuda(cudaDeviceSynchronize());
}

Even if I change from local arrays to shared memory (for the variables tmpSum and BRT), the result is exactly the same.

Here’s the sequence I would follow:

  1. Populate your arrays with a data pattern that makes it easy to understand results. For this I set the A matrix to all 1.0 and the B matrix to 0 1 2 3 0 1 2 3, etc.
  2. Identify the smallest problem size that shows the problem. In my case it appeared to be 128x128 (2x2 blocks).
  3. Select a particular element that is showing an incorrect result. Identify the ROW and COL for that element. In my case, one element that showed an issue was in the last matrix row, at element (ROW + idx) = 127 and COL = 97.
  4. Using (ROW + idx) and COL determined in step 3, set some printf statements in your kernel which print out the sequence of values used to compute that item. You already have the necessary loops and whatnot, so this can basically be a 1-line addition to your kernel code (see the sketch after this list).
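
For step 1, the fill I used was roughly the following (A_h and B_h are just placeholder names here for the host-side arrays before they are copied to the device):

    for (int i = 0; i < M_Size * W_Size; i++) A_h[i] = 1.0f;            // A: all ones
    for (int i = 0; i < W_Size * N_Size; i++) B_h[i] = (float)(i % 4);  // B: 0 1 2 3 0 1 2 3 ...

For step 4, a minimal sketch of that one-line addition, dropped into the innermost accumulation loop of the kernel above (the coordinates are just the ones picked in step 3):

    if ((ROW + idx) == 127 && COL == 97) // the element identified in step 3
        printf("tile %d, k %d: Ads = %f, BRT = %f, partial = %f\n",
               tileOffset, idx2, Ads[idx][idx2], BRT[idx2], tmpSum[idx]);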

This should get you a long way towards discovering the problem. When I did that, I discovered, to my surprise, that the Ads values that were used during the tmpSum accumulation had some values that were not 1.0f (!). Tracking this down further eventually led me to this line of code:

 Ads[idx][threadIdx.x] = A[((idx + ROW) * W_Size) + (tileOffset + COL)];

I’ll stop there and leave it for you unless you want me to explain exactly the issue and solution.

Mentally I think of this approach as “grabbing the tiger by the tail and seeing where it leads you”. Identify an aspect of the result that you know is wrong. Successively work backwards through the calculations that led to that (wrong) element. You can do this with printf. You can do it similarly with a debugger and conditional breakpoints, but either way it will generally involve multiple “passes” as you work backwards until the root cause becomes evident.
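
If the debugger route is preferred, the printf condition above maps directly onto a conditional breakpoint. A rough sketch, assuming the source file is called matrixMulTiled8.cu and the inner-product line is line 65 (both are guesses; adjust to your file), with the code built for device debugging:

    $ nvcc -g -G -o matrixMulTiled8 matrixMulTiled8.cu
    $ cuda-gdb ./matrixMulTiled8
    (cuda-gdb) break matrixMulTiled8.cu:65 if (ROW + idx) == 127 && COL == 97
    (cuda-gdb) run
    (cuda-gdb) print Ads[idx][idx2]
    (cuda-gdb) print BRT[idx2]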

Thank you!
I will use this technique, as cuda-gdb does not work for me…

Corrected the logical error in the line:
Ads[idx][threadIdx.x] = A[((idx + ROW) * W_Size) + (tileOffset + threadIdx.x)];
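
To spell out why that matters: COL already includes blockIdx.x * blockDim.x, so for any block with blockIdx.x > 0 the old expression fetched the A tile from the wrong columns. A quick index check for block (1, 0) with threadIdx.x = 0 and tileOffset = 0 (numbers chosen just for illustration):

    COL           = blockIdx.x * blockDim.x + threadIdx.x = 1 * 64 + 0 = 64
    old A index   = (idx + ROW) * W_Size + (tileOffset + COL)          -> column 64, outside the intended tile
    fixed A index = (idx + ROW) * W_Size + (tileOffset + threadIdx.x)  -> column 0, as intended

Only the B load and the C store should use the absolute column COL; the A tile's position should depend only on tileOffset and threadIdx.x.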