I don’t understand the mechanism of the ldmatrix operation. Take
ldmatrix.sync.aligned.m8n8.x2.trans.shared.b16 {d0, d1}, [addr];
as an example.
Does that mean only threads 0-15 will be active during the operation?
If only threads 0-15 are active during the operation, why can there be bank conflicts during ldmatrix?
Consider this situation:
__shared__ half A[128][16];
// use ldmatrix to load A[idx0][0-15], A[idx1][0-15], ... A[idx7][0-15],
// where idx0, idx1, ... idx7 depend on the input data.
Will this situation cause bank conflicts? If so, is there any way to avoid them?
It is described in the PTX documentation: PTX ISA 8.5
All threads are active. An 8x8 matrix of b16 elements has 64 half elements. Each 16-byte row is distributed to four threads (4 bytes per thread), so 8 rows × 4 threads = 32 threads receive data.
The input row pointers are distributed across the warp. For m8n8.x2, 16 rows are loaded, and the corresponding pointer values are taken from threads 0-15.
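To make the pointer distribution concrete, here is a minimal sketch of an x1 wrapper (my own illustration in the same style as the x2 wrapper further down, not taken from the documentation). For .x1 only the row addresses held by lanes 0-7 are consumed, but the whole warp executes the instruction and every lane receives one 32-bit register holding two b16 elements:

// sketch only: x1 variant; the addresses held by lanes 8-31 are ignored,
// yet all 32 lanes participate and each receives one 32-bit fragment
__device__
void ldmatrix_x1(unsigned int &x, const void* rowPtr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x1.shared.b16 {%0}, [%1];"
        : "=r"(x)
        : "l"(__cvta_generic_to_shared(rowPtr)));
}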
Thanks, but I still have no idea whether there will be bank conflicts in ldmatrix.
If bank conflicts can still occur when using ldmatrix, why do we use this operation? We could manipulate the data directly, which is a more flexible way.
I would assume the ordinary rules for bank conflicts apply here as well. You could set up a simple experiment and profile it.
If the register layout of the matrix matches your use case, ldmatrix can load multiple matrices in a single instruction. With manual loading, you would need n instructions for n matrices.
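For example, with the x4 form all 32 lanes supply row addresses and four 8x8 b16 matrices arrive in registers with a single instruction. A sketch of such a wrapper (my own, in the same style as the x2 wrapper in the code below, not part of the original post):

// sketch: x4 variant, four 8x8 b16 matrices in one instruction
// lanes 0-7 address matrix 0, lanes 8-15 matrix 1, 16-23 matrix 2, 24-31 matrix 3;
// each lane receives four 32-bit registers, one fragment per matrix
__device__
void ldmatrix_x4(unsigned int (&x)[4], const void* rowPtr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];"
        : "=r"(x[0]), "=r"(x[1]), "=r"(x[2]), "=r"(x[3])
        : "l"(__cvta_generic_to_shared(rowPtr)));
}

Loading the same 512 bytes by hand would take several separate shared-memory loads per thread, and the data would still have to end up in the register layout the consuming instruction expects.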
Here is example code for ldmatrix x2. When profiled with ncu, it shows bank conflicts.
//nvcc -g -lineinfo -std=c++17 -arch=native main.cu -o main
#include <iostream>
#include <cuda_fp16.h>
#include <thrust/device_vector.h>
/*
ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];
.shape = {.m8n8};
.num = {.x1, .x2, .x4};
.ss = {.shared{::cta}};
.type = {.b16};
*/
__device__
void ldmatrix_x2(unsigned int (&x)[2], const void* ptr){
    asm volatile("ldmatrix.sync.aligned.m8n8.x2.shared.b16 {%0, %1}, [%2];"
        : "=r"(x[0]), "=r"(x[1])
        : "l"(__cvta_generic_to_shared(ptr)));
}
__global__
void mykernel(const int* loadOffsets, bool print){
    alignas(16) __shared__ half A[128 * 16];
    // fill shared memory with an ascending pattern so the loaded fragments are recognizable
    for(int i = threadIdx.x; i < 128*16; i += blockDim.x){
        A[i] = i;
    }
    __syncthreads();

    const int lane = threadIdx.x % 32;
    unsigned int result[2];
    // each lane passes its own row address; for x2 only the addresses of lanes 0-15 are used
    const int offset = loadOffsets[lane];
    ldmatrix_x2(result, &A[offset]);

    half2 loaded[2];
    memcpy(&loaded[0], &result[0], sizeof(half2) * 2);

    if(print){
        // print the two half elements each lane received, for both 8x8 matrices
        for(int m = 0; m < 2; m++){
            for(int t = 0; t < 32; t++){
                if(lane == t){
                    printf("%4d %4d ", int(loaded[m].x), int(loaded[m].y));
                    if(lane % 4 == 3){
                        printf("\n");
                    }
                }
                __syncwarp();
            }
            if(lane == 0){
                printf("\n");
            }
            __syncwarp();
        }
    }
}
int main(){
    thrust::device_vector<int> d_loadOffsets(32, 0);
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    mykernel<<<1,32>>>(d_loadOffsets.data().get(), true);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 33.393, Bank Conflicts 0
    for(int i = 0; i < 16; i++){
        const int row = i / 2;
        const int matrix = i % 2;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 131.674, Bank Conflicts 98.304
    for(int i = 0; i < 16; i++){
        const int row = i / 2;
        const int matrix = i % 2;
        d_loadOffsets[i] = (4*row) * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 66.488, Bank Conflicts 32.768
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = row * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();

    // Shared Load Matrix: Requests 16.384, Wavefronts 263.070, Bank Conflicts 229.376
    for(int i = 0; i < 16; i++){
        const int row = i % 8;
        const int matrix = i / 8;
        d_loadOffsets[i] = (4*row) * 16 + matrix * 8;
    }
    std::cout << "offsets: ";
    for(int i = 0; i < 16; i++){
        std::cout << d_loadOffsets[i] << " ";
    }
    std::cout << "\n";
    mykernel<<<1024,512>>>(d_loadOffsets.data().get(), false);
    cudaDeviceSynchronize();
}
offsets: 0 8 16 24 32 40 48 56 | 64 72 80 88 96 104 112 120
Shared Load Matrix: Requests 16.384, Wavefronts 33.393, Bank Conflicts 0
offsets: 0 8 64 72 128 136 192 200 | 256 264 320 328 384 392 448 456
Shared Load Matrix: Requests 16.384, Wavefronts 131.674, Bank Conflicts 98.304
offsets: 0 16 32 48 64 80 96 112 | 8 24 40 56 72 88 104 120
Shared Load Matrix: Requests 16.384, Wavefronts 66.488, Bank Conflicts 32.768
offsets: 0 64 128 192 256 320 384 448 | 8 72 136 200 264 328 392 456
Shared Load Matrix: Requests 16.384, Wavefronts 263.070, Bank Conflicts 229.376
Case 3 seems a bit strange to me. In total, the same offsets as in case 1 are accessed. However, it appears that the two sets of rows for the two matrices are processed independently, which doubles the number of wavefronts and introduces bank conflicts.
(The same observation applies to cases 4 and 2.)
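As a rough cross-check of that interpretation (my own sketch using the usual model of 32 banks, each 4 bytes wide, and assuming the two 8-row groups are indeed served separately, as the measurements suggest), one can simply print the bank in which each row address starts. In case 1 each group of eight 16-byte rows starts at banks 0, 4, 8, ..., 28 and covers all 32 banks exactly once, while in case 3 the rows of each group start only at banks 0, 8, 16, 24, so two rows always share the same four banks, which fits the doubled wavefront count and the conflict numbers above:

// host-only sketch: starting bank of each row address, per 8-row group
#include <cstdio>
int main(){
    // offsets in half (2-byte) elements, as printed by the kernel above
    const int case1[16] = {0,8,16,24,32,40,48,56, 64,72,80,88,96,104,112,120};
    const int case3[16] = {0,16,32,48,64,80,96,112, 8,24,40,56,72,88,104,120};
    const int* cases[2] = {case1, case3};
    for(int c = 0; c < 2; c++){
        printf("%s\n", c == 0 ? "case 1" : "case 3");
        for(int matrix = 0; matrix < 2; matrix++){
            printf("  matrix %d row start banks:", matrix);
            for(int r = 0; r < 8; r++){
                const int byteOffset = cases[c][matrix * 8 + r] * 2; // half = 2 bytes
                printf(" %2d", (byteOffset / 4) % 32);               // 32 banks, 4 bytes per bank
            }
            printf("\n");
        }
    }
}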
Are all the offsets you provide to ldmatrix aligned to 16-byte boundaries?