How to Load 4 Consecutive Values from Shared Memory into uint MultiA for MMA?

In MMA coding, we usually define the matrix A like this:

uint MultiA[2] = { 0 };
half* test1 = reinterpret_cast<half*>(MultiA);

Through the test1 pointer, the two 32-bit words of MultiA can be accessed as 4 consecutive half values.

If I want to read 4 consecutive half values directly from shared memory (smem) and store them into MultiA, how do I perform this conversion correctly?

What is the exact problem you are facing? Do you have conversion code which does not work? Can you show it?


I am loading data from smem, and I have it working now. This is what I wanted to do (I am posting it because I think it is a general question):

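        __align__(8) half test1[4];  // aligned so the array can safely be viewed as two 32-bit words
        uint* MultiA = reinterpret_cast<uint*>(test1);
        // load test1[0], test1[1], test1[2], test1[3] from smem here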

        asm volatile("mma.sync.aligned.m8n8k4.col.col.f16.f16.f16.f16 "
                     "{ %0, %1, %2, %3 },"
                     "{ %4, %5 },"
                     "{ %6, %7 },"
                     "{ %8, %9, %10, %11 };\n"
                     : "=r"(out[0]), "=r"(out[1]), "=r"(out[2]), "=r"(out[3])
                     : "r"(MultiA[0]), "r"(MultiA[1]),
                     "r"(MultiB[0]), "r"(MultiB[1]),
                     "r"(out[0]), "r"(out[1]), "r"(out[2]), "r"(out[3]));
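
extern __shared__ half smem[];  // an unsized shared array must be declared extern

unsigned int multiA_uint[2];

// Option 1: if the smem pointer is aligned to 8 bytes, load all 4 halves in one access
uint2 tmp = *reinterpret_cast<uint2*>(&smem[0]);
memcpy(&multiA_uint[0], &tmp, sizeof(uint2));

// Option 2: if the smem pointer is only aligned to 2 bytes, load element by element
half tmp_h[4];  // renamed so it does not clash with the uint2 tmp above
tmp_h[0] = smem[0];
tmp_h[1] = smem[1];
tmp_h[2] = smem[2];
tmp_h[3] = smem[3];
memcpy(&multiA_uint[0], tmp_h, sizeof(uint2));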
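
For anyone who wants to check which half ends up in which register, here is a small host-side C++ sketch of the same packing. It is only an illustration under a few assumptions: std::uint16_t stands in for the raw 16-bit storage of half, the machine is little-endian (as NVIDIA GPUs and x86 hosts are), and the names half_bits, is_aligned_8, pack4 and pack2_little_endian are made up for this example and are not part of any CUDA API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the 16-bit storage of a CUDA half.
// Only the raw bits matter for packing, so no arithmetic is done on it.
using half_bits = std::uint16_t;

// True when p could be read as one 8-byte vector (the uint2 path above).
inline bool is_aligned_8(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 8 == 0;
}

// Packs 4 consecutive 16-bit values into two 32-bit registers.
// memcpy copies the bytes as-is, so it works for any source alignment.
inline void pack4(const half_bits* src, std::uint32_t (&regs)[2]) {
    assert(src != nullptr);
    std::memcpy(regs, src, 4 * sizeof(half_bits));
}

// The same packing written out by hand for a little-endian target:
// the lower-indexed element occupies the low 16 bits of the register.
inline std::uint32_t pack2_little_endian(half_bits lo, half_bits hi) {
    return static_cast<std::uint32_t>(lo) |
           (static_cast<std::uint32_t>(hi) << 16);
}
```

So with this layout, MultiA[0] carries elements 0 and 1 and MultiA[1] carries elements 2 and 3, with the lower-indexed element in the low 16 bits. On the device, is_aligned_8 corresponds to the check that decides between the uint2 load and the element-by-element load.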