Getting the wrong result from the tensor core example

I'm learning GEMM with tensor cores, and I found the WMMA example in the CUDA C++ Programming Guide.

But the result I get from the tensor core kernel is wrong compared with the result from PyTorch.

Why is that? Is there anything wrong with my code? Thank you!

#include <vector>
#include <stdio.h>
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments for a single 16x16x16 tile computed by one warp.
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
   // Zero the accumulator, load A and B with a leading dimension of 16,
   // multiply-accumulate, and store the result tile in row-major order.
   wmma::fill_fragment(c_frag, 0.0f);
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

void test_gpu(int m, int n, int k) {
        std::vector<__half> a, b;

        a.resize(m * k);
        b.resize(n * k);

        std::vector<float> c;
        c.resize(m * n, 0);

        printf("A:\n");
        for(int i = 0; i < m * k; i++) {
                a[i] = i % 5;
                printf("%.2f\t", (float)a[i]);
                if((i+1)%k == 0) printf("\n");
        }

        printf("B:\n");
        for(int i = 0; i < n * k; i++) {
                b[i] = i % 5 + i % 4;
                printf("%.2f\t", (float)b[i]);
                if((i+1)%n == 0) printf("\n");
        }

        __half * d_a, *d_b;
        float *d_c;
        cudaMalloc(&d_a, sizeof(__half) * m * k);
        cudaMalloc(&d_b, sizeof(__half) * n * k);
        cudaMalloc(&d_c, sizeof(float) * m * n);

        //transpose(a.data(), m, k);
        //transpose(b.data(), k, n);
        cudaMemcpy(d_a, a.data(), sizeof(__half) * m * k, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b.data(), sizeof(__half) * n * k, cudaMemcpyHostToDevice);
        //gemm_cpu(a.data(), b.data(), c.data(), m, n, k);
        wmma_ker<<<1, 32>>>(d_a, d_b, d_c);

        cudaMemcpy(c.data(), d_c, sizeof(float) * m * n, cudaMemcpyDeviceToHost);

        printf("C output:\n");
        for(int i = 0; i < m * n; i++) {
                printf("%.2f\t", (float)c[i]);
                if((i+1)%n == 0) printf("\n");
        }
        printf("\n");
}

int main() {
        test_gpu(16, 16, 16);
        return 0;
}

import torch

a = [i%5 for i in range(0, 16 * 16)]
a = torch.Tensor(a).reshape(-1, 16)

b = [i%5 + i % 4 for i in range(0, 16 * 16)]
b = torch.Tensor(b).reshape(-1, 16)

print(a)
print(b)
print(torch.matmul(a, b))

C++ output (not shown)

Python output (not shown)

change this:

wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
                                                       ^^^

to this:

wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
                                                       ^^^

I note that the example you seem to be referring to actually has col_major for a_frag and row_major for b_frag, just the opposite of what your code shows, so it seems you have already modified these.

This may also be of interest. Those settings affect behavior, as you have now discovered. From here:

The Layout parameter must be specified for matrix_a and matrix_b fragments. row_major or col_major indicate that elements within a matrix row or column are contiguous in memory, respectively.
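
Concretely, your host code fills b element by element so that consecutive indices run along a row, which is exactly what wmma::row_major describes; wmma::col_major tells the load that consecutive elements run down a column, so the kernel was effectively multiplying by B transposed. A quick way to check on the host is a plain reference GEMM like the sketch below (the name gemm_cpu_ref is mine, standing in for the gemm_cpu call that is commented out in your code), assuming A, B and C are all stored row-major:

// Reference GEMM, all matrices row-major: C[m x n] = A[m x k] * B[k x n].
void gemm_cpu_ref(const __half *a, const __half *b, float *c, int m, int n, int k) {
        for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                        float acc = 0.0f;
                        for (int p = 0; p < k; p++)
                                acc += (float)a[i * k + p] * (float)b[p * n + j];
                        c[i * n + j] = acc;
                }
        }
}

Comparing its output element by element against the device result should now show a match.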


Thank you! I made a mistake!!

It works! The result is correct now!

I'm new to tensor core usage, and I notice there are two sets of APIs for tensor cores: the CUDA C++ API and the PTX API.

Why are there two sets?

I tried out the C++ API and found that it is not transparent to users, because the storage layout inside a fragment is unknown; it seems the PTX API does describe how the data is stored across threads?

Thank you!

Also, when loading from and storing to shared memory, there seem to be bank conflicts with the C++ API?

I know little about these topics and hope to learn more, thank you!!!

And I have another question: how can I see tensor core utilization with CLI tools?

I tried ncu --list-metrics, but I couldn't find a metric for tensor cores.

In general, PTX is part of the NVIDIA toolchain. The existence of PTX in addition to CUDA C++ provides a number of capabilities for CUDA development. So it’s not surprising that PTX exists. By design, nearly anything you can do in CUDA C++, you can do in PTX. Specifically with respect to tensorcore usage, PTX provides a superset of what can be done or is “exposed” in CUDA C++.

Correct. There are at least 3 different subsets of instructions in PTX that can access Tensorcore. The wmma instructions are the ones exposed in CUDA C++. They have the opaque load/store setup you are referring to, both at the PTX level and at the CUDA C++ level. PTX also includes a mma instruction, which has the register footprint “exposed”. This is not available in CUDA C++, currently. One reason might be that CUDA C++ doesn’t normally have “registers” in view, although I am not suggesting that is a complete or bulletproof description. I cannot always provide good answers to “why is it that way?” questions.
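
For illustration only (this is my own sketch, not something taken from the programming guide, and the wrapper name is made up), this is roughly what an exposed-register dense mma call looks like as inline PTX for the f16 m16n8k16 shape on sm_80 or newer. Which fraction of each matrix ends up in which register of which thread is spelled out in the PTX ISA documentation for the mma instruction:

#if __CUDA_ARCH__ >= 800
// D = A * B + C for one m16n8k16 f16 tile, with the register operands laid out
// per the PTX ISA. A0..A3 and B0..B1 each pack two __half values into a 32-bit register.
__forceinline__ __device__ void mma_m16n8k16_f32_f16_f16_f32(
        float &D0, float &D1, float &D2, float &D3,
        unsigned A0, unsigned A1, unsigned A2, unsigned A3,
        unsigned B0, unsigned B1,
        float C0, float C1, float C2, float C3)
{
        asm volatile(
                "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
                "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
                : "=f"(D0), "=f"(D1), "=f"(D2), "=f"(D3)
                : "r"(A0), "r"(A1), "r"(A2), "r"(A3),
                  "r"(B0), "r"(B1),
                  "f"(C0), "f"(C1), "f"(C2), "f"(C3));
}
#endif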

There are tensorcore metrics, such as discussed in this blog. You can also get an “overview” of tensorcore “pipe” usage in the nsight compute SOL output in the “compute throughput breakdown” table (depicted in the video at about the 9 minute mark e.g. “SM Pipe Tensor Cycles Active”).
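
For example (metric names vary by GPU architecture and Nsight Compute version, so treat this as a starting point rather than the exact spelling for your setup), on an Ampere-class GPU something like ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active ./your_app reports tensor pipe activity, and piping ncu --query-metrics through grep -i tensor lists which tensor-related metric names your installation actually exposes.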


Thank you! I think the answer is very clear.
Now I need to learn more about the PTX API.

Besides the PTX documentation, have a look at cutlass and either use cutlass for calling the PTX API or imitate how cutlass inlines and invokes the mma instructions.
In the following I describe the second way, i.e. how you can call the mma instructions yourself:

You can do it in a similar way as cutlass: cutlass/include/cutlass/arch at main · NVIDIA/cutlass · GitHub

Look at the mma*.h headers. The asm volatile parts are the important ones; the other code, e.g. the using lines, is mostly cutlass-specific template magic for easier calling and type selection.
The A, B, C and D arrays can also be replaced with a named variable for each element, e.g. A[0] → A0, A[1] → A1, …
Put those assembler calls into C++ inline functions. It is possible for the caller of the C++ function to use the same variable(s) for input and output (C = D).

Example code for a customized call:

#if __CUDA_ARCH__ >= 800
__forceinline__ __device__ void mma_sp_m16n8k32_satfinite_s32_s8_s8_s32(int& D0, int& D1, int& D2, int& D3, const unsigned int A0, const unsigned int A1, const unsigned int B0, const unsigned int B1, const int C0, const int C1, const int C2, const int C3, const unsigned int E)
{
	asm volatile("mma.sp.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite {%0, %1, %2, %3}, {%4, %5}, {%6, %7}, {%8, %9, %10, %11}, {%12};"
		: "=r"(D0), "=r"(D1), "=r"(D2), "=r"(D3)
		: "r"(A0), "r"(A1), "r"(B0), "r"(B1), "r"(C0), "r"(C1), "r"(C2), "r"(C3), "r"(E));
}
#endif

You have to read up in the PTX documentation on which thread holds which matrix element in which of the parameters. Once you get the hang of it, it is really easy to use.

These are deterministic, synchronizing math functions; if the correct data goes in, they reliably give the correct result. So typically there are no lingering bugs; once it works, it works. And it works fast :-)

Thank you so much! I'm learning cutlass this week.

I think its new API, CuTe, is hard to learn.

I wonder why cutlass 3.x released a new API. Anyway, I'm on my way.

When I'm more familiar with PTX and tensor cores, I'll try this.

Thank you for the good suggestion!
