Getting the wrong result from the tensor core example

I'm learning GEMM with tensor cores, and I found the WMMA example in the CUDA C++ Programming Guide.

But the result I get from the tensor core kernel is wrong compared with the result from PyTorch.

Why is that? Is there anything wrong with my code? Thank you!

#include <vector>
#include <stdio.h>
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

__global__ void wmma_ker(half *a, half *b, float *c) {
   // Declare the fragments for a single 16x16x16 tile computed by one warp.
   wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
   wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
   wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
   // Zero the accumulator, load A and B with a leading dimension of 16,
   // multiply-accumulate, and store the result tile in row-major order.
   wmma::fill_fragment(c_frag, 0.0f);
   wmma::load_matrix_sync(a_frag, a, 16);
   wmma::load_matrix_sync(b_frag, b, 16);
   wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
   wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}

void test_gpu(int m, int n, int k) {
        std::vector<__half> a, b;

        a.resize(m * k);
        b.resize(n * k);

        std::vector<float> c;
        c.resize(m * n, 0);

        printf("A:\n");
        for(int i = 0; i < m * k; i++) {
                a[i] = i % 5;
                printf("%.2f\t", (float)a[i]);
                if((i+1)%k == 0) printf("\n");
        }

        printf("B:\n");
        for(int i = 0; i < n * k; i++) {
                b[i] = i % 5 + i % 4;
                printf("%.2f\t", (float)b[i]);
                if((i+1)%n == 0) printf("\n");
        }

        __half * d_a, *d_b;
        float *d_c;
        cudaMalloc(&d_a, sizeof(__half) * m * k);
        cudaMalloc(&d_b, sizeof(__half) * n * k);
        cudaMalloc(&d_c, sizeof(float) * m * n);

        //transpose(a.data(), m, k);
        //transpose(b.data(), k, n);
        cudaMemcpy(d_a, a.data(), sizeof(__half) * m * k, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b.data(), sizeof(__half) * n * k, cudaMemcpyHostToDevice);
        //gemm_cpu(a.data(), b.data(), c.data(), m, n, k);
        wmma_ker<<<1, 32>>>(d_a, d_b, d_c);

        cudaMemcpy(c.data(), d_c, sizeof(float) * m * n, cudaMemcpyDeviceToHost);

        printf("C output:\n");
        for(int i = 0; i < m * n; i++) {
                printf("%.2f\t", (float)c[i]);
                if((i+1)%n == 0) printf("\n");
        }
        printf("\n");
}

int main() {
        test_gpu(16, 16, 16);
        return 0;
}

import torch

a = [i%5 for i in range(0, 16 * 16)]
a = torch.Tensor(a).reshape(-1, 16)

b = [i%5 + i % 4 for i in range(0, 16 * 16)]
b = torch.Tensor(b).reshape(-1, 16)

print(a)
print(b)
print(torch.matmul(a, b))

C++ output (not shown)

Python output (not shown)

change this:

wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
                                                       ^^^

to this:

wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
                                                       ^^^

I note that the example you seem to be referring to actually has col_major for a_frag and row_major for b_frag, just the opposite of what your code shows, so it seems you have already modified these.

This may also be of interest. Those settings affect behavior, as you have now discovered. From here:

The Layout parameter must be specified for matrix_a and matrix_b fragments. row_major or col_major indicate that elements within a matrix row or column are contiguous in memory, respectively.
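
Concretely, your host code fills b element by element so that consecutive indices run along a row, which is exactly what wmma::row_major describes; wmma::col_major tells the load that consecutive elements run down a column, so the kernel was effectively multiplying by B transposed. A quick way to check on the host is a plain reference GEMM like the sketch below (the name gemm_cpu_ref is mine, standing in for the gemm_cpu call that is commented out in your code), assuming A, B and C are all stored row-major:

// Reference GEMM, all matrices row-major: C[m x n] = A[m x k] * B[k x n].
void gemm_cpu_ref(const __half *a, const __half *b, float *c, int m, int n, int k) {
        for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                        float acc = 0.0f;
                        for (int p = 0; p < k; p++)
                                acc += (float)a[i * k + p] * (float)b[p * n + j];
                        c[i * n + j] = acc;
                }
        }
}

Comparing its output element by element against the device result should now show a match.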


Thank you! I made a mistake!!

It works! The result is correct now!

I'm new to tensor core usage, and I notice there are two sets of APIs for tensor cores: the CUDA C++ API and the PTX API.

Why are there two sets?

I tried out the C++ API and found that it is not transparent to users, because the storage layout inside a fragment is unknown; it seems the PTX API does describe how the data is stored across threads?

Thank you!

Also, when loading from and storing to shared memory, there seem to be bank conflicts with the C++ API?

I know little about these topics and hope to learn more, thank you!!!

And I have another question: how can I see tensor core utilization with CLI tools?

I tried ncu --list-metrics, but I couldn't find a metric for tensor cores.

In general, PTX is part of the NVIDIA toolchain. The existence of PTX in addition to CUDA C++ provides a number of capabilities for CUDA development. So it’s not surprising that PTX exists. By design, nearly anything you can do in CUDA C++, you can do in PTX. Specifically with respect to tensorcore usage, PTX provides a superset of what can be done or is “exposed” in CUDA C++.

Correct. There are at least 3 different subsets of instructions in PTX that can access Tensorcore. The wmma instructions are the ones exposed in CUDA C++. They have the opaque load/store setup you are referring to, both at the PTX level and at the CUDA C++ level. PTX also includes a mma instruction, which has the register footprint “exposed”. This is not available in CUDA C++, currently. One reason might be that CUDA C++ doesn’t normally have “registers” in view, although I am not suggesting that is a complete or bulletproof description. I cannot always provide good answers to “why is it that way?” questions.
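
For illustration only (this is my own sketch, not something taken from the programming guide, and the wrapper name is made up), this is roughly what an exposed-register dense mma call looks like as inline PTX for the f16 m16n8k16 shape on sm_80 or newer. Which fraction of each matrix ends up in which register of which thread is spelled out in the PTX ISA documentation for the mma instruction:

#if __CUDA_ARCH__ >= 800
// D = A * B + C for one m16n8k16 f16 tile, with the register operands laid out
// per the PTX ISA. A0..A3 and B0..B1 each pack two __half values into a 32-bit register.
__forceinline__ __device__ void mma_m16n8k16_f32_f16_f16_f32(
        float &D0, float &D1, float &D2, float &D3,
        unsigned A0, unsigned A1, unsigned A2, unsigned A3,
        unsigned B0, unsigned B1,
        float C0, float C1, float C2, float C3)
{
        asm volatile(
                "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
                "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};"
                : "=f"(D0), "=f"(D1), "=f"(D2), "=f"(D3)
                : "r"(A0), "r"(A1), "r"(A2), "r"(A3),
                  "r"(B0), "r"(B1),
                  "f"(C0), "f"(C1), "f"(C2), "f"(C3));
}
#endif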

There are tensorcore metrics, such as discussed in this blog. You can also get an “overview” of tensorcore “pipe” usage in the nsight compute SOL output in the “compute throughput breakdown” table (depicted in the video at about the 9 minute mark e.g. “SM Pipe Tensor Cycles Active”).
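
For example (metric names vary by GPU architecture and Nsight Compute version, so treat this as a starting point rather than the exact spelling for your setup), on an Ampere-class GPU something like ncu --metrics sm__pipe_tensor_cycles_active.avg.pct_of_peak_sustained_active ./your_app reports tensor pipe activity, and piping ncu --query-metrics through grep -i tensor lists which tensor-related metric names your installation actually exposes.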


Thank you! I think the answer is very clear.
Now I need to learn more about the PTX API.

Besides the PTX documentation, have a look at cutlass and either use cutlass for calling the PTX API or imitate how cutlass inlines and invokes the mma instructions.
In the following I describe the second way, i.e. how you can call the mma instructions yourself:

You can do it in a similar way as cutlass: cutlass/include/cutlass/arch at main · NVIDIA/cutlass · GitHub

Look at the mma*.h headers. The asm volatile parts are the important ones; the other code, e.g. the using lines, is mostly cutlass-specific template magic for easier calling and type selection.
The A, B, C and D arrays can also be replaced with a named variable for each element, e.g. A[0] → A0, A[1] → A1, …
Put those assembler calls into C++ inline functions. It is possible for the caller of the C++ function to use the same variable(s) for input and output (C = D).

Example code for a customized call:

#if __CUDA_ARCH__ >= 800
__forceinline__ __device__ void mma_sp_m16n8k32_satfinite_s32_s8_s8_s32(int& D0, int& D1, int& D2, int& D3, const unsigned int A0, const unsigned int A1, const unsigned int B0, const unsigned int B1, const int C0, const int C1, const int C2, const int C3, const unsigned int E)
{
	asm volatile("mma.sp.sync.aligned.m16n8k32.row.col.s32.s8.s8.s32.satfinite {%0, %1, %2, %3}, {%4, %5}, {%6, %7}, {%8, %9, %10, %11}, {%12};"
		: "=r"(D0), "=r"(D1), "=r"(D2), "=r"(D3)
		: "r"(A0), "r"(A1), "r"(B0), "r"(B1), "r"(C0), "r"(C1), "r"(C2), "r"(C3), "r"(E));
}
#endif

You have to read up in the PTX documentation on which thread holds which matrix element in which of the parameters. Once you get the hang of it, it is really easy to use.

These are deterministic, synchronizing math functions; if the correct data goes in, they reliably give the correct result. So typically there are no lingering bugs; once it works, it works. And it works fast :-)

Thank you so much! I'm learning cutlass this week.

I think its new API, CuTe, is hard to learn.

I wonder why cutlass 3.x released a new API. Anyway, I'm on my way.

When I'm more familiar with PTX and tensor cores, I'll try this.

Thank you for the good suggestion!
