Where can I find working examples for the new cuBLASLt library?

serge.weinstock · May 1, 2019, 2:45pm

For this test, I’ve used Visual Studio 2017.

I've created a new project using the "NVIDIA/CUDA 10.1" wizard.
I've copied your code into the project.
I've copied helper_cuda.h and helper_string.h from the CUDA 10.1 samples.
I've added cublasLt.lib to the list of libraries.

cliff.burdick · May 1, 2019, 4:01pm

mnicely, I think I’ve found one of the sources of my confusion with your example. You have two different matrix descriptors, one for the transformed matrix, and one for the raw matrices. Your row major instruction is only on the raw matrices, and not on the transformed ones. This just means that the transformation step not only makes it planar, but also makes it row-major ordering. In the SGEMM example you are able to pass the descriptor with the row major order directly into cublasLtMatMul. In complex you cannot do that, or you get an error. So it seems if I’m going to use row major ordering without the cublasLtMatrixTransform step, I have to ensure that the input matrices are not only planar, but in column-major ordering.

mnicely · May 1, 2019, 4:38pm

Serge,

I was able to reproduce your error and believe this is a bug. I have submitted a ticket and will let you know what I hear.

Cliff,

I can’t say with confidence that that is a true statement. At this point, I can say either I am doing something incorrect or there’s an issue with the library. I will hopefully know more by next week.

serge.weinstock · May 1, 2019, 4:49pm

Thanks mnicely,

I’ve also run your example on an Ubuntu 18.10 gcloud VM with a Tesla P100. I get the same error.

cliff.burdick · May 1, 2019, 5:24pm

mnicely, for what it’s worth, I was able to get the correct results with a large, non-square CGEMM. I do not do the transformation steps in your code, but instead just pass it in pre-formatted and do a swap of GEMM parameters to make it row-major. This seems to work well, and the results are correct. Maybe there’s a bug in the transformation function?
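
The operand swap described above can be illustrated on the host without any CUDA calls. The sketch below is not cliff.burdick's code: gemm_col_major and gemm_row_major are hypothetical names, and the plain loop only stands in for a column-major GEMM such as cublasSgemm or cublasLtMatmul. It relies on the identity C^T = B^T * A^T. A row-major m x k buffer is the same memory as a column-major k x m matrix, so passing B first and A second, with m and n exchanged, should leave the row-major product in C.

```cpp
#include <cassert>

// Reference column-major GEMM without transposes:
// C (m x n) = A (m x k) * B (k x n), with BLAS-style leading dimensions.
static void gemm_col_major(int m, int n, int k,
                           const float *A, int lda,
                           const float *B, int ldb,
                           float *C, int ldc) {
  assert(lda >= m && ldb >= k && ldc >= m);
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      float acc = 0.0f;
      for (int p = 0; p < k; ++p) {
        acc += A[p * lda + i] * B[j * ldb + p];
      }
      C[j * ldc + i] = acc;
    }
  }
}

// Row-major C (m x n) = A (m x k) * B (k x n), computed with the
// column-major routine. Assumption: all three buffers are densely packed,
// so the row strides are k, n and n respectively.
static void gemm_row_major(int m, int n, int k,
                           const float *A, const float *B, float *C) {
  // Column-major view: B is n x k (ld n), A is k x m (ld k), C is n x m (ld n).
  gemm_col_major(n, m, k, B, n, A, k, C, n);
}
```

For a row-major 2 x 3 matrix A = {1, 2, 3, 4, 5, 6} and a row-major 3 x 2 matrix B = {7, 8, 9, 10, 11, 12}, calling gemm_row_major(2, 2, 3, A, B, C) fills C with {58, 64, 139, 154}, which is the row-major product. No data is copied or transposed; only the argument order and the extents change.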

lcxywfe · June 12, 2019, 1:51pm

Looking at the programming guide, in the section Using the cuBLASLt API under subsection 3.2.1, Single Precision GEMM, you’ll see an example that is nearly a drop-in replacement for cublasSgemm. You can start with the CUDA sample in <samples_location>7_CUDALibraries/simpleCUBLAS and replace the cublasSgemm call with the 3.2.1 example. See below…

Note that the replacement requires a few type changes (for example, the handle becomes a cublasLtHandle_t), so the original cublasSgemm call can no longer run as written. For simplicity, workspace=nullptr and workspaceSize=0.

/*
 * Copyright 1993-2017 NVIDIA Corporation.  All rights reserved.
 *
 * NOTICE TO USER:
 *
 * This source code is subject to NVIDIA ownership rights under U.S. and
 * international Copyright laws.  Users and possessors of this source code
 * are hereby granted a nonexclusive, royalty-free license to use this code
 * in individual and commercial software.
 *
 * NVIDIA MAKES NO REPRESENTATION ABOUT THE SUITABILITY OF THIS SOURCE
 * CODE FOR ANY PURPOSE.  IT IS PROVIDED "AS IS" WITHOUT EXPRESS OR
 * IMPLIED WARRANTY OF ANY KIND.  NVIDIA DISCLAIMS ALL WARRANTIES WITH
 * REGARD TO THIS SOURCE CODE, INCLUDING ALL IMPLIED WARRANTIES OF
 * MERCHANTABILITY, NONINFRINGEMENT, AND FITNESS FOR A PARTICULAR PURPOSE.
 * IN NO EVENT SHALL NVIDIA BE LIABLE FOR ANY SPECIAL, INDIRECT, INCIDENTAL,
 * OR CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS
 * OF USE, DATA OR PROFITS,  WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE
 * OR OTHER TORTIOUS ACTION,  ARISING OUT OF OR IN CONNECTION WITH THE USE
 * OR PERFORMANCE OF THIS SOURCE CODE.
 *
 * U.S. Government End Users.   This source code is a "commercial item" as
 * that term is defined at  48 C.F.R. 2.101 (OCT 1995), consisting  of
 * "commercial computer  software"  and "commercial computer software
 * documentation" as such terms are  used in 48 C.F.R. 12.212 (SEPT 1995)
 * and is provided to the U.S. Government only as a commercial end item.
 * Consistent with 48 C.F.R.12.212 and 48 C.F.R. 227.7202-1 through
 * 227.7202-4 (JUNE 1995), all U.S. Government End Users acquire the
 * source code with only those rights set forth herein.
 *
 * Any use of this source code in individual and commercial software must
 * include, in the user documentation and internal comments to the code,
 * the above Disclaimer and U.S. Government End Users Notice.
 */

/* This example demonstrates how to use the CUBLAS library
 * by scaling an array of floating-point values on the device
 * and comparing the result to the same operation performed
 * on the host.
 */

/* Includes, system */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Includes, cuda */
#include <cublasLt.h>
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include "helper_cuda.h"

/* Matrix size */
#define N (4096)

cublasStatus_t
LtSgemm(cublasLtHandle_t ltHandle,
       cublasOperation_t transa,
       cublasOperation_t transb,
       int m,
       int n,
       int k,
       const float *alpha, /* host pointer */
       const float *A,
       int lda,
       const float *B,
       int ldb,
       const float *beta, /* host pointer */
       float *C,
       int ldc,
       void *workspace,
       size_t workspaceSize) {
   cublasStatus_t status = CUBLAS_STATUS_SUCCESS;

   cublasLtMatmulDesc_t operationDesc = NULL;
   cublasLtMatrixLayout_t Adesc = NULL, Bdesc = NULL, Cdesc = NULL;
   cublasLtMatmulPreference_t preference = NULL;

   int returnedResults                             = 0;
   cublasLtMatmulHeuristicResult_t heuristicResult = {};

   // Create operation descriptor; see cublasLtMatmulDescAttributes_t
   // for details about defaults; here we just set the transforms for
   // A and B.
   status = cublasLtMatmulDescCreate(&operationDesc, CUDA_R_32F);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;
   status = cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSA, &transa, sizeof(transa));
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;
   status = cublasLtMatmulDescSetAttribute(operationDesc, CUBLASLT_MATMUL_DESC_TRANSB, &transb, sizeof(transb));
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;

   // Create matrix descriptors. Not setting any extra attributes.
   status = cublasLtMatrixLayoutCreate(
       &Adesc, CUDA_R_32F, transa == CUBLAS_OP_N ? m : k, transa == CUBLAS_OP_N ? k : m, lda);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;
   status = cublasLtMatrixLayoutCreate(
       &Bdesc, CUDA_R_32F, transb == CUBLAS_OP_N ? k : n, transb == CUBLAS_OP_N ? n : k, ldb);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;
   status = cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_32F, m, n, ldc);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;

   // Create preference handle; In general, extra attributes can be
   // used here to disable tensor ops or to make sure algo selected
   // will work with badly aligned A, B, C. However, for simplicity
   // here we assume A,B,C are always well aligned (e.g., directly
   // come from cudaMalloc)
   status = cublasLtMatmulPreferenceCreate(&preference);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;
   status = cublasLtMatmulPreferenceSetAttribute(
       preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &workspaceSize, sizeof(workspaceSize));
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;

   // We just need the best available heuristic to try and run matmul.
   // There is no guarantee that this will work. For example, if A is
   // badly aligned, you can request more (e.g. 32) algos and try to
   // run them one by one until something works.
   status = cublasLtMatmulAlgoGetHeuristic(
       ltHandle, operationDesc, Adesc, Bdesc, Cdesc, Cdesc, preference, 1, &heuristicResult, &returnedResults);
   if (status != CUBLAS_STATUS_SUCCESS) goto CLEANUP;

   if (returnedResults == 0) {
       status = CUBLAS_STATUS_NOT_SUPPORTED;
       goto CLEANUP;
   }

   status = cublasLtMatmul(ltHandle,
                           operationDesc,
                           alpha,
                           A,
                           Adesc,
                           B,
                           Bdesc,
                           beta,
                           C,
                           Cdesc,
                           C,
                           Cdesc,
                           &heuristicResult.algo,
                           workspace,
                           workspaceSize,
                           0);

CLEANUP:
   // Descriptors are no longer needed as all GPU work was already
   // enqueued.
   if (preference) cublasLtMatmulPreferenceDestroy(preference);
   if (Cdesc) cublasLtMatrixLayoutDestroy(Cdesc);
   if (Bdesc) cublasLtMatrixLayoutDestroy(Bdesc);
   if (Adesc) cublasLtMatrixLayoutDestroy(Adesc);
   if (operationDesc) cublasLtMatmulDescDestroy(operationDesc);
   return status == CUBLAS_STATUS_SUCCESS ? static_cast<cublasStatus_t>(0) : static_cast<cublasStatus_t>(1);
}

/* Host implementation of a simple version of sgemm */
static void simple_sgemm(int n, float alpha, const float *A, const float *B,
                         float beta, float *C) {
  int i;
  int j;
  int k;

  for (i = 0; i < n; ++i) {
    for (j = 0; j < n; ++j) {
      float prod = 0;

      for (k = 0; k < n; ++k) {
        prod += A[k * n + i] * B[j * n + k];
      }

      C[j * n + i] = alpha * prod + beta * C[j * n + i];
    }
  }
}

/* Main */
int main(int argc, char **argv) {
  cublasStatus_t status;
  float *h_A;
  float *h_B;
  float *h_C;
  float *h_C_ref;
  float *d_A = 0;
  float *d_B = 0;
  float *d_C = 0;
  float alpha = 1.0f;
  float beta = 0.0f;
  int n2 = N * N;
  int i;
  float error_norm;
  float ref_norm;
  float diff;
  cublasLtHandle_t handle;

  int dev = findCudaDevice(argc, (const char **)argv);

  if (dev == -1) {
    return EXIT_FAILURE;
  }

  /* Initialize CUBLAS */
  printf("simpleCUBLAS test running..\n");

  status = cublasLtCreate(&handle);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! CUBLAS initialization error\n");
    return EXIT_FAILURE;
  }

  /* Allocate host memory for the matrices */
  h_A = reinterpret_cast<float *>(malloc(n2 * sizeof(h_A[0])));

  if (h_A == 0) {
    fprintf(stderr, "!!!! host memory allocation error (A)\n");
    return EXIT_FAILURE;
  }

  h_B = reinterpret_cast<float *>(malloc(n2 * sizeof(h_B[0])));

  if (h_B == 0) {
    fprintf(stderr, "!!!! host memory allocation error (B)\n");
    return EXIT_FAILURE;
  }

  h_C = reinterpret_cast<float *>(malloc(n2 * sizeof(h_C[0])));

  if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
  }

  /* Fill the matrices with test data */
  for (i = 0; i < n2; i++) {
    h_A[i] = rand() / static_cast<float>(RAND_MAX);
    h_B[i] = rand() / static_cast<float>(RAND_MAX);
    h_C[i] = rand() / static_cast<float>(RAND_MAX);
  }

  /* Allocate device memory for the matrices */
  if (cudaMalloc(reinterpret_cast<void **>(&d_A), n2 * sizeof(d_A[0])) !=
      cudaSuccess) {
    fprintf(stderr, "!!!! device memory allocation error (allocate A)\n");
    return EXIT_FAILURE;
  }

  if (cudaMalloc(reinterpret_cast<void **>(&d_B), n2 * sizeof(d_B[0])) !=
      cudaSuccess) {
    fprintf(stderr, "!!!! device memory allocation error (allocate B)\n");
    return EXIT_FAILURE;
  }

  if (cudaMalloc(reinterpret_cast<void **>(&d_C), n2 * sizeof(d_C[0])) !=
      cudaSuccess) {
    fprintf(stderr, "!!!! device memory allocation error (allocate C)\n");
    return EXIT_FAILURE;
  }

  /* Initialize the device matrices with the host matrices */
  status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write A)\n");
    return EXIT_FAILURE;
  }

  status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write B)\n");
    return EXIT_FAILURE;
  }

  status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write C)\n");
    return EXIT_FAILURE;
  }

  /* Performs operation using plain C code */
  simple_sgemm(N, alpha, h_A, h_B, beta, h_C);
  h_C_ref = h_C;


  // ******* REMOVE ********
  /* Performs operation using cublas */
//  status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A,
//                       N, d_B, N, &beta, d_C, N);
  // ******* REMOVE ********

  status = LtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, N, N, N, &alpha, d_A,
                         N, d_B, N, &beta, d_C, N, nullptr, 0);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
  }

  /* Allocate host memory for reading back the result from device memory */
  h_C = reinterpret_cast<float *>(malloc(n2 * sizeof(h_C[0])));

  if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
  }

  /* Read the result back */
  status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (read C)\n");
    return EXIT_FAILURE;
  }

  /* Check result against reference */
  error_norm = 0;
  ref_norm = 0;

  for (i = 0; i < n2; ++i) {
    diff = h_C_ref[i] - h_C[i];
    error_norm += diff * diff;
    ref_norm += h_C_ref[i] * h_C_ref[i];
  }

  error_norm = static_cast<float>(sqrt(static_cast<double>(error_norm)));
  ref_norm = static_cast<float>(sqrt(static_cast<double>(ref_norm)));

  if (fabs(ref_norm) < 1e-7) {
    fprintf(stderr, "!!!! reference norm is 0\n");
    return EXIT_FAILURE;
  }

  /* Memory clean up */
  free(h_A);
  free(h_B);
  free(h_C);
  free(h_C_ref);

  if (cudaFree(d_A) != cudaSuccess) {
    fprintf(stderr, "!!!! memory free error (A)\n");
    return EXIT_FAILURE;
  }

  if (cudaFree(d_B) != cudaSuccess) {
    fprintf(stderr, "!!!! memory free error (B)\n");
    return EXIT_FAILURE;
  }

  if (cudaFree(d_C) != cudaSuccess) {
    fprintf(stderr, "!!!! memory free error (C)\n");
    return EXIT_FAILURE;
  }

  /* Shutdown */
  status = cublasLtDestroy(handle);

  if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! shutdown error (A)\n");
    return EXIT_FAILURE;
  }

  if (error_norm / ref_norm < 1e-3f) {
    printf("simpleCUBLAS test passed.\n");
    exit(EXIT_SUCCESS);
  } else {
    printf("simpleCUBLAS test failed.\n");
    exit(EXIT_FAILURE);
  }
}

I changed N to 8 and get “kernel execution error” when cublasLtMatmul is invoked. My GPU is a 2080 Ti.

serge.weinstock · August 2, 2019, 9:46am

Thanks for the example,

Unfortunately, I don’t think it will work in the case I’m interested in: I already have two matrices in GPU memory that I want to multiply. These two matrices are strided and use row-major storage.

congliangx · December 23, 2019, 2:34am

mnicely:

Serge,

I have confirmed your code works on a GTX1080: CC 6.1. There is no reason to believe it will not work on a GTX1050. If you will humor me, checkout https://github.com/mnicely/cublasLt_examples/tree/master/cublasLt_sgemm and in the Release folder run
make clean; make; ./cublasLt_sgemm
Confirm that it works with row-major and column-major.

Cliff,

I have confirmed, on square matrices, that although the complex half-precision case runs with row-major order, the answer is the same as with column-major order, which is incorrect. I will file a ticket and let you know when I hear back from the developers.

Hi mnicely, the GitHub code has no makefile; the README says to use Eclipse with the NsightEE plugin to build. Is there any code that has a makefile and can be run like the CUDA samples?

serge.weinstock · December 23, 2019, 6:11am

Thanks for the code sample.

It works fine for square matrices but it doesn’t work for non-square matrices.

In the example, changing:

calculate( i, i, i );

to:

calculate( i, i, i+1 );

leads to the error:

CUDA error at cublasLt_sgemm.cu:172 code=7(CUBLAScublasLt_STATUS_INVALID_VALUE) 
"cublasLtMatmulAlgoGetHeuristic( ltHandle, operationDesc, Adesc, Bdesc, Cdesc, Cdesc, preference, 1, &heuristicResult, &returnedResults )"

I’ve read the example code and I think it’s perfectly fine.

I’ve also done the same test using column-order format and in this case, the modified test works.

mnicely · December 24, 2019, 8:03pm

@congliangx,

I’ve restructured the cublasLt_examples repository and added a makefile.

@serge,

I’m not sure what the issue is at the moment. I’ll have to submit a ticket and get back to you.

mnicely · December 27, 2019, 9:44pm

@serge

The issue with row-major and non-squares is a known issue and a fix will be delivered in a future release.

serge.weinstock · December 28, 2019, 7:50am

@mnicely Thanks for reporting the issue

romain.laneuville · March 6, 2020, 3:15pm

Hello, is the issue with non-square row-major matrices fixed?

I have the same requirement as Serge: I have some large matrices already allocated on the GPU in row-major format.

Copying the matrices to a column-major format would cause a performance loss that I must avoid.

Thanks.

mnicely · March 6, 2020, 3:24pm

I’ll double check but the fix should be in the next release.

romain.laneuville · March 9, 2020, 4:20pm

Hello,

I’ve just finished refactoring my program to use the cublasLt library, and I get a CUBLAS_STATUS_INVALID_VALUE when executing cublasLtMatmulAlgoGetHeuristic at line 248.

I trimmed a lot of code to write a minimal reproducible example using the same source code as in my program. I don’t know whether I made a mistake or whether it is related to the issue you mentioned earlier in this thread.

I’m running on Ubuntu 18.04 with an RTX 5000 GPU.

Here is the source code.

#include <iostream>
#include <iomanip>
#include <limits>
#include <vector>
#include <cxxabi.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cublasLt.h>

// ****************************************************************************************************************** //
//                                                    ErrorsCheck.cuh                                                 //
// ****************************************************************************************************************** //

static const char* cublasGetErrorEnum(cublasStatus_t error)
{
    switch (error)
    {
        case CUBLAS_STATUS_SUCCESS:
            return "CUBLAS_STATUS_SUCCESS";

        case CUBLAS_STATUS_NOT_INITIALIZED:
            return "CUBLAS_STATUS_NOT_INITIALIZED";

        case CUBLAS_STATUS_ALLOC_FAILED:
            return "CUBLAS_STATUS_ALLOC_FAILED";

        case CUBLAS_STATUS_INVALID_VALUE:
            return "CUBLAS_STATUS_INVALID_VALUE";

        case CUBLAS_STATUS_ARCH_MISMATCH:
            return "CUBLAS_STATUS_ARCH_MISMATCH";

        case CUBLAS_STATUS_MAPPING_ERROR:
            return "CUBLAS_STATUS_MAPPING_ERROR";

        case CUBLAS_STATUS_EXECUTION_FAILED:
            return "CUBLAS_STATUS_EXECUTION_FAILED";

        case CUBLAS_STATUS_INTERNAL_ERROR:
            return "CUBLAS_STATUS_INTERNAL_ERROR";

        case CUBLAS_STATUS_NOT_SUPPORTED:
            return "CUBLAS_STATUS_NOT_SUPPORTED";

        case CUBLAS_STATUS_LICENSE_ERROR:
            return "CUBLAS_STATUS_LICENSE_ERROR";

        default:
            return "<unknown>";
    }
}

inline void cublasLtCheck(cublasStatus_t status, int iLine, const char *szFile) {
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "CublasLt error " << cublasGetErrorEnum(status) << " at line " << iLine << " in file "
                  << szFile << std::endl;
    }
}

inline void cudaCheck(cudaError_t status, int iLine, const char *szFile) {
    if (status != cudaSuccess) {
        std::cerr << "CUDA error " << cudaGetErrorString(status) << " at line " << iLine << " in file "
                  << szFile << std::endl;
    }
}

#define cublasLtCk(call) cublasLtCheck(call, __LINE__, __FILE__)
#define cudaCk(call) cudaCheck(call, __LINE__, __FILE__)

// ****************************************************************************************************************** //
//                                                    CudaMatrix.cuh                                                  //
// ****************************************************************************************************************** //

#define MB 1048576 // 2^20 bytes

typedef unsigned int uint;

template <typename precision>
struct CudaMatrix {
    // Matrix multiplication GPU workspace that can be used to improve matrix multiplication computation time
    const static void   *matMulWorkspace;
    const static size_t matMulWorkspaceSize;

    CudaMatrix() : width(0), height(0), data(nullptr), cublasHandle(nullptr), cublasLtHandle(nullptr), matrixLayout(nullptr) { };
    CudaMatrix(uint width, uint height, cublasHandle_t cublasHandle = nullptr, cublasLtHandle_t cublasLtHandle = nullptr,
               cublasLtMatrixLayout_t matrixLayout = nullptr) : width(width), height(height), cublasHandle(cublasHandle),
               cublasLtHandle(cublasLtHandle), matrixLayout(matrixLayout)
    {
        cudaCk(cudaMalloc(&data, bytesSize()));

        if (typeid(precision).hash_code() == typeid(uint).hash_code()) {
            cublasLtDataType = CUDA_R_8U;
        } else if (typeid(precision).hash_code() == typeid(int).hash_code()) {
            cublasLtDataType = CUDA_R_8I;
        } else if (typeid(precision).hash_code() == typeid(float).hash_code()) {
            cublasLtDataType = CUDA_R_32F;
        } else if (typeid(precision).hash_code() == typeid(double).hash_code()) {
            cublasLtDataType = CUDA_R_64F;
        } else {
            throw std::runtime_error("The datatype " + std::string(typeid(precision).name()) + " is not handled in CudaMatrix");
        }

        cublasLtCk(cublasLtMatrixLayoutCreate(&matrixLayout, cublasLtDataType, height, width, width));

        if  (matMulWorkspace == nullptr) {
            cudaCk(cudaMalloc(&matMulWorkspace, matMulWorkspaceSize));
        }
    }

    __device__ __host__ uint size() const { return width * height; }

    static void product(const CudaMatrix &A, const CudaMatrix &B, CudaMatrix &C, cublasOperation_t opA, cublasOperation_t opB, cublasLtHandle_t lightHandle);

    void freeResources() { cudaCk(cudaFree(data)); cublasLtCk(cublasLtMatrixLayoutDestroy(matrixLayout)); }
    uint bytesSize() const { return size() * sizeof(precision); }
    void setValuesFromVector(const std::vector<precision> &vector);
    void setValuesFromVector(const std::vector<std::vector<precision>> &vectors);
    void display(const std::string &name = "", uint x = 0, uint y = 0, uint roiWidth = 0, uint roiHeight = 0) const;
    void product(const CudaMatrix &A) { product(*this, A, *this, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle); }

    precision              *data;
    uint                   width,
                           height;
    cublasHandle_t         cublasHandle;
    cublasLtHandle_t       cublasLtHandle;
    cublasLtMatrixLayout_t matrixLayout;
    cudaDataType_t         cublasLtDataType;
};

template <typename precision> const size_t CudaMatrix<precision>::matMulWorkspaceSize = 500 * MB;
template <typename precision> const void*  CudaMatrix<precision>::matMulWorkspace     = nullptr;

// ****************************************************************************************************************** //
//                                                     CudaMatrix.cu                                                  //
// ****************************************************************************************************************** //

/**
 * Display the matrix
 *
 * @tparam precision - The matrix precision
 *
 * @param name - The matrix name
 */
template <typename precision>
void CudaMatrix<precision>::display(const std::string &name, uint x, uint y, uint roiWidth, uint roiHeight) const
{
    precision *hostValues;

    if (roiWidth == 0) roiWidth = width;
    if (roiHeight == 0) roiHeight = height;

    cudaCk(cudaMallocHost(&hostValues, bytesSize()));
    cudaCk(cudaMemcpy(hostValues, data, bytesSize(), cudaMemcpyDeviceToHost));

    std::cout << std::setprecision(std::numeric_limits<precision>::digits10 + 1);

    std::cout << "Matrix " << name << " " << width << " x " << height << " pixels of "
              << abi::__cxa_demangle(typeid(precision).name(), nullptr, nullptr, nullptr)
              << "\n\n";

    for (int i = y; i < y + roiHeight; ++i) {
        std::cout << "{ ";

        for (int j = x; j < x + roiWidth - 1; ++j) {
            std::cout << *(hostValues + i * width + j) << ", ";
        }

        std::cout << *(hostValues + (i + 1) * width - 1) << " }\n";
    }

    std::cout << std::endl;

    cudaCk(cudaFreeHost(hostValues));
}

/**
 * Set the matrix values in device CUDA memory from a host standard 1D vector
 *
 * @tparam precision - The matrix precision
 *
 * @param vector - The values to set the device CUDA memory from
 */
template <typename precision>
void CudaMatrix<precision>::setValuesFromVector(const std::vector<precision> &vector)
{
    cudaCk(cudaMemcpy(data, vector.data(), vector.size() * sizeof(precision), cudaMemcpyHostToDevice));
}

/**
 * Set the matrix values in device CUDA memory from a host standard 2D vector
 *
 * @tparam precision - The matrix precision
 *
 * @param vectors - The values to set the device CUDA memory from
 */
template <typename precision>
void CudaMatrix<precision>::setValuesFromVector(const std::vector<std::vector<precision>> &vectors)
{
    std::vector<precision> buffer;

    buffer.reserve(vectors.size() * vectors[0].size());

    for (const auto &vector : vectors) {
        buffer.insert(buffer.end(), vector.begin(), vector.end());
    }

    setValuesFromVector(buffer);
}

/**
 * Performs the matrix-matrix multiplication C = A x B
 *
 * @see https://docs.nvidia.com/cuda/cublas/index.html#cublasLtMatmul
 *
 * @param A - The left matrix A
 * @param B - The right matrix B
 * @param C - The result matrix C
 * @param opA - Operation to perform on matrix A before multiplication (none, transpose or hermitian)
 * @param opB - Operation to perform on matrix B before multiplication (none, transpose or hermitian)
 * @param lightHandle - cublasLt handle
 */
template<typename precision>
void CudaMatrix<precision>::product(const CudaMatrix           &A,
                                    const CudaMatrix           &B,
                                          CudaMatrix           &C,
                                          cublasOperation_t    opA,
                                          cublasOperation_t    opB,
                                          cublasLtHandle_t     lightHandle
) {
    const precision                 zero               = 0,
                                    one                = 1;
    const int                       requestedAlgoCount = 1;
    cudaStream_t                    stream             = nullptr;
    cublasLtMatmulHeuristicResult_t heuristicResult;
    cublasLtMatmulPreference_t      preference;
    cublasLtMatmulDesc_t            computeDesc;
    int                             returnedAlgoCount;

    // Set matrix pre-operation such as transpose if any
    cublasLtCk(cublasLtMatmulDescCreate(&computeDesc, A.cublasLtDataType));
    cublasLtCk(cublasLtMatmulDescSetAttribute(computeDesc, CUBLASLT_MATMUL_DESC_TRANSA, &opA, sizeof(opA)));
    cublasLtCk(cublasLtMatmulDescSetAttribute(computeDesc, CUBLASLT_MATMUL_DESC_TRANSB, &opB, sizeof(opB)));

    // Get the best algorithm to use
    cublasLtCk(cublasLtMatmulPreferenceCreate(&preference));
    cublasLtCk(cublasLtMatmulPreferenceSetAttribute(preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
               &CudaMatrix::matMulWorkspaceSize, sizeof(CudaMatrix::matMulWorkspaceSize)));
    cublasLtCk(cublasLtMatmulAlgoGetHeuristic(lightHandle, computeDesc, A.matrixLayout, B.matrixLayout,
               C.matrixLayout, C.matrixLayout, preference, requestedAlgoCount, &heuristicResult, &returnedAlgoCount));

    std::cout << "returnedAlgoCount = " << returnedAlgoCount << std::endl;

    // Do the multiplication
    cublasLtCk(cublasLtMatmul(lightHandle, computeDesc, &one, A.data, A.matrixLayout, B.data, B.matrixLayout, &zero,
               C.data, C.matrixLayout, C.data, C.matrixLayout, &heuristicResult.algo,
               &CudaMatrix::matMulWorkspace, CudaMatrix::matMulWorkspaceSize, stream));

    // clean up
    cublasLtCk(cublasLtMatmulPreferenceDestroy(preference));
    cublasLtCk(cublasLtMatmulDescDestroy(computeDesc));
}

// Forward template declarations
template struct CudaMatrix<double>;
template struct CudaMatrix<float>;
template struct CudaMatrix<int>;
template struct CudaMatrix<uint>;

// ****************************************************************************************************************** //
//                                                        main.cu                                                     //
// ****************************************************************************************************************** //

int main(int argc, char const *argv[])
{
    cublasLtHandle_t   cublasLtHandle = nullptr;
    std::vector<float> r1Expect       = { 6, 6, 6, 15, 15, 15, 24, 24, 24 };
    std::vector<float> r2Expect       = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    cublasLtCk(cublasLtCreate(&cublasLtHandle));

    // Declare matrices
    CudaMatrix<float> m1(3, 3);
    CudaMatrix<float> m2(3, 3);
    CudaMatrix<float> m3(3, 3);
    CudaMatrix<float> deviceResult(3, 3);

    // Set device memory values
    m1.setValuesFromVector({ {1, 1, 1}, {1, 1, 1}, {1, 1, 1} });
    m2.setValuesFromVector({ {1, 2, 3}, {4, 5, 6}, {7, 8, 9} });
    m3.setValuesFromVector({ {1, 0, 0}, {0, 1, 0}, {0, 0, 1} });

    // Test results (just showing it here)
    CudaMatrix<float>::product(m1, m2, deviceResult, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    m1.display("m1");
    m2.display("m2");
    deviceResult.display("m1 X m2");

    CudaMatrix<float>::product(m2, m3, deviceResult, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    m2.display("m2");
    m3.display("m3");
    deviceResult.display("m2 X m3");

    // Clean up
    cublasLtCk(cublasLtDestroy(cublasLtHandle));

    m1.freeResources();
    m2.freeResources();
    m3.freeResources();
    deviceResult.freeResources();

    return 0;
}

And if you need a CMakeLists file here it is

cmake_minimum_required(VERSION 3.10)
project(test-cuda)

# ------------------------------------------------ Compilation options ----------------------------------------------- #

# CUDA 10 does not support C++ 17
set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++14")
set(CMAKE_BUILD_TYPE Debug) # Release or Debug

# Include CUDA
find_package(CUDA REQUIRED)
set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -arch=sm_75 -std=c++14 --expt-relaxed-constexpr --expt-extended-lambda")

# ----------------------------------------------------- Constants ---------------------------------------------------- #

if (NOT ${CMAKE_BUILD_TYPE} STREQUAL "Release")
    MESSAGE(STATUS "Debug build")
    add_definitions(-DDEBUG_CUDA)
else ()
    MESSAGE(STATUS "Release build")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
    set(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -O3")
endif ()

# ------------------------------------------------- Source code files ------------------------------------------------ #

# All in one
file(GLOB matmul "cublaslt_mat_mul.cu")

# ---------------------------------------------------- Executables --------------------------------------------------- #

cuda_add_executable(matmulTest ${matmul})

# ---------------------------------------------------- Libraries ----------------------------------------------------- #

# Path to local libraries
file(GLOB CUDAlibs "/usr/lib/x86_64-linux-gnu/libcuda.so" "/usr/lib/x86_64-linux-gnu/libcublas.so" "/usr/lib/x86_64-linux-gnu/libcublasLt.so" "/usr/local/cuda/lib64/libcudart.so")
# Link libraries
target_link_libraries(matmulTest ${CUDAlibs})

romain.laneuville · March 16, 2020, 12:02pm

I made two mistakes:

The matrixLayout was not set properly. I wrote a function that sets it before each multiplication, based on the op applied to the matrix.

Additionally, I had laid out the matrix memory in row-major order, whereas cuBLASLt assumes column-major order by default.

The code now works for square and non-square products with row-major memory.
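For anyone who prefers to keep the default column-major layouts (as Cliff describes above with his swap of GEMM parameters), here is a minimal CPU-only sketch of the idea. It relies on the fact that a row-major m x k buffer is bit-for-bit a column-major k x m buffer holding the transpose, so C = A x B in row-major can be obtained by asking a column-major routine for C^T = B^T x A^T. The function names `gemmColMajor` and `gemmRowMajor` are made up for this illustration and are not part of cuBLASLt.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference column-major GEMM: C (m x n) = A (m x k) * B (k x n).
// All three buffers are column-major with leading dimension equal to their row count.
static void gemmColMajor(std::size_t m, std::size_t n, std::size_t k,
                         const float *A, const float *B, float *C)
{
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[p * m + i] * B[j * k + p];
            }
            C[j * m + i] = acc;
        }
    }
}

// Row-major product C (m x n) = A (m x k) * B (k x n), computed with the
// column-major routine only: swap the operands and swap m with n.
static std::vector<float> gemmRowMajor(std::size_t m, std::size_t n, std::size_t k,
                                       const std::vector<float> &A,
                                       const std::vector<float> &B)
{
    std::vector<float> C(m * n);
    gemmColMajor(n, m, k, B.data(), A.data(), C.data());
    return C;
}
```

With the 2 x 3 and 3 x 2 test matrices used later in this post, `gemmRowMajor(2, 2, 3, {1, 2, 3, 4, 5, 6}, {1, 2, 3, 4, 5, 6})` gives the row-major 2 x 2 result without any explicit transpose or copy.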

cublaslt_mat_mul.cu

#include <iostream>
#include <iomanip>
#include <limits>
#include <stdexcept>
#include <string>
#include <typeinfo>
#include <vector>
#include <cxxabi.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <cublasLt.h>

// ****************************************************************************************************************** //
//                                                    ErrorsCheck.cuh                                                 //
// ****************************************************************************************************************** //

static const char* cublasGetErrorEnum(cublasStatus_t error)
{
    switch (error)
    {
        case CUBLAS_STATUS_SUCCESS:
            return "CUBLAS_STATUS_SUCCESS";

        case CUBLAS_STATUS_NOT_INITIALIZED:
            return "CUBLAS_STATUS_NOT_INITIALIZED";

        case CUBLAS_STATUS_ALLOC_FAILED:
            return "CUBLAS_STATUS_ALLOC_FAILED";

        case CUBLAS_STATUS_INVALID_VALUE:
            return "CUBLAS_STATUS_INVALID_VALUE";

        case CUBLAS_STATUS_ARCH_MISMATCH:
            return "CUBLAS_STATUS_ARCH_MISMATCH";

        case CUBLAS_STATUS_MAPPING_ERROR:
            return "CUBLAS_STATUS_MAPPING_ERROR";

        case CUBLAS_STATUS_EXECUTION_FAILED:
            return "CUBLAS_STATUS_EXECUTION_FAILED";

        case CUBLAS_STATUS_INTERNAL_ERROR:
            return "CUBLAS_STATUS_INTERNAL_ERROR";

        case CUBLAS_STATUS_NOT_SUPPORTED:
            return "CUBLAS_STATUS_NOT_SUPPORTED";

        case CUBLAS_STATUS_LICENSE_ERROR:
            return "CUBLAS_STATUS_LICENSE_ERROR";

        default:
            return "<unknown>";
    }
}

inline void cublasLtCheck(cublasStatus_t status, int iLine, const char *szFile) {
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::cerr << "CublasLt error " << cublasGetErrorEnum(status) << " at line " << iLine << " in file "
                  << szFile << std::endl;
    }
}

inline void cudaCheck(cudaError_t status, int iLine, const char *szFile) {
    if (status != cudaSuccess) {
        std::cerr << "CUDA error " << cudaGetErrorString(status) << " at line " << iLine << " in file "
                  << szFile << std::endl;
    }
}

#define cublasLtCk(call) cublasLtCheck(call, __LINE__, __FILE__)
#define cudaCk(call) cudaCheck(call, __LINE__, __FILE__)

// ****************************************************************************************************************** //
//                                                    CudaMatrix.cuh                                                  //
// ****************************************************************************************************************** //

#define MB 1048576 // 2^20 bytes

typedef unsigned int uint;

template <typename precision>
struct CudaMatrix {
    // Matrix multiplication GPU workspace that can be used to improve matrix multiplication computation time
    const static void   *matMulWorkspace;
    const static size_t matMulWorkspaceSize;

    CudaMatrix() : width(0), height(0), data(nullptr), cublasHandle(nullptr), cublasLtHandle(nullptr), matrixLayout(nullptr) { };
    CudaMatrix(uint width, uint height, cublasHandle_t cublasHandle = nullptr, cublasLtHandle_t cublasLtHandle = nullptr,
               cublasLtMatrixLayout_t matrixLayout = nullptr) : width(width), height(height), cublasHandle(cublasHandle),
               cublasLtHandle(cublasLtHandle), matrixLayout(matrixLayout)
    {
        cudaCk(cudaMalloc(&data, bytesSize()));

        // uint and int are 32-bit types, so map them to the 32-bit CUDA data types
        if (typeid(precision).hash_code() == typeid(uint).hash_code()) {
            cublasLtDataType = CUDA_R_32U;
        } else if (typeid(precision).hash_code() == typeid(int).hash_code()) {
            cublasLtDataType = CUDA_R_32I;
        } else if (typeid(precision).hash_code() == typeid(float).hash_code()) {
            cublasLtDataType = CUDA_R_32F;
        } else if (typeid(precision).hash_code() == typeid(double).hash_code()) {
            cublasLtDataType = CUDA_R_64F;
        } else {
            throw std::runtime_error("The datatype " + std::string(typeid(precision).name()) + " is not handled in CudaMatrix");
        }

        if (matMulWorkspace == nullptr) {
            cudaCk(cudaMalloc(&matMulWorkspace, matMulWorkspaceSize));
        }
    }

    __device__ __host__ uint size() const { return width * height; }

    static void product(CudaMatrix &A, CudaMatrix &B, CudaMatrix &C, cublasOperation_t opA, cublasOperation_t opB, cublasLtHandle_t lightHandle);

    void freeResources() { cudaCk(cudaFree(data)); cublasLtCk(cublasLtMatrixLayoutDestroy(matrixLayout)); }
    void setMatrixLayout(cublasOperation_t op, cublasLtOrder_t matrixOrder = CUBLASLT_ORDER_ROW);
    uint bytesSize() const { return size() * sizeof(precision); }
    void setValuesFromVector(const std::vector<precision> &vector);
    void setValuesFromVector(const std::vector<std::vector<precision>> &vectors);
    void display(const std::string &name = "", uint x = 0, uint y = 0, uint roiWidth = 0, uint roiHeight = 0) const;
    void product(CudaMatrix &A) { product(*this, A, *this, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle); }

    precision              *data;
    uint                   width,
                           height;
    cublasHandle_t         cublasHandle;
    cublasLtHandle_t       cublasLtHandle;
    cublasLtMatrixLayout_t matrixLayout;
    cudaDataType_t         cublasLtDataType;
};

template <typename precision> const size_t CudaMatrix<precision>::matMulWorkspaceSize = 500 * MB;
template <typename precision> const void*  CudaMatrix<precision>::matMulWorkspace     = nullptr;

// ****************************************************************************************************************** //
//                                                     CudaMatrix.cu                                                  //
// ****************************************************************************************************************** //

/**
 * Display the matrix
 *
 * @tparam precision - The matrix precision
 *
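 * @param name - The matrix name
 * @param x - The first column of the region to display (DEFAULT 0)
 * @param y - The first row of the region to display (DEFAULT 0)
 * @param roiWidth - The width of the region to display (DEFAULT 0, meaning the full width)
 * @param roiHeight - The height of the region to display (DEFAULT 0, meaning the full height)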
 */
template <typename precision>
void CudaMatrix<precision>::display(const std::string &name, uint x, uint y, uint roiWidth, uint roiHeight) const
{
    precision *hostValues;

    // A region size of 0 means "use the full matrix dimension"
    if (roiWidth == 0) {
        roiWidth = width;
    }

    if (roiHeight == 0) {
        roiHeight = height;
    }

    cudaCk(cudaMallocHost(&hostValues, bytesSize()));
    cudaCk(cudaMemcpy(hostValues, data, bytesSize(), cudaMemcpyDeviceToHost));

    std::cout << std::setprecision(std::numeric_limits<precision>::digits10 + 1);

    std::cout << "Matrix " << name << " " << width << " x " << height << " pixels of "
              << abi::__cxa_demangle(typeid(precision).name(), nullptr, nullptr, nullptr)
              << "\n\n";

    for (uint i = y; i < y + roiHeight; ++i) {
        std::cout << "{ ";

        for (uint j = x; j < x + roiWidth - 1; ++j) {
            std::cout << *(hostValues + i * width + j) << ", ";
        }

        // Last element of the region of interest on this row
        std::cout << *(hostValues + i * width + x + roiWidth - 1) << " }\n";
    }

    std::cout << std::endl;

    cudaCk(cudaFreeHost(hostValues));
}

/**
 * Set the matrix values in device CUDA memory from a host standard 1D vector
 *
 * @tparam precision - The matrix precision
 *
 * @param vector - The values to set the device CUDA memory from
 */
template <typename precision>
void CudaMatrix<precision>::setValuesFromVector(const std::vector<precision> &vector)
{
    cudaCk(cudaMemcpy(data, vector.data(), vector.size() * sizeof(precision), cudaMemcpyHostToDevice));
}

/**
 * Set the matrix values in device CUDA memory from a host standard 2D vector
 *
 * @tparam precision - The matrix precision
 *
 * @param vectors - The values to set the device CUDA memory from
 */
template <typename precision>
void CudaMatrix<precision>::setValuesFromVector(const std::vector<std::vector<precision>> &vectors)
{
    std::vector<precision> buffer;

    buffer.reserve(vectors.size() * vectors[0].size());

    for (const auto &vector : vectors) {
        buffer.insert(buffer.end(), vector.begin(), vector.end());
    }

    setValuesFromVector(buffer);
}

/**
 * Set the matrix layout before matrix multiplication with row major memory by default
 *
 * @tparam precision - The matrix precision
 *
 * @param op - Operation to perform on matrix before multiplication (none, transpose or hermitian)
 * @param matrixOrder - The matrix memory order (column or row DEFAULT row)
 */
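template<typename precision>
void CudaMatrix<precision>::setMatrixLayout(cublasOperation_t op, cublasLtOrder_t matrixOrder)
{
    const uint m = (op == CUBLAS_OP_N ? height : width),
               n = (op == CUBLAS_OP_N ? width : height);
    // The leading dimension is the stride between rows (row major) or between columns (column major).
    // Only CUBLAS_OP_N is exercised in main below.
    const uint ld = (matrixOrder == CUBLASLT_ORDER_ROW ? n : m);

    // Release the layout created by a previous call so it is not leaked
    if (matrixLayout != nullptr) {
        cublasLtCk(cublasLtMatrixLayoutDestroy(matrixLayout));
        matrixLayout = nullptr;
    }

    cublasLtCk(cublasLtMatrixLayoutCreate(&matrixLayout, cublasLtDataType, m, n, ld));
    cublasLtCk(cublasLtMatrixLayoutSetAttribute(matrixLayout, CUBLASLT_MATRIX_LAYOUT_ORDER, &matrixOrder, sizeof(matrixOrder)));
}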

/**
 * Performs the matrix-matrix multiplication C = A x B
 *
 * @see https://docs.nvidia.com/cuda/cublas/index.html#cublasLtMatmul
 *
 * @param A - The left matrix A
 * @param B - The right matrix B
 * @param C - The result matrix C
 * @param opA - Operation to perform on matrix A before multiplication (none, transpose or hermitian)
 * @param opB - Operation to perform on matrix B before multiplication (none, transpose or hermitian)
 * @param lightHandle - cublasLt handle
 */
template<typename precision>
void CudaMatrix<precision>::product(CudaMatrix           &A,
                                    CudaMatrix           &B,
                                    CudaMatrix           &C,
                                    cublasOperation_t    opA,
                                    cublasOperation_t    opB,
                                    cublasLtHandle_t     lightHandle
) {
    const precision                 zero               = 0,
                                    one                = 1;
    const int                       requestedAlgoCount = 1;
    cudaStream_t                    stream             = nullptr;
    cublasLtMatmulHeuristicResult_t heuristicResult;
    cublasLtMatmulPreference_t      preference;
    cublasLtMatmulDesc_t            computeDesc;
    int                             returnedAlgoCount;

    // Set matrix pre-operation such as transpose if any
    cublasLtCk(cublasLtMatmulDescCreate(&computeDesc, A.cublasLtDataType));
    cublasLtCk(cublasLtMatmulDescSetAttribute(computeDesc, CUBLASLT_MATMUL_DESC_TRANSA, &opA, sizeof(opA)));
    cublasLtCk(cublasLtMatmulDescSetAttribute(computeDesc, CUBLASLT_MATMUL_DESC_TRANSB, &opB, sizeof(opB)));

    // Set matrices layout
    A.setMatrixLayout(opA);
    B.setMatrixLayout(opB);
    C.setMatrixLayout(CUBLAS_OP_N);

    // Get the best algorithm to use
    cublasLtCk(cublasLtMatmulPreferenceCreate(&preference));
    cublasLtCk(cublasLtMatmulPreferenceSetAttribute(preference, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
               &CudaMatrix::matMulWorkspaceSize, sizeof(CudaMatrix::matMulWorkspaceSize)));
    cublasLtCk(cublasLtMatmulAlgoGetHeuristic(lightHandle, computeDesc, A.matrixLayout, B.matrixLayout,
               C.matrixLayout, C.matrixLayout, preference, requestedAlgoCount, &heuristicResult, &returnedAlgoCount));

    // Do the multiplication. The workspace argument must be the device pointer itself, not the address of the
    // host variable that stores it
    cublasLtCk(cublasLtMatmul(lightHandle, computeDesc, &one, A.data, A.matrixLayout, B.data, B.matrixLayout, &zero,
               C.data, C.matrixLayout, C.data, C.matrixLayout, &heuristicResult.algo,
               const_cast<void *>(CudaMatrix::matMulWorkspace), CudaMatrix::matMulWorkspaceSize, stream));

    // clean up
    cublasLtCk(cublasLtMatmulPreferenceDestroy(preference));
    cublasLtCk(cublasLtMatmulDescDestroy(computeDesc));
}

// Explicit template instantiations
template struct CudaMatrix<double>;
template struct CudaMatrix<float>;
template struct CudaMatrix<int>;
template struct CudaMatrix<uint>;

// ****************************************************************************************************************** //
//                                                        main.cu                                                     //
// ****************************************************************************************************************** //

int main(int argc, char const *argv[])
{
    cublasLtHandle_t   cublasLtHandle = nullptr;
    std::vector<float> r1Expect       = { 6, 6, 6, 15, 15, 15, 24, 24, 24 };
    std::vector<float> r2Expect       = { 1, 2, 3, 4, 5, 6, 7, 8, 9 };

    cublasLtCk(cublasLtCreate(&cublasLtHandle));

    // Declare matrices
    CudaMatrix<float> m1(3, 3);
    CudaMatrix<float> m2(3, 3);
    CudaMatrix<float> m3(3, 3);
    CudaMatrix<float> m4(3, 2);
    CudaMatrix<float> m5(2, 3);
    CudaMatrix<float> deviceResult_2_2(2, 2);
    CudaMatrix<float> deviceResult_3_3(3, 3);

    // Set device memory values
    m1.setValuesFromVector({ {1, 1, 1}, {1, 1, 1}, {1, 1, 1} });
    m2.setValuesFromVector({ {1, 2, 3}, {4, 5, 6}, {7, 8, 9} });
    m3.setValuesFromVector({ {1, 0, 0}, {0, 1, 0}, {0, 0, 1} });
    m4.setValuesFromVector({ {1, 2, 3}, {4, 5, 6} });
    m5.setValuesFromVector({ {1, 2}, { 3, 4 }, { 5 , 6 } });

    // Test results (just showing it here)
    CudaMatrix<float>::product(m1, m2, deviceResult_3_3, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    deviceResult_3_3.display("m1 X m2");

    CudaMatrix<float>::product(m2, m3, deviceResult_3_3, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    deviceResult_3_3.display("m2 X m3");

    // m4 holds 2 rows of 3 values and m5 holds 3 rows of 2 values, so m4 X m5 is 2 x 2 and m5 X m4 is 3 x 3
    CudaMatrix<float>::product(m4, m5, deviceResult_2_2, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    deviceResult_2_2.display("m4 X m5");

    CudaMatrix<float>::product(m5, m4, deviceResult_3_3, CUBLAS_OP_N, CUBLAS_OP_N, cublasLtHandle);

    deviceResult_3_3.display("m5 X m4");

    // Clean up
    cublasLtCk(cublasLtDestroy(cublasLtHandle));

    m1.freeResources();
    m2.freeResources();
    m3.freeResources();
    m4.freeResources();
    m5.freeResources();
    deviceResult_2_2.freeResources();
    deviceResult_3_3.freeResources();

    return 0;
}
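The main above only displays the results. If you would rather check them automatically, a host-side reference along these lines might help. This is only a sketch under a few assumptions: the data is row-major as in the code above, the GPU result has already been copied back into a `std::vector<float>` (for example with `cudaMemcpy`, the same way `display` does), and the names `referenceProduct` and `almostEqual` are hypothetical helpers that do not exist in the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive row-major reference: C (rows x cols) = A (rows x inner) * B (inner x cols)
std::vector<float> referenceProduct(const std::vector<float> &a, const std::vector<float> &b,
                                    std::size_t rows, std::size_t inner, std::size_t cols)
{
    std::vector<float> c(rows * cols, 0.0f);

    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t p = 0; p < inner; ++p) {
            for (std::size_t j = 0; j < cols; ++j) {
                c[i * cols + j] += a[i * inner + p] * b[p * cols + j];
            }
        }
    }

    return c;
}

// Element-wise comparison with a relative tolerance, since GPU and CPU
// floating-point sums are not guaranteed to match bit for bit
bool almostEqual(const std::vector<float> &x, const std::vector<float> &y, float tolerance = 1e-5f)
{
    if (x.size() != y.size()) {
        return false;
    }

    for (std::size_t i = 0; i < x.size(); ++i) {
        const float scale = std::max(1.0f, std::fabs(y[i]));

        if (std::fabs(x[i] - y[i]) > tolerance * scale) {
            return false;
        }
    }

    return true;
}
```

The idea would be to call `referenceProduct` with the same host vectors passed to `setValuesFromVector`, then compare its output against the values copied back from `deviceResult_3_3.data` or `deviceResult_2_2.data` using `almostEqual`.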