Comparing matmul performance with and without GPU

I am using the code matrixMul.cu provided in the NVIDIA Corporation/CUDA Samples/v8.0/0_Simple/ directory. The code was compiled on my Windows 10 Enterprise 64-bit laptop, which has a Quadro K2000M GPU. The tool I am using is MS Visual Studio Community 2015 v. 14.0.2543.01 Update 3.

The nvcc command line reads:

Driver API (NVCC Compilation Type is .cubin, .gpu, or .ptx)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" --use-local-env --cl-version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -I./ -I../../common/inc --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -o x64/Release/%(Filename)%(Extension).obj "%(FullPath)"

Runtime API (NVCC Compilation Type is hybrid object or .c file)

set CUDAFE_FLAGS=--sdk_dir "C:\Program Files (x86)\Windows Kits\8.1"
"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" --use-local-env --cl-version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64" -I./ -I../../common/inc --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static -DWIN32 -Xcompiler "/EHsc /nologo /FS /Zi /MT " -o x64/Release/%(Filename)%(Extension).obj "%(FullPath)"

I then run the matrixMul Release x64 build via Debug → Start Without Debugging and get the following output:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: “Quadro K2000M” with compute capability 3.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel…
done
Performance= 46.60 GFlop/s, Time= 2.813 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

My questions are:

  1. Can I compile and run the same code (how?) so that it runs on the CPU only, so that I can compare the performance?
    • If this is not possible, how do I modify the source code so that the GPU is not activated?

No, it’s not possible simply by recompilation to run this code on the CPU. Furthermore, the code output itself suggests that it is not a good vehicle for performance comparison.

Thanks for the reply. Where can I get simple code that I can compile/run with and without a GPU? Ideally I would need the plain C code and then the version including CUDA instructions.

You should be able to write a small application that calls the DGEMM of a host-side library and cublasDgemm() of CUBLAS and compare the timing of each. It is possible that someone has already shared such code on the internet; Google is your friend.

I have been able to compile and run the following code calling dgemm; the performance I see is roughly 18 GFLOPS (double precision), depending on the matrix size. With a single-precision version I get ~45 GFLOPS, which is approximately the same as the matrixMul code that comes with the toolkit.

But I need to compare with a CPU-only version of the code: while cublas.lib is available in the CUDA toolkit, I have not identified a BLAS library to use for the test.
The CUDA code was compiled as follows:
nvcc foo.cu cublas.lib

// System includes
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA runtime and the (legacy) CUBLAS API
#include <cuda_runtime.h>
#include "cublas.h"

int main() {
    const int N = 1024;
    const int nIter = 100;
    unsigned int mem_size_C = N * N * sizeof(double);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const double valB = 0.01;

    // fill host arrays
    double *h_A = (double*)calloc(N * N, sizeof(double));
    double *h_B = (double*)calloc(N * N, sizeof(double));
    double *h_C = (double*)calloc(N * N, sizeof(double));
    for (int n = 0; n < N * N; ++n) {
        h_A[n] = 1.0;
        h_B[n] = valB;
    }

    /* now do the cublas version */
    double *cublas_A, *cublas_B, *cublas_C;
    cublasInit();

    cublasAlloc(N * N, sizeof(double), (void**)&cublas_A);
    cublasAlloc(N * N, sizeof(double), (void**)&cublas_B);
    cublasAlloc(N * N, sizeof(double), (void**)&cublas_C);

    cublasSetMatrix(N, N, sizeof(double), h_A, N, cublas_A, N);
    cublasSetMatrix(N, N, sizeof(double), h_B, N, cublas_B, N);

    double alpha = 1.0, beta = 0.0;
    cudaEventRecord(start, NULL);
    for (int loop = 0; loop < nIter; ++loop)
        cublasDgemm('N', 'N', N, N, N,
                    alpha, cublas_A, N,
                    cublas_B, N,
                    beta, cublas_C, N);
    cudaEventRecord(stop, NULL);
    cudaEventSynchronize(stop);

    float msecTotal = 0.0f;
    cudaEventElapsedTime(&msecTotal, start, stop);
    float msecPerMatrixMul = msecTotal / nIter;
    /* 2*N^3 floating-point operations per N x N matrix multiply */
    double flopsPerMatrixMul = 2.0 * N * N * N;
    double gigaFlops = (flopsPerMatrixMul * 1.0e-9) / (msecPerMatrixMul / 1000.0);
    printf("Performance= %.2f GFlop/s, Time= %.3f msec\n",
           gigaFlops, msecPerMatrixMul);

    cudaMemcpy(h_C, cublas_C, mem_size_C, cudaMemcpyDeviceToHost);

    // test relative error by the formula
    //     |<x, y>_cpu - <x,y>_gpu| / <|x|, |y|>  < eps
    bool correct = true;
    double eps = 1.e-6;  // machine zero

    for (int i = 0; i < N * N; i++) {
        double abs_err = fabs(h_C[i] - (N * valB));
        double dot_length = N;
        double abs_val = fabs(h_C[i]);
        double rel_err = abs_err / abs_val / dot_length;

        if (rel_err > eps) {
            printf("Error! Matrix[%05d]=%.8f, ref=%.8f error term is > %E\n",
                   i, h_C[i], N * valB, eps);
            correct = false;
        }
    }

    printf("%s\n", correct ? "Result = PASS" : "Result = FAIL");

    free(h_A);
    free(h_B);
    free(h_C);
    cublasFree(cublas_A);
    cublasFree(cublas_B);
    cublasFree(cublas_C);
    cublasShutdown();

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}

Try AMD ACML or Intel MKL.

Why are you running the basic slow version of matrix multiplication when there are much better implementations available? Even in the CUDA SDK there is the cuBLAS version.

Even on my two-year-old laptop with a GTX 980M I get over 1 teraflop:

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v7.5\bin\win64\Release>matrixMulCUBLAS.exe
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GeForce GTX 980M" with compute capability 5.2

MatrixA(640,480), MatrixB(480,320), MatrixC(640,320)
Computing result using CUBLAS...done.
Performance= 1136.52 GFlop/s, Time= 0.173 msec, Size= 196608000 Ops
Computing result using host CPU...done.
Comparing CUBLAS Matrix Multiply with CPU results: PASS

And cuBLAS Sgemm is not even the fastest Sgemm available: Scott Gray has been able to get over 8 teraflops on a Maxwell Titan X, and probably well over that with Pascal.