MatrixMul sample

I was testing the matrixMul sample on my laptop. The default dimensions work fine, but the following run fails:

E:\ThinkPad\Documents\Visual Studio 2017\bin\win64\Debug> .\matrixMul.exe -wA=500 -hA=500 -wB=500 -hB=500
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 840M" with compute capability 5.0

MatrixA(500,500), MatrixB(500,500)
Computing result using CUDA Kernel...
done
Failed to synchronize on the stop event (error code unspecified launch failure)!

Is that related to running out of memory? If so, why didn't it explicitly say that?

It’s not an out-of-memory issue.
You may be hitting a WDDM timeout; you should rule out that possibility first.
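For scale (a back-of-the-envelope check, assuming the sample stores single-precision floats): one 500x500 matrix is 500 * 500 * 4 bytes ≈ 1 MB, so A, B, and the result C together need roughly 3 MB of device memory. A genuine out-of-memory condition would normally surface as cudaErrorMemoryAllocation at cudaMalloc() time, not as a failure at an event synchronize.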

If you have modified or disabled the WDDM timeout appropriately, then I would suggest trying CUDA 9.1, if you are not already using it.

I haven’t modified anything. Please let me know what I should check before downloading 9.1. I am now on 9.0.176.

Please google “cuda wddm tdr”.

You’ll find all sorts of information. Personally, I would choose to use the documented method within Nsight VSE to modify the timeout for test purposes:

http://docs.nvidia.com/nsight-visual-studio-edition/Nsight_Visual_Studio_Edition_User_Guide.htm#Timeout_Detection_Recovery.htm
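If you would rather change it outside of Nsight, the same settings are plain registry values, documented by Microsoft under "TDR Registry Keys" (a sketch; adjust the delay to taste, and reboot for the change to take effect):

    rem Raise the TDR delay to 20 seconds (REG_DWORD; the Windows default is 2)
    reg add "HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers" /v TdrDelay /t REG_DWORD /d 20 /f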

So, mine was 2 instead of 10 (the default!), and I changed it to 20.
Now I don’t see that error. Instead, after some seconds, I see a lot of lines like

Error! Matrix[249991]=0.00000000, ref=5.00000000 error term is > 1.000000E-06

and finally it says

Result = FAIL

TDR is enabled. The document says that for multiple GPUs it should be off. Is that related to this error?

In matrix multiplication, the error in each element of the result matrix tends to increase as the matrix dimensions increase, because each element is the result of an increasingly larger number of floating-point operations, meaning more rounding error (and error from other sources, such as subtractive cancellation) is accumulated.

I haven’t checked the example code, but I suspect the error bounds used might be fixed, chosen to be appropriate for the matrix dimensions used by default in the example. By simply increasing the matrix dimensions, you would then be violating the assumptions underlying those error bounds.
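As a sketch of what such a fixed check typically looks like (my reconstruction, not the sample's verified code; h_C, h_ref, and size_C are assumed names): each element is compared against a host-computed reference with a hard-coded tolerance.

    // Sketch of a fixed-tolerance correctness check (assumed, not the
    // sample's exact code). eps would have been tuned for the default
    // 320x320 / 640x320 configuration.
    const double eps = 1.e-6;
    bool correct = true;
    for (int i = 0; i < size_C; i++) {
        double abs_err = fabs(h_C[i] - h_ref[i]);
        double rel_err = abs_err / fabs(h_ref[i]);  // assumes h_ref[i] != 0
        if (rel_err > eps) {
            printf("Error! Matrix[%05d]=%.8f, ref=%.8f error term is > %E\n",
                   i, h_C[i], h_ref[i], eps);
            correct = false;
        }
    }
    printf("Result = %s\n", correct ? "PASS" : "FAIL");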

Each of the CUDA example codes is typically designed to demonstrate one (and only one) specific design principle. They aren’t ready-to-ship industrial-strength pieces of code providing maximum flexibility and robustness. Code with those properties can be pretty complex and hard to understand, making it unsuitable for use as an example.

This is bizarre…
The default size is fine

E:\ThinkPad\Documents\Visual Studio 2017\bin\win64\Debug> .\matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 840M" with compute capability 5.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 1.87 GFlop/s, Time= 69.964 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

But multiplying 400x400 by 400x400 fails with the error I posted in my previous post. The matrix sizes (element counts) are below:

  1. A 320x320 => 102,400
    B 640x320 => 204,800
    TOTAL = 307,200

  2. A 400x400 => 160,000
    B 400x400 => 160,000
    TOTAL = 320,000

So, is that error really related to the error bound being exceeded due to the increased size?

P.S.: I also get the same error with two 500x500 matrices and with two 200x200 matrices!!!

Since the number of error lines was so large that the console scroll-back reached its limit, I ran the following command to see the very first error messages:

E:\ThinkPad\Documents\Visual Studio 2017\bin\win64\Debug> .\matrixMul.exe -wA=200 -hA=200 -wB=200 -hB=200 | more
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 840M" with compute capability 5.0

MatrixA(200,200), MatrixB(200,200)
Computing result using CUDA Kernel...
done
Performance= 1.80 GFlop/s, Time= 8.899 msec, Size= 16000000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Error! Matrix[00000]=149474535093838061819284946944.00000000, ref=2.00000000 error term is > 1.000000E-06
Error! Matrix[00001]=149474535093838061819284946944.00000000, ref=2.00000000 error term is > 1.000000E-06
Error! Matrix[00002]=149474535093838061819284946944.00000000, ref=2.00000000 error term is > 1.000000E-06
....

It seems that the multiplication is done, but the verification fails!

It’s not just the size: the data assigned to the elements of the source matrices, as well as the shape of the source matrices, also governs the overall per-element error.

My working hypothesis here is that the error bound used in the app was chosen to allow for a quick validation of the exact configuration chosen for the example, and was only tested for that case.

In general it is very difficult to predict the maximum per-element error in matrix multiplication. Each element of the result matrix is the result of a lengthy dot product that is typically accumulated piece-wise.
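A quick host-only illustration of that accumulation effect (standalone toy code, unrelated to the sample): summing the same series with a float accumulator versus a double accumulator makes the growth of rounding error visible.

    #include <cstdio>

    int main() {
        // Accumulate ten million terms of 0.1; the exact sum is 1,000,000.
        // float carries only ~7 significant decimal digits, so rounding
        // error builds up over the long accumulation.
        float  sum_f = 0.0f;
        double sum_d = 0.0;
        for (int i = 0; i < 10000000; i++) {
            sum_f += 0.1f;
            sum_d += 0.1;
        }
        printf("float  accumulator: %f\n", sum_f);  // visibly off from 1e6
        printf("double accumulator: %f\n", sum_d);  // very close to 1e6
        return 0;
    }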

[Later:]
The example output you added later shows that the results are not just inaccurate but wildly off. I suggest examining whether your modifications violated some restrictive assumption(s) made by the example code. You could treat it as a chance to learn to debug a CUDA program based on a real-life example that you care about.
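One concrete starting point (assuming the standard CUDA 9 toolchain): run the app under cuda-memcheck, which often turns a vague launch failure or garbage output into a precise report of an out-of-bounds memory access:

    cuda-memcheck .\matrixMul.exe -wA=500 -hA=500 -wB=500 -hB=500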

A noob question regarding what you said about the numbers being chosen for the test configuration only:
I see some hard-coded numbers in the code. Does that mean that, for another size, I have to change them accordingly?
If that is correct, then why does the application accept command-line arguments?!

Possibly. I haven’t looked at the code for this app.

The app as-is may support a variety of argument values, but may restrict the maximum size of the matrices or may place restrictions on the dimensions, e.g. they have to be a multiple of 16 or 32. That is just a hypothesis; I haven’t looked at the app, nor have I tried it.
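To make that hypothesis concrete (a purely hypothetical guard, NOT taken from the sample; dimsA, dimsB, and block_size are the names the sample's host code uses), such a restriction could be enforced like this:

    // Hypothetical argument check (not the sample's code): reject matrix
    // dimensions that are not whole multiples of the tile/block size.
    if (dimsA.x % block_size != 0 || dimsA.y % block_size != 0 ||
        dimsB.x % block_size != 0 || dimsB.y % block_size != 0) {
        fprintf(stderr, "Matrix dimensions must be multiples of %d\n",
                block_size);
        exit(EXIT_FAILURE);
    }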

Since you are a self-described noob, I would claim that you will derive a much greater benefit if you analyze the app by yourself than if I were to analyze it and provide you with the result of that analysis. Working through it, with the CUDA documentation by your side for reference, will likely significantly increase your working knowledge of CUDA and/or matrix multiplication.

Choose matrix sizes that are a multiple of 32.

The code is a shared-memory tiled matrix multiplication, and for clarity it only supports matrices whose dimensions are whole-number multiples of the tile size. The default tile size is 32 (i.e. 32x32).

The code is similar to the example (shared-memory variant) given in the programming guide here:

Programming Guide :: CUDA Toolkit Documentation

That example has the same limitation.

It’s possible, of course, to write a shared-memory tiled matrix multiplication that handles arbitrary matrix dimensions, but the code becomes considerably more complicated in order to handle the partial tiles. The CUDA samples are intended primarily for teaching/exposition of concepts, not as turn-key, production-ready algorithms. Therefore, for clarity, the partial-tile handling is omitted.
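For reference, here is a sketch of the bounds-checked variant (my own illustration, not code from the samples): each thread guards its global loads and its final store, and out-of-range tile entries are zero-filled so they contribute nothing to the dot product.

    // Bounds-checked shared-memory tiled multiply:
    // C(hA x wB) = A(hA x wA) * B(wA x wB), row-major storage.
    template <int TILE>
    __global__ void matrixMulBounded(float *C, const float *A, const float *B,
                                     int hA, int wA, int wB) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;  // row in C (and A)
        int col = blockIdx.x * TILE + threadIdx.x;  // column in C (and B)
        float sum = 0.0f;
        // Walk the K dimension one tile at a time, including the final
        // partial tile when wA is not a multiple of TILE.
        for (int t = 0; t < (wA + TILE - 1) / TILE; t++) {
            int aCol = t * TILE + threadIdx.x;
            int bRow = t * TILE + threadIdx.y;
            // Zero-fill out-of-range elements so they don't affect the sum.
            As[threadIdx.y][threadIdx.x] =
                (row < hA && aCol < wA) ? A[row * wA + aCol] : 0.0f;
            Bs[threadIdx.y][threadIdx.x] =
                (bRow < wA && col < wB) ? B[bRow * wB + col] : 0.0f;
            __syncthreads();
            for (int k = 0; k < TILE; k++)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        if (row < hA && col < wB)  // guard the final store as well
            C[row * wB + col] = sum;
    }

Launched with a grid of ceil(wB/TILE) x ceil(hA/TILE) blocks of TILE x TILE threads, this handles arbitrary dimensions at the cost of the extra branching and the partially idle boundary blocks.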

Thanks. You are right. I was able to multiply 1024x640 by 640x1024.
Some other questions:
1- While the program is running and the GPU load is 100%, I see that the CPU utilization is 25%, which means one of the four CPU cores is fully utilized. Is that normal? I doubt it!

2- The wall-clock time of the program is about 2 minutes for the sizes I mentioned. I also see with GPU-Z that the load is 100% for 2 minutes. However, the program reports

Performance= 1.90 GFlop/s, Time= 441.827 msec, Size= 838860800 Ops, WorkgroupSize= 1024 threads/block

What is that Time, then?! Looking into the code, I see

int nIter = 300;
float msecTotal = 0.0f;
error = cudaEventElapsedTime(&msecTotal, start, stop);  // time between the two events
float msecPerMatrixMul = msecTotal / nIter;             // reported as "Time="

According to the manual, cudaEventElapsedTime() measures the time between the start and stop events, i.e. the kernel itself. So why do I get milliseconds from the program but see minutes on the wall clock?!

441.8 msec = 0.442 seconds per iteration. 300 iterations at 0.442 seconds each therefore take 132.6 seconds in total, or a bit over two minutes. BTW, that’s based solely on the data and code snippets you provided in #13.
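To spell that structure out (reconstructed from the snippets quoted in this thread): the start event is recorded once before the loop and the stop event once after it, so the measured interval covers all 300 launches and is then divided down to a per-multiply figure.

    cudaEventRecord(start, NULL);  // once, before all 300 launches
    for (int j = 0; j < nIter; j++) {
        matrixMulCUDA<32><<< grid, threads >>>(d_C, d_A, d_B, dimsA.x, dimsB.x);
    }
    cudaEventRecord(stop, NULL);   // once, after the last launch
    cudaEventSynchronize(stop);    // block until the GPU has drained the queue
    cudaEventElapsedTime(&msecTotal, start, stop);  // ~132,600 ms in your run
    float msecPerMatrixMul = msecTotal / nIter;     // ~442 ms: the "Time=" figure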

The CPU load you are seeing is likely caused by busy-waiting when the use of synchronous CUDA API calls enforces synchronization between CPU and GPU activity.

You might also want to use this opportunity to familiarize yourself with the CUDA profiler.

With a large run, I get this error

.\matrixMul.exe -wA=4096 -hA=2048 -wB=2048 -hB=4096
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce 840M" with compute capability 5.0

MatrixA(4096,2048), MatrixB(2048,4096)
Computing result using CUDA Kernel...
done
Failed to synchronize on the stop event (error code unspecified launch failure)!

Looking at the code itself, the error message doesn’t tell me much:

    // Execute the kernel
    int nIter = 300;
    for (int j = 0; j < nIter; j++) {
        // Kernel launches are asynchronous: any runtime fault inside the
        // kernel surfaces later, at the next synchronizing call.
        if (block_size == 16)
            matrixMulCUDA<16><<< grid, threads >>>(d_C, d_A, d_B, dimsA.x, dimsB.x);
        else
            matrixMulCUDA<32><<< grid, threads >>>(d_C, d_A, d_B, dimsA.x, dimsB.x);
    }
    // Record the stop event
    error = cudaEventRecord(stop, NULL);
    if (error != cudaSuccess) {
        fprintf(stderr, "Failed to record stop event (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }
    // Wait for the stop event to complete
    error = cudaEventSynchronize(stop);
    if (error != cudaSuccess) {
        // This is where a kernel fault (or a WDDM TDR reset) gets reported.
        fprintf(stderr, "Failed to synchronize on the stop event (error code %s)!\n", cudaGetErrorString(error));
        exit(EXIT_FAILURE);
    }

What are the possibilities?

WDDM TDR timeout.

And the one-core load is probably normal. When a CPU thread is waiting for a GPU task to complete, for example at a synchronization point like cudaDeviceSynchronize(), the default sync behavior is a busy-wait, i.e. the CPU thread spins there.
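If the spinning core bothers you, you can ask the runtime to block instead of busy-wait, trading a little synchronization latency for a mostly idle CPU. A minimal sketch using the documented device flag:

    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        // Must be set before the CUDA context is created, i.e. before the
        // first runtime call that touches the device. Afterwards, blocking
        // calls such as cudaDeviceSynchronize() or cudaEventSynchronize()
        // yield the CPU thread instead of spinning on it.
        cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
        if (err != cudaSuccess)
            fprintf(stderr, "cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
        // ... allocate, launch kernels, and synchronize as usual ...
        return 0;
    }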