GTX 660 and Nano performance drop-off after sustained matrix multiplies

I modified the matrixMulCUBLAS sample code to go for longer than 30 iterations and to report elapsed host clock time. With CUDA 10.2 on a GTX 660 I found that after 300 iterations the effective iterations/second rate started to drop off, going from 100,000 iterations/second at 300 iterations down to 4,200 iterations/second by 4,000 iterations. I did a similar test on a Jetson Nano: it held consistent performance until about 900 iterations and then fell off a cliff. Throughout this, the CPU performance remains consistent, though much slower of course. I’ve also monitored CPU and GPU clock speeds and temperatures, which look solid throughout (no evidence of throttling). Is there some effect of sustained computation that kicks in for CUDA applications and produces results like this?

cuda-benchmark

100,000 iterations per second is 10 microseconds per iteration. That is approximately the launch latency of a typical CUDA kernel. A matrix multiplication operation of any appreciable size is not going to be finished in 10 microseconds, so your data immediately becomes suspect in this respect.

My guess is you are experiencing a transition from an asynchronous launch queue being “not full” to being “full”. For some reason this topic has surfaced a number of times recently; here and here are recent discussions. This is just a guess, of course, as you have provided no code.

The “long term” rate is probably more reflective of what the GPU can actually sustain from a computational perspective - ~4000/sec for the 660 and ~1000/sec for the Nano, corresponding to actual kernel durations of roughly 250 µs and 1 ms.

You may want to learn to use a GPU profiler. My guess is that if you used a profiler, you would observe that every kernel duration is on the order of 250us on the 660, from first to last. You’re just witnessing the effects of a non-infinite asynchronous launch queue.
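To illustrate, here is a minimal stand-alone sketch (my own toy kernel, not your cublasSgemm loop) that contrasts the host time measured before and after synchronization:

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 1.0001f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int nIter = 4000;
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < nIter; i++)
        busyKernel<<<(n + 255) / 256, 256>>>(d_x, n);    // launches are asynchronous
    auto enqueued = std::chrono::steady_clock::now();    // mostly measures how fast launches can be queued
    cudaDeviceSynchronize();                             // block until every queued kernel has finished
    auto end = std::chrono::steady_clock::now();         // this timestamp reflects the actual GPU work

    printf("enqueue-only: %lld us   with sync: %lld us\n",
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(enqueued - begin).count(),
           (long long)std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count());
    cudaFree(d_x);
    return 0;
}

The first timestamp stays tiny until the launch queue fills; the second grows linearly with the number of kernels, which is the behavior you are describing.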

Thank you for your reply. As I mentioned, I’m using the matrixMulCUBLAS sample code that comes with the CUDA 10.2 installation. It includes a “warm-up” call to the kernel before starting the iterations, which I thought was there to pre-load the kernel before the start of the timing test. Your observation about the asynchronous calls does catch my attention though. I’ll add forced synchronization after the loop to make sure the work is really done before checking the finish time.

Thank you. The synchronization was the problem. I needed to put the check on the host clock finish time after the synchronization line in the sample code.

[image: table of measured GPU and CPU iteration rates]

If I understand the table correctly, the throughput on the GPU implementation is on the order of 1000x to 2000x the throughput of the CPU implementation. Even considering a single-threaded non-SIMD CPU implementation, that seems questionable.

Am I misinterpreting the data? If not, what explains the huge performance discrepancy?

Those are the results (iteration count over delta-time) for a single stream with no sync or h2d/d2h transfers between each iteration. I imagine that if it had uploads and downloads for each iteration it would drop quite a bit and be more similar to the sort of overall performance improvement that folks see in useful applications. But as an upper bound in pure computational improvement, does that really seem too high versus a single cpu core?
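Something like this per iteration is what I had in mind (untested sketch, reusing the sample’s variable names such as d_A, h_A, and mem_size_A):

for (int j = 0; j < nIter; j++)
{
    // hypothetical variant: upload the inputs and download the result every iteration
    checkCudaErrors(cudaMemcpy(d_A, h_A, mem_size_A, cudaMemcpyHostToDevice));
    checkCudaErrors(cudaMemcpy(d_B, h_B, mem_size_B, cudaMemcpyHostToDevice));
    checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));
    // the blocking device-to-host copy also serializes each iteration
    checkCudaErrors(cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost));
}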

Given that the GTX 660 was a middle-class Kepler-based GPU with a speed of light of only 2 FP32 TFLOPS, I am having a hard time imagining how a 2350x performance difference relative to an x86-64 CPU would come about. A particularly slow CPU? A naive implementation of GEMM without cache-blocking?

For practical comparisons I would expect roughly comparable configurations: A middle-class GPU with a performance library like CUBLAS compared to a middle-class CPU of similar vintage with a performance library like MKL.
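For instance, a fairer CPU baseline would call an optimized BLAS through the standard CBLAS interface (sketch only; MKL and OpenBLAS both provide cblas_sgemm, the matrix names here are mine):

#include <cblas.h>

// C = alpha * A * B + beta * C for row-major n x n single-precision matrices
void gemm_cpu(const float *A, const float *B, float *C, int n)
{
    const float alpha = 1.0f;
    const float beta  = 0.0f;
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, alpha, A, n, B, n, beta, C, n);
}

Those library implementations are cache-blocked, SIMD-vectorized, and multi-threaded, which is what makes the comparison meaningful.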

It’s fun having these discussions without code to look at.

I don’t really know how OP modified the code, and English-text descriptions leave me unsatisfied. Putting that aside, the matrixMulCUBLAS sample code uses a naive CPU matrix multiply for comparison.

Interesting. I admit I have never looked at this particular example code. I assume like other example codes provided by NVIDIA it comes with the caveat “not to be used for benchmarking purposes”?

Given that GPU-accelerated codes typically have a 2x-10x performance advantage over well-optimized CPU codes that use SIMD and multiple cores, I have always felt that performance comparisons that set up a strawman just to knock it down do a disservice to the further adoption of GPUs across industries. They are a form of misinformation, basically, that ultimately causes disappointment and backlash.

In any event, even without code to look at, a performance difference in excess of 2000x seemed just implausible, and if this were my project, the observation would cause me to think deeply and examine carefully what is going on.

yes

Furthermore, that particular unmodified sample code doesn’t print out any CPU performance measurement. The purpose of that routine is for numerical comparison/verification.

Again, I don’t know exactly what OP did. It might be that the reported host results have nothing to do with matrix multiplication; I don’t know.

Here is the modified section of the matrixMulCUBLAS.cpp code. And yes, there is a disclaimer about using it for benchmarking. The example code Makefile has no obvious optimization flags, so perhaps the host code is unduly slow because of that? (See the compile-line sketch after the code.)

// execute the kernel
int nIters[] = {1, 30, 100, 300, 400, 500, 600, 900, 1200, 8000};
for (int i = 0; i < (int)(sizeof(nIters) / sizeof(nIters[0])); i++)   // one pass per entry in nIters
{
	int nIter = nIters[i];
	printf( "nIter = %d\n", nIter);
	long long int gpu_nsec; 

	// CUBLAS version 2.0
	{
	    const float alpha = 1.0f;
	    const float beta  = 0.0f;
	    cublasHandle_t handle;
	    cudaEvent_t start, stop;

	    checkCudaErrors(cublasCreate(&handle));

	    //Perform warmup operation with cublas
	    {
			printf( "Warmup: ");
			std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
			checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));
			std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
			std::cout << "Time difference = " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << "[µs]"  << " = " << std::chrono::duration_cast<std::chrono::nanoseconds> (end - begin).count() << "[ns]" << std::endl;
	    }
	    // Allocate CUDA events that we'll use for timing
	    checkCudaErrors(cudaEventCreate(&start));
	    checkCudaErrors(cudaEventCreate(&stop));

	    // Record the start event
	    checkCudaErrors(cudaEventRecord(start, NULL));
	    printf( "GPU   : ");
	    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
	    for (int j = 0; j < nIter; j++)
	    {
	        //note cublas is column primary!
	        //need to transpose the order
	        // *************************************************************************************
	        checkCudaErrors(cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, matrix_size.uiWB, matrix_size.uiHA, matrix_size.uiWA, &alpha, d_B, matrix_size.uiWB, d_A, matrix_size.uiWA, &beta, d_C, matrix_size.uiWB));
	    }

	    // Record the stop event
	    checkCudaErrors(cudaEventRecord(stop, NULL));

	    // Wait for the stop event to complete
	    checkCudaErrors(cudaEventSynchronize(stop));
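	    // note: the host end time below is read only after this synchronization,
	    // so all queued cublasSgemm calls are included in the measurement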

	     std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
	    gpu_nsec = std::chrono::duration_cast<std::chrono::nanoseconds> (end - begin).count();
	    std::cout << "Time difference = " << gpu_nsec / 1000 << "[µs]" << " = " << gpu_nsec << "[ns]" << std::endl;

	    float msecTotal = 0.0f;
	    checkCudaErrors(cudaEventElapsedTime(&msecTotal, start, stop));

	    // copy result from device to host
	    checkCudaErrors(cudaMemcpy(h_CUBLAS, d_C, mem_size_C, cudaMemcpyDeviceToHost));

	   // Destroy the handle
	    checkCudaErrors(cublasDestroy(handle));
	}

	// compute reference solution
	printf( "CPU   : ");
	float *reference = (float *)malloc(mem_size_C);
	std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
	for (int j = 0; j < nIter; j++)
	{
	    fprintf(stderr,  ".");
	    // *************************************************************************************
	    matrixMulCPU(reference, h_A, h_B, matrix_size.uiHA, matrix_size.uiWA, matrix_size.uiWB);
	}
	std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
	long long int cpu_nsec = std::chrono::duration_cast<std::chrono::nanoseconds> (end - begin).count();
	std::cout << "Time difference = " << cpu_nsec / 1000 << "[µs]" << " = " << cpu_nsec << "[ns]" << std::endl;

	// check result (CUBLAS)
	resCUBLAS = sdkCompareL2fe(reference, h_CUBLAS, size_C, 1.0e-6f);

	if (resCUBLAS != true)
	{
	    printDiff(reference, h_CUBLAS, matrix_size.uiWC, matrix_size.uiHC, 100, 1.0e-5f);
	}
	auto speed_up = (float) cpu_nsec / gpu_nsec;
	std::cout << "Speed-up is " << speed_up << std::endl;

	printf("Comparing CUBLAS Matrix Multiply with CPU results: %s\n", (true == resCUBLAS) ? "PASS" : "FAIL");
	    free( reference);

}

printf("\nNOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.\n");

And pre-summarized data:

Since this is my first foray into GPU acceleration, I’m starting out with some basic assessments to understand what to expect. I appreciate that you are offering your insights.

And regarding the CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 48
Model name: AMD A10-7800 Radeon R7, 12 Compute Cores 4C+8G
Stepping: 1
Frequency boost: enabled
CPU MHz: 3698.709
CPU max MHz: 3500.0000
CPU min MHz: 1400.0000
BogoMIPS: 6987.08
Virtualization: AMD-V
L1d cache: 32 KiB
L1i cache: 192 KiB
L2 cache: 4 MiB

NVIDIA put this output in the sample app (and it is there for a reason):

NOTE: The CUDA Samples are not meant for performance measurements.

As Robert Crovella pointed out, the host-based reference implementation in the sample app computes the matrix product in the most naive way possible. In addition, looking at the code of matrixMulCPU() one notices that it accumulates each dot product as a double, even though the matrices comprise float elements. It does this to ensure an accurate reference result, which drops performance even further.
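From memory, the reference routine is essentially of this shape (paraphrased, not the verbatim sample source):

void matrixMulCPU(float *C, const float *A, const float *B,
                  unsigned int hA, unsigned int wA, unsigned int wB)
{
    // naive triple loop; the double accumulator ensures an accurate reference result
    for (unsigned int i = 0; i < hA; ++i)
        for (unsigned int j = 0; j < wB; ++j)
        {
            double sum = 0.0;
            for (unsigned int k = 0; k < wA; ++k)
                sum += (double)A[i * wA + k] * (double)B[k * wB + j];
            C[i * wB + j] = (float)sum;
        }
}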

On my Windows workstation, with CUBLAS 11.1 and a fully optimized release build, I observe a performance difference of about 1430x for the default case selected by the app, with CUBLAS taking 0.084 msec on a Quadro RTX 4000 and the CPU (Xeon W-2133) taking 120.1 msec.

Based on that, the performance numbers you are seeing are in the expected range, but the performance comparison is also entirely meaningless, comparing apples not to oranges but to carrots.

The project will be to recode an application written with OpenCV to use CUDA, cross-compiled for the Jetson Nano. As a first step, this exercise has served its purpose of demonstrating a comparison between C++ code and functionally matching CUDA code, though as you point out, the naïve C++ code is certainly not putting its best foot forward as a baseline for any comparison. I’m not terribly concerned about the absolute performance of this NVIDIA sample, as the real comparison will come from the rewritten application code. I’m presently getting about 15 fps for the desktop application and I expect the Jetson Nano CPU version to be worse. It seems attainable to get 60 fps on the Jetson Nano with CUDA.

If you have done a roofline analysis of your current code, you might be able to make a pretty accurate estimate of the performance of a GPU-accelerated implementation based on the “speeds and feeds” of the target platform.
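The roofline bound itself is simple; plug in the published peak FLOP/s and DRAM bandwidth of the Jetson Nano:

attainable FLOP/s = min( peak FLOP/s, arithmetic intensity [FLOP/byte] x memory bandwidth [byte/s] )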

Note that the performance characteristics of an optimized GEMM could be very different from those of an image processing task, in that the former is compute limited, while the latter may be (in part or entirely) limited by memory bandwidth. It depends on the compute intensity of the image processing.
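As a rough back-of-the-envelope illustration for an n x n SGEMM:

FLOPs        ~ 2 n^3
bytes moved  ~ 3 x 4 n^2       (read A and B, write C once)
intensity    ~ n/6 FLOP/byte   -> compute-bound for n in the hundreds or beyond

whereas a simple per-pixel filter that performs only a handful of FLOPs per 4-byte pixel read and written sits near 1 FLOP/byte and is therefore likely to be bandwidth-bound.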