Bug in gpuGetMaxGflopsDeviceId for CUDA Toolkit 10.0

The function gpuGetMaxGflopsDeviceId in the CUDA Toolkit 10.0 helper_cuda.h has a problem.

It does NOT pick the fastest device, unlike the version in CUDA Toolkit 9.2.

Tracing through this in Toolkit 10.0 shows the following:

inline int gpuGetMaxGflopsDeviceId()

DEVICE 0
deviceProp = {name=0x00000089f117ede0 “Quadro P2000” uuid={bytes=0x00000089f117eee0 ":Ä}\x1d¥þ¶vÌÀÚÿ—Õõ… } luid=…}
compute_perf = 1516032000
sm_per_multiproc = 128
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1480500

DEVICE 1
deviceProp = {name=0x00000089f117ede0 “GeForce GTX 690” uuid={bytes=0x00000089f117eee0 "új¤\x5\\tpÕˆgˆqÊk-\x6… } …}
compute_perf = 1565952000
sm_per_multiproc = 192
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1019500

DEVICE 2
deviceProp = {name=0x000000fb4dcfeb50 “GeForce GTX 690” uuid={bytes=0x000000fb4dcfec50 "ráS­vœXƒ\x1c"ÇHCd2Î… } …}
compute_perf = 1565952000
sm_per_multiproc = 192
deviceProp.multiProcessorCount = 8
deviceProp.clockRate = 1019500

max_perf_device = 1
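
For reference, the compute_perf values in this trace are exactly multiProcessorCount * cores-per-SM * clockRate (kHz), so the helper appears to rank devices by theoretical single-precision throughput alone. Here is a minimal sketch of ranking logic that reproduces the traced numbers (a reconstruction for illustration, not the literal helper_cuda.h source; _ConvertSMVer2Cores() is the cores-per-SM lookup from helper_cuda.h):

    // Reconstruction of the ranking apparently used by gpuGetMaxGflopsDeviceId():
    // pick the device with the largest multiProcessorCount * coresPerSM * clockRate.
    // Note that TCC vs. WDDM driver mode is never consulted.
    inline int maxGflopsDeviceIdSketch()
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);

        int maxPerfDevice = 0;
        unsigned long long maxComputePerf = 0;

        for (int dev = 0; dev < deviceCount; ++dev) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, dev);

            int smPerMultiproc = _ConvertSMVer2Cores(prop.major, prop.minor);

            // Quadro P2000:          8 * 128 * 1480500 = 1516032000
            // GeForce GTX 690 (1/2): 8 * 192 * 1019500 = 1565952000
            unsigned long long computePerf =
                (unsigned long long)prop.multiProcessorCount *
                smPerMultiproc * prop.clockRate;

            if (computePerf > maxComputePerf) {
                maxComputePerf = computePerf;
                maxPerfDevice  = dev;
            }
        }
        return maxPerfDevice;
    }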

In CUDA Toolkit 9.2 it picked Device 0, which is correct.
In CUDA Toolkit 10.0 it picked Device 1, which is incorrect. Firstly, Device 1 is NOT in TCC mode while Device 0 is. Secondly, the compute performance of a GeForce GTX 690 is NOT the same as that of a Quadro P2000.

As a result, by default the samples will run on the wrong GPU.

Apologies, duplicate post

I believe helper_cuda.h contains convenience functions for the private use of the CUDA samples; these functions do not represent a supported API.

That said, the data reported for the Quadro P2000 in the log does not seem correct. I have a Quadro P2000 here and see about 3470 GFLOPS single precision, 108.4 GFLOPS double precision, at an operating clock of 1721 MHz reported by nvidia-smi. From the data you show it seems the helper function assumed or measured (?) a lower operating clock of 1480 MHz. That would scale performance down to 2984 GFLOPS single precision.

According to Wikipedia, a GTX 690 has two GPUs with 2811 GFLOPS single precision each at 915 MHz. However, the log above shows that the function uses 1019 MHz. Performance would therefore scale to 3130 GFLOPS single precision per GPU, slightly higher than the Quadro P2000.

So based on the frequencies used by the helper function, a GTX 690 would beat the Quadro P2000 by a tiny bit in single-precision GFLOPS, and therefore the GTX 690 should be selected, as the function considers only theoretical single-precision GFLOPS, not overall performance. The two parts are close in their respective GFLOPS rates, and which one comes out ahead depends on the exact operating clocks.
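
For anyone who wants to redo that back-of-the-envelope scaling, here it is as a tiny program (the rated GFLOPS figures and clocks are the approximate numbers quoted above):

    // Scale the rated single-precision GFLOPS by the clocks the helper reported.
    #include <stdio.h>

    int main (void)
    {
        // Quadro P2000: ~3470 GFLOPS rated at 1721 MHz, helper saw 1480 MHz
        double p2000  = 3470.0 * 1480.0 / 1721.0;   // ~2984 GFLOPS
        // GTX 690: ~2811 GFLOPS per GPU rated at 915 MHz, helper saw 1019 MHz
        double gtx690 = 2811.0 * 1019.0 / 915.0;    // ~3130 GFLOPS per GPU
        printf ("Quadro P2000 scaled:    %.0f GFLOPS\n", p2000);
        printf ("GTX 690 scaled (1 GPU): %.0f GFLOPS\n", gtx690);
        return 0;
    }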

Consider filing a bug with NVIDIA. If I am correct in my assumption that this is not a supported API function, the priority to address issues in it will likely be low.

[Later:]

Looking carefully at all the clocks listed by nvidia-smi for my Quadro P2000, I don’t see 1480 MHz or anything near it listed anywhere. Nor can I find that frequency in any specifications I searched for on the internet. Assuming that clock rate is actually measured somewhere, it would seem like an unusually low clock rate for this GPU. Under significant load (~67W), my Quadro P2000 heats up to more than 80 degrees C and the fan speed goes to 99%, but the clock rate never drops below 1600 MHz.

You may want to check whether you have applied an application clock setting to your Quadro P2000, or whether it is insufficiently cooled (e.g. blocked airflow or heated air from other components at the intake).
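
For reference, the nvidia-smi invocations to inspect (and, if one has been applied, reset) the application clocks would be roughly the following, where <gpu_index> is whatever index nvidia-smi reports for the P2000:

nvidia-smi -q -d CLOCK -i <gpu_index>
nvidia-smi --reset-applications-clocks -i <gpu_index>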

Hmmm, you have me concerned now. The card is well cooled in a big loosely packed tower with plenty of fans.

This is what I see from nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 411.31                 Driver Version: 411.31                     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 690    WDDM  | 00000000:05:00.0 N/A |                  N/A |
| 30%   35C    P8    N/A /  N/A |    576MiB /  2048MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 690    WDDM  | 00000000:06:00.0 N/A |                  N/A |
| 30%   33C    P8    N/A /  N/A |    576MiB /  2048MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro P2000       TCC   | 00000000:07:00.0 Off |                  N/A |
| 44%   30C    P8     4W /  75W |      0MiB /  5050MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0                    Not Supported                                       |
|    1                    Not Supported                                       |
+-----------------------------------------------------------------------------+

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
GeForce GTX 690, 00000000:05:00.0, 80.04.1E.00.90
GeForce GTX 690, 00000000:06:00.0, 80.04.1E.00.91
Quadro P2000, 00000000:07:00.0, 86.06.3F.00.30

For the P2000 (nvidia-smi -q -d CLOCK)

GPU 00000000:07:00.0
    Clocks
        Graphics                    : 139 MHz
        SM                          : 139 MHz
        Memory                      : 405 MHz
        Video                       : 544 MHz
    Applications Clocks
        Graphics                    : 1075 MHz
        Memory                      : 3504 MHz
    Default Applications Clocks
        Graphics                    : 1075 MHz
        Memory                      : 3504 MHz
    Max Clocks
        Graphics                    : 1721 MHz
        SM                          : 1721 MHz
        Memory                      : 3504 MHz
        Video                       : 1556 MHz
    Max Customer Boost Clocks
        Graphics                    : 1721 MHz
    SM Clock Samples
        Duration                    : 544.20 sec
        Number of Samples           : 100
        Max                         : 1721 MHz
        Min                         : 139 MHz
        Avg                         : 780 MHz
    Memory Clock Samples
        Duration                    : 545.63 sec
        Number of Samples           : 100
        Max                         : 3504 MHz
        Min                         : 405 MHz
        Avg                         : 1902 MHz
    Clock Policy
        Auto Boost                  : N/A
        Auto Boost Default          : N/A

Can we do a comparison please? What nvidia-smi command should I be looking at?

Your data shows that the Quadro P2000 is reaching the maximum clock of 1721 MHz at least some of the time, so I see nothing to be concerned about.

I don’t know where the clock rate used by gpuGetMaxGflopsDeviceId() is coming from. I doubt it is based on a database of GPUs; judging by the number of NVIDIA entries in the Folding@Home GPU whitelist, such a database would have to be pretty substantial. So my assumption is that the function determines an actual clock rate somehow, either by measuring it or by retrieving it from the NVIDIA Management Library (NVML), which serves as the basis for nvidia-smi.
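
If it helps to pin down where that clock number comes from, one could print the clock rate the CUDA runtime reports next to the current SM clock from NVML, along these lines (just a sketch; CUDA and NVML do not necessarily enumerate devices in the same order, and the program must be linked against nvml.lib / -lnvidia-ml):

    // Sketch: compare the CUDA runtime's reported clock (prop.clockRate, in kHz)
    // with the instantaneous SM clock from NVML (in MHz). Device indices may not
    // match between the two APIs: CUDA's default order is fastest-first, while
    // NVML enumerates by PCI bus id.
    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <nvml.h>

    int main (void)
    {
        int deviceCount = 0;
        cudaGetDeviceCount(&deviceCount);
        nvmlInit();

        for (int i = 0; i < deviceCount; i++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);

            unsigned int smClockMHz = 0;
            nvmlDevice_t handle;
            if (nvmlDeviceGetHandleByIndex(i, &handle) == NVML_SUCCESS) {
                nvmlDeviceGetClockInfo(handle, NVML_CLOCK_SM, &smClockMHz);
            }
            printf ("CUDA device %d: %-20s clockRate = %d kHz  NVML SM clock = %u MHz\n",
                    i, prop.name, prop.clockRate, smClockMHz);
        }
        nvmlShutdown();
        return 0;
    }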

GPU clocks are very dynamic with modern clocking regimes, so it seems possible and plausible that gpuGetMaxGflopsDeviceId() may grab the clock rate at an “inopportune” time, leading it to determine a slower clock of 1480 MHz that hardly ever occurs when running actual CUDA applications.

Since clocking regimes can change from driver version to driver version, a different clock value may be observed under different CUDA versions. The typical clock-frequency trajectory I would expect a GPU to show under a steady CUDA load is this: the clocks start at some safe default when the CUDA application starts, ramp up in many small steps during the first 30 seconds (or so) of operation, and may then decline again depending on GPU temperature and other factors.

While running a particular Folding@Home workload (project 11710), I get the following output from nvidia-smi on my Quadro P2000 (power draw 65W, Fan 97%, Temp 82 C):

The duplicate post is here:
[url]https://devtalk.nvidia.com/default/topic/1042098/bug-in-gpugetmaxgflopsdeviceid-for-cuda-toolkit-10-0/#5285926[/url]

I pointed to a previous reply by njuffa on the subject some time ago.

Thanks saulocpp. I did read that one. While I take your point that helper_cuda.h is for convenience, it does internally call API functions to determine performance, and without accurate performance data the reason to use a GPU is somewhat moot :-)

Maybe you can make your own measurement function with this “zero copy” code (which I copied from somewhere here in devtalk):

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>  // for memset() in processArgs()
    #include <math.h>    // for fabs() and log() used to initialize mintime

    #define ZCOPY_THREADS  128
    #define ZCOPY_DEFLEN   10000000
    #define ZCOPY_ITER     10           // as in STREAM benchmark

    // Macro to catch CUDA errors in CUDA runtime calls
    #define CUDA_SAFE_CALL(call)                                          \
    do {                                                                  \
        cudaError_t err = call;                                           \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

    // Macro to catch CUDA errors in kernel launches
    #define CHECK_LAUNCH_ERROR()                                          \
    do {                                                                  \
        /* Check synchronous errors, i.e. pre-launch */                   \
        cudaError_t err = cudaGetLastError();                             \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString(err) );       \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
        /* Check asynchronous errors, i.e. kernel failed (ULF) */         \
        err = cudaDeviceSynchronize();                                    \
        if (cudaSuccess != err) {                                         \
            fprintf (stderr, "Cuda error in file '%s' in line %i : %s.\n",\
                     __FILE__, __LINE__, cudaGetErrorString( err) );      \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

    // A routine to give access to a high precision timer on most systems.
    #if defined(_WIN32)
    #if !defined(WIN32_LEAN_AND_MEAN)
    #define WIN32_LEAN_AND_MEAN
    #endif
    #include <windows.h>
    double second (void)
    {
        LARGE_INTEGER t;
        static double oofreq;
        static int checkedForHighResTimer;
        static BOOL hasHighResTimer;

        if (!checkedForHighResTimer) {
            hasHighResTimer = QueryPerformanceFrequency (&t);
            oofreq = 1.0 / (double)t.QuadPart;
            checkedForHighResTimer = 1;
        }
        if (hasHighResTimer) {
            QueryPerformanceCounter (&t);
            return (double)t.QuadPart * oofreq;
        } else {
            return (double)GetTickCount() * 1.0e-3;
        }
    }
    #elif defined(__linux__) || defined(__APPLE__)
    #include <stddef.h>
    #include <sys/time.h>
    double second (void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
    }
    #else
    #error unsupported platform
    #endif

    __global__ void zcopy (const double2 * __restrict__ src, 
                           double2 * __restrict__ dst, int len)
    {
        int stride = gridDim.x * blockDim.x;
        int tid = blockDim.x * blockIdx.x + threadIdx.x;
        for (int i = tid; i < len; i += stride) {
            dst[i] = src[i];
        }
    }    

    struct zcopyOpts {
        int len;
    };

    static int processArgs (int argc, char *argv[], struct zcopyOpts *opts)
    {
        int error = 0;
        memset (opts, 0, sizeof(*opts));
        while (argc) {
            if (*argv[0] == '-') {
                switch (*(argv[0]+1)) {
                case 'n':
                    opts->len = atol(argv[0]+2);
                    break;
                default:
                    fprintf (stderr, "Unknown switch '%c%s'\n", '-', argv[0]+1);
                    error++;
                    break;
                }
            }
            argc--;
            argv++;
        }
        return error;
    }

    int main (int argc, char *argv[])
    {
        double start, stop, elapsed, mintime;
        double2 *d_a, *d_b;
        int errors;
        struct zcopyOpts opts;

        errors = processArgs (argc, argv, &opts);
        if (errors) {
            return EXIT_FAILURE;
        }
        opts.len = (opts.len) ? opts.len : ZCOPY_DEFLEN;

        /* Allocate memory on device */
        CUDA_SAFE_CALL (cudaMalloc((void**)&d_a, sizeof(d_a[0]) * opts.len));
        CUDA_SAFE_CALL (cudaMalloc((void**)&d_b, sizeof(d_b[0]) * opts.len));
        
        /* Initialize device memory */
        CUDA_SAFE_CALL (cudaMemset(d_a, 0x00, sizeof(d_a[0]) * opts.len)); // zero
        CUDA_SAFE_CALL (cudaMemset(d_b, 0xff, sizeof(d_b[0]) * opts.len)); // NaN

        /* Compute execution configuration */
        dim3 dimBlock(ZCOPY_THREADS);
        int threadBlocks = (opts.len + (dimBlock.x - 1)) / dimBlock.x;
        if (threadBlocks > 65520) threadBlocks = 65520;
        dim3 dimGrid(threadBlocks);
        
        printf ("zcopy: operating on vectors of %d double2s (= %.3e bytes)\n", 
                opts.len, (double)sizeof(d_a[0]) * opts.len);
        printf ("zcopy: using %d threads per block, %d blocks\n", 
                dimBlock.x, dimGrid.x);

        mintime = fabs(log(0.0)); // = +infinity; any measured time is smaller
        for (int k = 0; k < ZCOPY_ITER; k++) {
            start = second();
            zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, opts.len);
            CHECK_LAUNCH_ERROR();
            stop = second();
            elapsed = stop - start;
            if (elapsed < mintime) mintime = elapsed;
        }
        printf ("zcopy: mintime = %.3f msec  throughput = %.2f GB/sec\n",
                1.0e3 * mintime, (2.0e-9 * sizeof(d_a[0]) * opts.len) / mintime);

        CUDA_SAFE_CALL (cudaFree(d_a));
        CUDA_SAFE_CALL (cudaFree(d_b));

        return EXIT_SUCCESS;
    }

Then loop over the available devices, running it on each one. The device with the highest throughput is the one that should run the rest of your program.
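
A rough sketch of that selection loop, reusing the zcopy kernel, the error-checking macros, and second() from the listing above (measureZcopyBandwidth and pickDeviceByZcopyBandwidth are just illustrative names):

    // Run the zcopy bandwidth test on every device and select the fastest one.
    // Assumes zcopy(), CUDA_SAFE_CALL, CHECK_LAUNCH_ERROR, second(), ZCOPY_THREADS,
    // ZCOPY_ITER and ZCOPY_DEFLEN from the listing above are in the same file.
    static double measureZcopyBandwidth (int device, int len)
    {
        double2 *d_a, *d_b;
        double mintime = 1.0e30;
        CUDA_SAFE_CALL (cudaSetDevice (device));
        CUDA_SAFE_CALL (cudaMalloc ((void**)&d_a, sizeof(d_a[0]) * len));
        CUDA_SAFE_CALL (cudaMalloc ((void**)&d_b, sizeof(d_b[0]) * len));
        dim3 dimBlock (ZCOPY_THREADS);
        int threadBlocks = (len + (dimBlock.x - 1)) / dimBlock.x;
        if (threadBlocks > 65520) threadBlocks = 65520;
        dim3 dimGrid (threadBlocks);
        for (int k = 0; k < ZCOPY_ITER; k++) {
            double start = second();
            zcopy<<<dimGrid,dimBlock>>>(d_a, d_b, len);
            CHECK_LAUNCH_ERROR();
            double elapsed = second() - start;
            if (elapsed < mintime) mintime = elapsed;
        }
        CUDA_SAFE_CALL (cudaFree (d_a));
        CUDA_SAFE_CALL (cudaFree (d_b));
        return (2.0e-9 * sizeof(double2) * len) / mintime;   // GB/sec
    }

    static int pickDeviceByZcopyBandwidth (void)
    {
        int deviceCount = 0, bestDevice = 0;
        double bestBandwidth = 0.0;
        CUDA_SAFE_CALL (cudaGetDeviceCount (&deviceCount));
        for (int dev = 0; dev < deviceCount; dev++) {
            double bw = measureZcopyBandwidth (dev, ZCOPY_DEFLEN);
            if (bw > bestBandwidth) { bestBandwidth = bw; bestDevice = dev; }
        }
        CUDA_SAFE_CALL (cudaSetDevice (bestDevice));   // run the rest of the app on this one
        return bestDevice;
    }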