CUDA execution multiples of 16ms

krazanmp · May 22, 2015, 8:20pm

I have a sequence of kernels that are executed hundreds to thousands of times, all of which operate on a single set of input data in order to produce a result. Everything functions fine and behaves as expected.

I am focusing on performance now and have noticed that the time difference between executing these kernels on each set of input data is always a difference of some multiple of 16ms +/- 1ms. This is of course very close to the refresh rate of my display devices.

I have changed the behavior of the kernels to modify their performance, and while the times change the differences are always very close to a multiple of 16ms.

The GPUs I am running on have no output device connected to them (I have a separate GPU strictly for Windows to display on my monitors). Our application does not use any graphics on the CUDA device and shouldn’t care about the display rate at all. I have gone through the NVIDIA control panel and disabled all of the graphics-specific settings for this device. In short, I am using the GPU strictly as a non-rendering coprocessor to compute a result and return it to the CPU.

Our application essentially just crunches numbers and produces a result. There is no need for it to sync to the display whatsoever.

Is there some setting I have missed to make this device not behave like it needs to sync with the display?
Is there potentially something else I haven't considered that might be causing this?

Robert_Crovella · May 22, 2015, 8:29pm

Are your GPUs in WDDM mode?

You may have better results under windows if you can use GPUs that can be placed into TCC mode. Having said that, I’m not sure there’s anything about WDDM that would enforce a 16ms cycle. There may be something else going on.

tera · May 23, 2015, 1:33pm

CUDA batches kernels on Windows under WDDM. Insert cudaStreamQuery(0) to force immediate execution of your kernel.

njuffa · May 24, 2015, 12:06pm

You don’t describe the details of your measurement methodology. The timing granularity observed may also be the result of choosing a low-precision host-based timer. If I recall correctly the default low-precision timer in Windows has a granularity that is very close to 16 ms. You would want to use a high-precision timing facility instead, such as gettimeofday() on Linux, which easily allows measurements with 1 usec resolution. Here is some code I have been using for many years, maybe it is helpful.

// A routine to give access to a high precision timer on most systems.
#if defined(_WIN32)
#if !defined(WIN32_LEAN_AND_MEAN)
#define WIN32_LEAN_AND_MEAN
#endif
#include <windows.h>
double second (void)
{
    LARGE_INTEGER t;
    static double oofreq;
    static int checkedForHighResTimer;
    static BOOL hasHighResTimer;

    if (!checkedForHighResTimer) {
        hasHighResTimer = QueryPerformanceFrequency (&t);
        oofreq = 1.0 / (double)t.QuadPart;
        checkedForHighResTimer = 1;
    }
    if (hasHighResTimer) {
        QueryPerformanceCounter (&t);
        return (double)t.QuadPart * oofreq;
    } else {
        return (double)GetTickCount() * 1.0e-3;
    }
}
#elif defined(__linux__) || defined(__APPLE__)
#include <stddef.h>
#include <sys/time.h>
double second (void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}
#else
#error unsupported platform
#endif

Robert_Crovella · May 24, 2015, 1:25pm

Glad you’re back njuffa. I was starting to hyperventilate.

njuffa · May 24, 2015, 1:47pm

It’s refreshing to be completely off the grid every once in a while :-)

krazanmp · May 25, 2015, 9:52pm

@njuffa Thanks for reminding me to double check my timers but yes I am using the same time routines shown in your example.

@tera I am not sure what cudaStreamQuery(0) is intended to do. According to the docs this function simply queries the stream (which in many APIs I have used can stall the pipeline). Does this call have a side-effect of flushing the stream or something? As an aside, I do eventually call cudaStreamQuery once all of the kernels have been inserted into the stream and the CPU thread is ready to read back the results.

njuffa · May 25, 2015, 11:27pm

When running CUDA with the WDDM driver on Windows, cudaStreamQuery(0) has the side effect of firing off the current accumulated batch of commands to the GPU. There are other CUDA API calls with the same side effect, cudaStreamQuery(0) is simply a safe, convenient, and low overhead one. The CUDA driver batches commands when used with the WDDM driver to mitigate the inherent cost of issuing commands to the GPU with that driver model. The batching reduces the average cost of issuing commands, but can cause other performance artifacts.

krazanmp · May 26, 2015, 12:06am

@jnuffa Interesting. Makes sense.

Consequently adding these calls does improve overall performance of my system quite a bit but it still exhibits the n*16ms +/- 1ms timing behaviour. In other words, adding cudaStreamQuery seems to decrease the value of n for many of the inputs but n but the time seems to be locked to (very nearly) 16ms increments only.

njuffa · May 26, 2015, 12:19am

I think at this point you would want to fire up the profiler to find out where the application time goes. If possible, consider splitting your state, e.g. according to a red-black scheme, to gain parallelism and the ability of using multiple streams.

While a “cycle time” of 16 ms may suggest some connection to display refresh rates, I believe this is a red herring based on the information you have provided in this thread.

If your GPU is capable of operating with the more efficient non-graphics TCC driver, I would suggest using that instead of the default WDDM driver, since your GPU is not used for display purposes in the first place.

CudaaduC · May 26, 2015, 12:29am

OP, have you tried running and timing the application from the command prompt using nvprof? That should show you the correct time like this which runs the same __shfl() based bitonic sort in the same stream (serial order) on 32 int2 values 16 times;

==6468== Profiling application: ConsoleApplication1.exe
==6468== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
283.90ms  1.6320us                    -               -         -         -         -      256B  156.86MB/s  GeForce GTX TIT         1         7  [CUDA memcpy HtoD]
283.90ms  4.4160us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [183]
283.96ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [188]
284.00ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [193]
284.05ms  2.9130us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [198]
284.10ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [203]
284.14ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [208]
284.19ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [213]
284.23ms  2.9440us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [218]
284.28ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [223]
284.33ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [228]
284.38ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [233]
284.42ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [238]
284.47ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [243]
284.52ms  2.8800us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [248]
284.56ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [253]
284.61ms  2.9120us              (1 1 1)        (32 1 1)        14        0B        0B         -           -  GeForce GTX TIT         1         7  b_sort(int2*, int) [258]
284.65ms  2.2720us                    -               -         -         -         -      256B  112.68MB/s  GeForce GTX TIT         1         7  [CUDA memcpy DtoH]

From the command line use nvprof --print-gpu-trace yourApplication.exe

krazanmp · May 27, 2015, 11:05pm

It turns out a CPU-side process was causing the 16ms timing behaviour.

I am surprised how much time was saved by calling cudaStreamQuery(0) though. While it should be expected that efficient flushing of any queue would improve the timing of any system, I am seeing 10-20% improvement in completion times. Considering that I was rearranging kernel memory usage patterns to get 5% improvements (time consuming) but saw better actual improvements by careful flushing of the stream.

njuffa · May 28, 2015, 3:19am

Thanks for closing the loop. Sounds like the 16 ms are related to the duration of the Windows time slice?

krazanmp · May 28, 2015, 10:29pm

Somebody was doing a Sleep(0) in an seemingly unrelated area of the code but affected the timing of the CPU threads.

KlausT · May 30, 2015, 9:10pm

If you are using Windows then maybe timeBeginPeriod() can increase the accuracy of your measurements.

Topic		Replies	Views
First kernel execution takes longer CUDA Programming and Performance	8	2846	December 8, 2014
Why kernel calculate speed got slower after waiting for a while? CUDA Programming and Performance cuda	9	1719	July 19, 2022
Oscilating performance, Code total times variates CUDA Programming and Performance	10	10571	June 21, 2009
Different times Ubuntu Vs Windows CUDA Programming and Performance	8	1675	October 12, 2015
why cudaGetDeviceProperties and cudaMallocPitch consume a lot of time CUDA Programming and Performance	18	2340	January 9, 2017
Inconsistent kernel run times CUDA Programming and Performance	12	5782	August 5, 2009
A few questions on CUDA performance with pictures! CUDA Programming and Performance	6	3349	January 10, 2009
Simple CUDA program hitting size limits/errors on Windows but not Linux CUDA Programming and Performance	23	1844	January 12, 2019
Problem regarding data transfer overlap between multiple asynchronous streams CUDA Programming and Performance	8	799	September 11, 2016
Why is the Kernel faster when my matrices are not initialized CUDA Programming and Performance	2	738	December 18, 2017

CUDA execution multiples of 16ms

Related topics