Doing precise timing tests on my system display device (or perils thereof)

michaelrrolle45 · July 25, 2019, 7:54am

I have a low-end GPU which I bought for use as my system (Windows) display device. I’m using it for CUDA development work, for what I hope will be a high-performance application. Before I invest in a more powerful GPU and look for development partners, I’d like to have some confidence that my code is capable of running at near 100% of the compute capacity of the ultimate target device.
If any of you think this is a fool’s errand, please tell me.
Otherwise, I want to get some reasonable timings of the kernel code on my present device. The kernel is predominantly compute bound and does little communication with the host. Ideally, the clock() interval should be exactly the same for the same code most of the time.
However, when the times get up to 200 or more cycles, the actual interval samples are so far spread out that I cannot tell how long the code actually takes when nothing is interfering with it.
My device has only a single Kepler SMX, and it is being used for all the system display functions. I would like to reserve half of the SMX for my kernel, which is 96 of the CUDA cores and two warp schedulers. If the second warp scheduler can be kept idle, and the first one runs nothing but my kernel, then it should supposedly run at the same speed all (or almost all) of the time. Then my clock intervals would be a lot more useful to me.
My device does indeed support priorities, with levels 0 and -1. Do you think this would help my kernel get more processing cycles? And what functions could I call to control the warp schedulers? Is it possible to write a second kernel for the second scheduler that will not interfere with instruction dispatch by the first scheduler? If so, then I might assign both of these kernels to the same pair of schedulers and reserve them.
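
For readers who want to reproduce the kind of measurement described above, here is a minimal sketch, not the actual code from this post. The `work()` function is a hypothetical stand-in for the code under test, and `time_samples`, `min_cycles` and `median_cycles` are illustrative names. It takes many `clock64()` intervals of the same code and reports the minimum and median, on the assumption that the minimum is the closest approximation to an undisturbed run. Note that the compiler may schedule instructions across the `clock64()` reads, so the numbers are approximate.

```cuda
#include <algorithm>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the code under test: a dependent FMA chain.
__device__ float work(float x) {
    for (int i = 0; i < 64; ++i) x = fmaf(x, 1.0001f, 0.5f);
    return x;
}

// A single warp times the same code n times; thread 0 records each interval.
__global__ void time_samples(long long *cycles, float *sink, int n) {
    float x = static_cast<float>(threadIdx.x);
    for (int s = 0; s < n; ++s) {
        long long t0 = clock64();
        x = work(x);
        long long t1 = clock64();
        if (threadIdx.x == 0) cycles[s] = t1 - t0;
    }
    sink[threadIdx.x] = x;  // store the result so work() is not optimized away
}

// Smallest sample: the best guess at the interference-free time.
static long long min_cycles(const std::vector<long long> &v) {
    return *std::min_element(v.begin(), v.end());
}

// Median sample: shows how far typical runs sit from the minimum.
static long long median_cycles(std::vector<long long> v) {
    std::sort(v.begin(), v.end());
    return v[v.size() / 2];
}

int main() {
    const int n = 1000;
    long long *d_cycles = nullptr;
    float *d_sink = nullptr;
    cudaMalloc(&d_cycles, n * sizeof(long long));
    cudaMalloc(&d_sink, 32 * sizeof(float));

    time_samples<<<1, 32>>>(d_cycles, d_sink, n);

    std::vector<long long> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d_cycles, n * sizeof(long long),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("min %lld cycles, median %lld cycles\n",
                min_cycles(h), median_cycles(h));

    cudaFree(d_cycles);
    cudaFree(d_sink);
    return 0;
}
```

A large gap between the minimum and the median is one sign that something else is sharing the device during the run.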

Robert_Crovella · July 25, 2019, 1:24pm

You don’t have the ability to individually control warp schedulers, or split up a SMX the way you are suggesting. Priorities don’t allow you to arbitrate between CUDA and graphics.
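
As a rough illustration of what the priority levels do control, the sketch below queries the stream priority range and launches a kernel in the highest-priority stream. `dummy_kernel` and `has_priorities` are hypothetical names used only for this example. In the CUDA runtime a numerically lower value means higher priority, so the reported levels 0 and -1 correspond to least = 0 and greatest = -1. These priorities only order work between CUDA streams in the same process; they are not a mechanism for taking cycles away from the display.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; a real kernel would go here.
__global__ void dummy_kernel(int *out) {
    if (blockIdx.x == 0 && threadIdx.x == 0) *out = 1;
}

// Stream priorities are only meaningful if the range has more than one level.
static bool has_priorities(int least, int greatest) {
    return least != greatest;
}

int main() {
    int least = 0, greatest = 0;
    cudaError_t err = cudaDeviceGetStreamPriorityRange(&least, &greatest);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("priority range: least=%d greatest=%d (%s)\n", least, greatest,
                has_priorities(least, greatest) ? "supported" : "single level");

    // Create a stream at the highest available priority.
    cudaStream_t hi;
    cudaStreamCreateWithPriority(&hi, cudaStreamNonBlocking, greatest);

    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    dummy_kernel<<<1, 32, 0, hi>>>(d_out);

    err = cudaStreamSynchronize(hi);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaFree(d_out);
    cudaStreamDestroy(hi);
    return 0;
}
```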

If you want to reserve a device for CUDA, I would switch to Linux, which is also my general recommendation for CUDA work.

tera · July 25, 2019, 9:35pm

I would recommend getting a more powerful device dedicated to CUDA as soon as possible, and devote your existing card to driving the display.
My experience when I started with CUDA a while ago was that the scaling did not work out as I anticipated, and I had to rework large parts of my code.
CUDA has become more forgiving to non-optimal code since, and easier to use in general. So my experience may not directly translate to contemporary CUDA versions and GPUs, where even low-end devices are much more powerful now. But I’d believe it still makes a difference, and timing code on a WDDM device also driving the display is so hard I wouldn’t even try.

On Windows you’d want a card you can put into TCC mode (a Tesla, or apparently (some?) Titans, which you’ll now have to buy second-hand).
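
To see which driver model a given card is running under, something like the following sketch could be used. It reads the TCC and kernel-timeout attributes for device 0. The helper `looks_dedicated` is a made-up name, and treating "TCC driver, or no kernel run-time limit" as "probably not driving a display" is a heuristic assumed for this example, not an official test.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Heuristic (an assumption for this sketch): a device in TCC mode, or one with
// no kernel run-time limit, is probably not also driving a display.
static bool looks_dedicated(int tccDriver, int kernelTimeout) {
    return tccDriver != 0 || kernelTimeout == 0;
}

int main() {
    const int dev = 0;  // assumes the device of interest is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, dev);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int tcc = 0, timeout = 0, smCount = 0;
    cudaDeviceGetAttribute(&tcc, cudaDevAttrTccDriver, dev);
    cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev);
    cudaDeviceGetAttribute(&smCount, cudaDevAttrMultiProcessorCount, dev);

    std::printf("device 0: %s\n", prop.name);
    std::printf("  multiprocessors:      %d\n", smCount);
    std::printf("  TCC driver:           %s\n", tcc ? "yes" : "no");
    std::printf("  kernel run-time limit: %s\n", timeout ? "yes" : "no");
    std::printf("  %s\n", looks_dedicated(tcc, timeout)
                              ? "likely usable as a dedicated compute device"
                              : "likely shared with the display; expect noisy timings");
    return 0;
}
```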
