Why 2.9 seconds to start Tesla K20?

I have a CUDA program originally written for a GTX 745 (compute capability 5.0).
I recompiled all the CUDA code for the Tesla K20
(nvcc release 9.1, V9.1.8, -gencode arch=compute_35,code=sm_35).

Once it gets going the K20 is much faster, but
it routinely takes almost three seconds to start CUDA.

Mostly I am using network drives, so I tried copying all inputs and
the Linux image itself to a local disk, but that made almost no difference.

nvprof says the (initial!) call to cudaFree() takes up to 585.01 ms.

As always any help would be most welcome
Bill

Am I correct to assume that the system with the GTX 745 and the system with the K20 are not physically the same system? If so, am I correct to assume that the system with the K20 has a much larger system memory?

It seems you have already addressed some potential sources of startup overhead, such as JIT compilation due to GPU architecture mismatch and slow access to any files involved.

[1] Make sure that driver persistence is used, i.e. the driver stays resident even if CUDA is not in use. The modern approach is to use the persistence daemon; see Driver Persistence :: GPU Deployment and Management Documentation.

[2] Startup times can be significant on systems with large amounts of memory (that is the total of all host memory + all GPU memory), all of which needs to be mapped into a unified memory map. To my knowledge, the work required is linear in the amount of total memory that needs to be mapped; beyond that execution speed depends on the performance of the host system (single-thread CPU, system memory).

[3] I assume the cudaFree() call is a cudaFree(0) at the start of your application, designed to absorb the cost of CUDA context creation, as context creation is implicit and triggered by the first CUDA API call. The ~585 msec measured would not be unusual for context creation on a system with large amounts of memory.
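To isolate that cost experimentally, a minimal sketch along the following lines (illustrative only, not taken from your program) times the first cudaFree(0) against a repeat call:

// Minimal sketch: time the first cudaFree(0), which absorbs implicit
// CUDA context creation, and compare it to a second call on the live context.
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main() {
    using clk = std::chrono::steady_clock;

    auto t0 = clk::now();
    cudaError_t err = cudaFree(0);   // first runtime call: triggers context creation
    auto t1 = clk::now();
    printf("first  cudaFree(0): %s, %.1f ms\n", cudaGetErrorString(err),
           std::chrono::duration<double, std::milli>(t1 - t0).count());

    t0 = clk::now();
    err = cudaFree(0);               // context already exists: should be cheap
    t1 = clk::now();
    printf("second cudaFree(0): %s, %.1f ms\n", cudaGetErrorString(err),
           std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}

On a freshly loaded driver, the first call should show the multi-hundred-millisecond cost you observed, while the second should be negligible.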

Dear njuffa,
Thank you for the prompt reply.
Yes indeed, the two GPUs are in very different machines; I mention the GTX 745
since its startup delay is much smaller.

The Tesla K20’s server has 12 GB,
the GTX 745’s desktop has 32 GB.

[1] I will investigate; do I need root permission?
edit: OK, this looks like the cause? Linux ps gives no sign of the daemon running.
The NVIDIA “Persistence Daemon” documentation makes it pretty clear I need sys ops to get it going.
[2] My code uses explicit cudaMalloc and cudaMemcpy,
so I think it is not using a unified memory map.
[3] Yip (at your recommendation ;-). Almost at the start of the program,
I call cudaFree(0) immediately after cudaGetDeviceProperties.

Thank you
Bill

[1] Probably, but not sure, as I haven’t used the persistence daemon yet. The instructions at the page I pointed to look detailed and comprehensive, however, so I would suggest simply working through those. I don’t know whether the legacy persistence mode, turned on via nvidia-smi, still works at this time.

[2] CUDA prepares a unified 64-bit address map so all memory in the system is accessible at unique addresses. This is completely independent of an app’s usage of particular allocation or copy functions (which is not known at driver / runtime startup anyhow).

[3] I think cudaGetDeviceProperties() is the only CUDA API function that does not trigger context creation, so it would make sense that cudaFree(0) absorbs all the overhead.
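If you want to verify that empirically, one possibility (just a quick sketch, not something I have run on your setup) is to query the primary context state through the driver API around each runtime call; build with nvcc and link against the driver API (-lcuda):

// Sketch: report whether the primary context is active before and after
// selected runtime API calls, using cuDevicePrimaryCtxGetState.
#include <cstdio>
#include <cuda.h>
#include <cuda_runtime.h>

static void report(const char *label) {
    unsigned int flags = 0;
    int active = 0;
    cuInit(0);                                  // initializes the driver, creates no context
    cuDevicePrimaryCtxGetState(0, &flags, &active);
    printf("%-35s primary context active: %d\n", label, active);
}

int main() {
    report("at program start:");

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // should not create a context
    report("after cudaGetDeviceProperties():");

    cudaFree(0);                                // first real runtime call: creates the context
    report("after cudaFree(0):");
    return 0;
}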

Is the 12 GB for the server a typo and should really be 128 GB? 12 GB of system memory seems incredibly small for a server. Ideally, a host’s system memory should be four times the size of the total GPU memory in the system, but at least twice the size.

If the server really has a much smaller system memory than the desktop, lack of persistence would seem to be the strongest hypothesis that explains your observations. Lengthy startup due to address space mapping is typically seen in server systems with 100+ GB of system memory.

How does the CPU/memory performance of the server compare to that of the desktop machine? Servers often sport CPUs with many cores but low frequency (~2 GHz), and thus low single-thread performance; high single-thread performance, however, is precisely what is needed to minimize host-system overhead in the CUDA stack. Servers may also use slower speed grades of memory than desktop machines (partially counterbalanced by more memory channels and larger caches in server CPUs).

Dear njuffa,
Again thank you for your prompt reply.
I double checked the K20 server.
Top says Mem: 12317928k total (i.e. 11.7473 GB).
It’s an old machine but has 8 cores at 2.67 GHz.
I guess I should have mentioned it has two K20c (but I am only using one at present).
These are the only things that show up with deviceQuery.

The desktop has 8 cores at 3.6 GHz.
I am logged in (so X11 is running) but try to impose minimal load when
benchmarking the GTX 745.

Do I need root, i.e. system privileges, to run nvidia-smi?
Bill

As I recall, some but not all nvidia-smi functionality requires root privileges. What features are inaccessible to normal users has changed slightly over the years, if memory serves. nvidia-smi will certainly complain when an operation is attempted with insufficient privileges, so one way to find out is to simply try :-)

Yip, as you guessed, I do need root:

/usr/bin/nvidia-smi -i 0 -pm ENABLED
Unable to set persistence mode for GPU 00000000:04:00.0: Insufficient Permissions
Terminating early due to previous errors.

Bill

OK, I have faked my own persistence daemon by running another program on the
same server which starts in the same way, but immediately after cudaFree(0)
it calls the Unix pause() and so hangs, holding on to CUDA resources forever.
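In case it is useful to anyone else, a stripped-down version of the idea (not my actual program, just the essential calls) looks like this:

// Stripped-down fake persistence daemon: create a CUDA context,
// then block forever so the driver and context stay resident.
#include <unistd.h>          // pause()
#include <cuda_runtime.h>

int main() {
    cudaFree(0);             // first CUDA call: creates and holds a context
    pause();                 // block until a signal arrives (i.e. forever)
    return 0;
}

It just needs to keep running in the background on the server.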

Bill
PS: The K20 startup delay is now 413 milliseconds. This is clearly much
better than almost 3 seconds but still seems like a lot.

That’s one way of working around the lack of root privileges :-)

So with driver persistence now in place, has the startup time for your actual application been reduced to an acceptable duration?

Dear njuffa,
Sorry, our messages seem to have crossed.
The current startup overhead figure of 413 milliseconds is based on
many runs. Obviously much better, but still worryingly high.
For example, if I run nvprof on a small (but not unrealistic) example,
nvprof still says most of the overhead is in cudaFree(0), approx. 240 ms (it varies),
whereas nvprof says the kernels and the cudaMemcpy calls take about 2.5 ms in total.

How does this compare to the startup time on your desktop system?

I am not sure what more can be done here from the user side other than using the fastest host system available (my standard recommendations include high-frequency CPU, quad-channel DDR4-2666, NVMe SSD).

You said there are actually two Tesla K20s in the system, if I recall correctly? Maybe disabling the second, unused GPU with the CUDA_VISIBLE_DEVICES environment variable would help? I have never tried that approach to influence startup time, so I am not sure how much of a difference it would make. Worth a quick try if you are sufficiently desperate.
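The usual way is to set the variable in the shell before launching the application; it can also be set programmatically, as long as that happens before the very first CUDA call. A rough sketch of the latter (the device index “0” is just an example):

// Sketch: restrict the CUDA runtime to a single GPU by setting
// CUDA_VISIBLE_DEVICES before the first CUDA call. Equivalent to
// exporting the variable in the shell before launching the application.
#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must happen before any CUDA runtime call, otherwise it is ignored.
    setenv("CUDA_VISIBLE_DEVICES", "0", 1);

    cudaFree(0);                          // context creation now sees only one device
    int count = 0;
    cudaGetDeviceCount(&count);
    printf("visible devices: %d\n", count);   // expected: 1
    return 0;
}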

One difference between the Tesla K20 and a GeForce consumer card is ECC support on the former. I have a vague recollection that ECC scrubbing is part of the startup overhead on GPUs with ECC. If so, I don’t know how much time is spent in that and whether turning ECC off would make much difference. Since ECC on/off is a global GPU setting, root privileges are required by nvidia-smi to change it (and it may even require a system reboot to take effect).

You could obviously file an RFE with NVIDIA requesting lower CUDA startup cost. Given the expanded feature set supported by modern CUDA, setting up a GPU context is somewhat akin to booting an OS, so I am not sure what further optimizations are possible. Philosophically, CUDA is targeted at heavy computational lifting, not necessarily mini-applications whose total run-time is in the sub-second range, as yours seems to be.

Dear njuffa,
Once again, many thanks for your help. A quick test
suggests that using CUDA_VISIBLE_DEVICES to shut out the second K20
does not make much (if any) difference to the startup time.

On the GTX 745 the startup time is 145 ms (average of many runs).
nvprof says cudaFree(0) takes about 200 ms on the first run (no persistence daemon)
and 100 ms afterwards.

Many thanks
Bill

Your desktop system is unlikely to need the persistence daemon because it is presumably running X for a graphical desktop, which should keep the driver loaded.

Unfortunately, there are too many different variables in play here to discern which factors (speed of host system, amount of memory in each system, GPUs used, etc.) contribute, and by how much, to the overall CUDA startup cost.