Why does it take 2.9 seconds to start CUDA on a Tesla K20?

I have a CUDA program originally written for a GTX 745 (compute capability 5.0).
I recompiled all the CUDA code for a Tesla K20
(nvcc release 9.1, V9.1.8, -gencode arch=compute_35,code=sm_35).

Once it gets going the K20 is much faster, but
it routinely takes almost three seconds to start CUDA.

Mostly I am using network drives, so I tried copying all inputs and
the Linux image itself to a local disk, but that made almost no difference.

nvprof says the (initial!) call to cudaFree takes up to 585.01ms.

As always, any help would be most welcome.

Am I correct to assume that the system with the GTX 745 and the system with the K20 are not physically the same system? If so, am I correct to assume that the system with the K20 has a much larger system memory?

It seems you have already addressed some potential sources of startup overhead, such as JIT compilation due to GPU architecture mismatch and slow access to any files involved.

[1] Make sure that driver persistency is used, i.e. the driver stays resident even if CUDA is not in use. The modern approach is to use the persistence daemon: http://docs.nvidia.com/deploy/driver-persistence/index.html#persistence-daemon

[2] Startup times can be significant on systems with large amounts of memory (that is the total of all host memory + all GPU memory), all of which needs to be mapped into a unified memory map. To my knowledge, the work required is linear in the amount of total memory that needs to be mapped; beyond that execution speed depends on the performance of the host system (single-thread CPU, system memory).

[3] I assume the cudaFree() call is a cudaFree(0) at the start of your application, designed to absorb the cost of CUDA context creation, as context creation is implicit and triggered by the first CUDA API call. The 585 msec measured would not be unusual for context creation on a system with large amounts of memory.
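To see how much of the startup delay is context creation, one option is to time the first cudaFree(0) against a second one in the same process. The following is only a minimal sketch, not the original poster's code: the helper name timed_free0 is made up for illustration, and the timings it prints will depend entirely on the host system, driver state, and GPU.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: time a single cudaFree(0) call in milliseconds
// using a host-side wall clock, and print the result with the CUDA status.
static double timed_free0(const char* label) {
    auto t0 = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);  // frees nothing; the first call triggers context creation
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("%s: %.2f ms (%s)\n", label, ms, cudaGetErrorString(err));
    return ms;
}

int main() {
    // The first call pays for driver/runtime initialization and context creation.
    timed_free0("first cudaFree(0), includes context creation");
    // The second call runs against the existing context and should be much cheaper.
    timed_free0("second cudaFree(0), context already exists");
    return 0;
}
```

Built with nvcc and run several times, the difference between the two lines gives a rough estimate of the one-time startup cost, independent of what the rest of the application does.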

Dear njuffa,
Thank you for the prompt reply.
Yes indeed, the two GPUs are in very different machines. I mention the GTX 745
because its startup delay is much less.

Tesla K20’s server has 12GB
GTX 745’s desktop has 32GB.

[1] — I will investigate, do I need root permission?
edit: Ok, this looks like the cause? Linux ps gives no sign of the daemon running.
The NVIDIA “Persistence Daemon” documentation makes it pretty clear that I need the sysops to get it going.
[2] my code uses explicit cudaMalloc and cudaMemcpy,
so I think it is not using a unified memory map.
[3] yip (at your recommendation;-) almost at the start of the program,
I call cudaFree(0) immediately after cudaGetDeviceProperties

Thank you

[1] Probably, but not sure, as I haven’t used the persistence daemon yet. The instructions at the page I pointed to look detailed and comprehensive, however, so I would suggest simply working through those. I don’t know whether the legacy persistence mode, turned on via nvidia-smi, still works at this time.

[2] CUDA prepares a unified 64-bit address map so all memory in the system is accessible at unique addresses. This is completely independent of an app’s usage of particular allocation or copy functions (which is not known at driver / runtime startup anyhow).

[3] I think cudaGetDeviceProperties() is the only CUDA API function that does not trigger context creation, so it would make sense that cudaFree(0) absorbs all the overhead.

Is the 12 GB for the server a typo and should really be 128 GB? 12 GB of system memory seems incredibly small for a server. Ideally, a host’s system memory should be four times the size of the total GPU memory in the system, but at least twice the size.

If the server really has a much smaller system memory than the desktop, lack of persistence would seem to be the strongest hypothesis that explains your observations. Lengthy startup due to address space mapping is typically seen in server systems with 100+ GB of system memory.

How does the CPU/memory performance of the server compare to that of the desktop machine? Servers often sport CPUs with many cores but low frequency (~2 GHz), and thus low single-thread performance, which however is precisely what is needed to minimize host-system overhead in the CUDA stack. Servers may also use slower speed grades of memory than desktop machines (partially counterbalanced by providing more memory channels and larger caches in server CPUs).

Dear njuffa,
Again thank you for your prompt reply.
I double checked the K20 server.
top reports Mem: 12317928k total (i.e. 11.7473 GiB).
It's an old machine but has 8 2.67GHz cores.
I guess I should have mentioned it has two K20c (but I am only using one at present).
These are the only things that show up with deviceQuery.

The desktop has 8 3.6GHz cores.
I am logged in (so X11 is running) but try to impose minimal load when
benchmarking the GTX 745.

Do I need root, ie system privileges, to run nvidia-smi ?

As I recall, some but not all nvidia-smi functionality requires root privileges. What features are inaccessible to normal users has changed slightly over the years, if memory serves. nvidia-smi will certainly complain when an operation is attempted with insufficient privileges, so one way to find out is to simply try :-)

Yip, as you guessed, I do need root:

/usr/bin/nvidia-smi -i 0 -pm ENABLED
Unable to set persistence mode for GPU 00000000:04:00.0: Insufficient Permissions
Terminating early due to previous errors.


Ok, I have faked my own persistence daemon by running another program on the
same server which starts in the same way, but immediately after cudaFree(0)
it calls Unix pause() and so hangs, holding on to CUDA resources forever.
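For anyone else without root access, the workaround described above might look roughly like the sketch below. This is a reconstruction under assumptions, not the actual program used here: it skips the cudaGetDeviceProperties call, uses the default device, and simply sleeps in pause() until the process is killed.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <unistd.h>  // pause() (POSIX)

int main() {
    // Force CUDA context creation on the default device, as the real
    // application does at startup.
    cudaError_t err = cudaFree(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFree(0) failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("CUDA context created; holding it until this process is killed\n");
    std::fflush(stdout);

    // Sleep without using CPU time. pause() only returns if a signal handler
    // runs and returns, so loop to keep holding the context in that case.
    for (;;) {
        pause();
    }
}
```

Started in the background (for example under nohup), this keeps the driver initialized for as long as the process lives. It is a stand-in for driver persistence, not a replacement for the persistence daemon, and it ties up a small amount of GPU memory for its own context.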

ps: The K20 start up delay is now 413 milliseconds. This is clearly much
better than almost 3 seconds but still seems a lot.

That’s one way of working around the lack of root privileges :-)

So with driver persistence now in place, has the startup time for your actual application been reduced to an acceptable duration?

Dear njuffa,
Sorry our messages seem to have crossed.
The current startup overhead figure 413 milliseconds is based on
many runs. Obviously much better but still worryingly high.
For example, if I run nvprof on a small (but not unrealistic) example,
nvprof still says most of the overhead is in cudaFree(0), approx 240ms (it varies),
whereas nvprof says the kernels and the cudaMemcpy calls take about 2.5ms in total.

How does this compare to the startup time on your desktop system?

I am not sure what more can be done here from the user side other than using the fastest host system available (my standard recommendations include high-frequency CPU, quad-channel DDR4-2666, NVMe SSD).

You said there are actually two Tesla K20 in the system, if I recall correctly? Maybe disabling the second unused GPU with the CUDA_VISIBLE_DEVICES environment variable would help? I have never tried that approach before to influence startup time; not sure how much of a difference it would make. Worth a quick try if you are sufficiently desperate.
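If changing the shell environment is awkward, the same experiment can be approximated from inside the program by setting the variable before the first CUDA call. This is a hedged sketch: it assumes a POSIX system (setenv), assumes that device index "0" is the K20 you actually want to use, and relies on the variable being read when CUDA initializes, so it must run before any other CUDA API call.

```cuda
#include <cstdio>
#include <cstdlib>  // setenv (POSIX)
#include <cuda_runtime.h>

int main() {
    // Assumption: index "0" is the GPU we want; the other GPU is hidden from CUDA.
    // This must run before the first CUDA runtime call in the process.
    if (setenv("CUDA_VISIBLE_DEVICES", "0", 1) != 0) {
        std::perror("setenv");
        return 1;
    }

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // With one GPU exposed, this is expected to report a single device.
    std::printf("CUDA sees %d device(s)\n", count);

    err = cudaFree(0);  // context creation now only considers the visible GPU
    std::printf("cudaFree(0): %s\n", cudaGetErrorString(err));
    return 0;
}
```

Setting the variable in the shell before launching the unmodified application is the simpler route; the in-program form is mainly useful when the launch environment cannot be controlled.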

One difference between Tesla K20 and a GeForce consumer card is ECC support on the former. I have a vague recollection that ECC scrubbing is part of the startup overhead on GPUs with ECC. If so, I don’t know how much time is spent in that and whether turning ECC off would make much difference. Since ECC on/off is a global GPU setting, root privileges are required by nvidia-smi to change it (and it may even require a system reboot to take effect).

You could obviously file an RFE with NVIDIA requesting lower CUDA startup cost. Given the expanded feature set supported by modern CUDA, setting up a GPU context is somewhat akin to booting an OS, so I am not sure what further optimizations are possible. Philosophically, CUDA is targeted at heavy computational lifting, not necessarily mini-applications whose total run-time is in the sub-second range, as yours seems to be.

Dear njuffa,
Once again, many thanks for your help. A quick test
suggests that using CUDA_VISIBLE_DEVICES to shut out the second K20
does not make much (if any) difference to the start up time.

On the GTX 745 the start up time is 145ms (average of many runs).
nvprof says cudaFree(0) takes about 200ms on the first run (no persistence daemon)
and 100ms afterwards.

Many thanks

Your desktop system is unlikely to need the persistence daemon because it is presumably running X for a graphical desktop, which should keep the driver loaded.

Unfortunately there are too many different variables in play here to discern which factors (speed of host system, amount of memory in each system, GPUs used, etc) contribute how much to the overall CUDA startup cost.