CUDA setup time is the time needed to initialize the CUDA context on the GPU, allocate memory and release the CUDA context afterwards.
This setup time matters especially for small problems, such as simple image filters. It shows that it is crucial to keep the CUDA context alive, so the overhead is not paid again for every new CUDA computation.
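With the runtime API the context is created implicitly on the first CUDA call of a host thread (this was the behaviour in CUDA 2.3), so one way to avoid paying the cost per computation is to trigger it once at application start and keep doing all CUDA work on that same thread. A minimal sketch of such a warm-up could look like this; the helper name is made up and it assumes that the first allocation carries most of the one-time cost:

// Hypothetical warm-up helper: pays the context creation and the expensive
// first allocation once at application start, so later computations on the
// same thread only see transfer and kernel times.
#include <cuda_runtime.h>

void warmUpCuda()
{
    void* dummy = 0;
    cudaFree(0);                 // triggers the lazy context creation on this thread
    cudaMalloc(&dummy, 1 << 20); // first cudaMalloc is typically the expensive one (~30 ms here)
    cudaFree(dummy);
}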
These setup times are fairly constant, but there are also peaks that occur randomly.
Times in milliseconds:
CreateContext: 15.8
GetDeviceProperties: 0
Malloc: 29.5
Memset: 0
ThreadSynchronize: 0 (without waiting for any real synchronization)
Free: 0.3
ThreadExit: 4.2
The x-axis shows the amount of memory in MByte, the y-axis the time in milliseconds.
You get ~50 ms of constant overhead due to the initialization of CUDA for a single kernel run. The biggest fraction of this time is caused by cudaMalloc, which takes about 30 ms. A further source of overhead is the data transfer to and from the device; this overhead scales linearly with the amount of data transferred.
I wrote a simple benchmark that times every CUDA call; additionally, the amount of memory is adjustable via a command-line parameter. A batch script runs the benchmark with different parameters (1–128) and writes the times to a semicolon-separated file.
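This is not the original benchmark code, just a minimal sketch of the idea: wrap each CUDA call in a host-side timer and print one semicolon-separated line per run (the real benchmark also timed GetDeviceProperties, Memset, ThreadSynchronize and ThreadExit).

// Minimal sketch of the timing benchmark: buffer size in MByte comes from
// the command line, every CUDA call is timed on the host, results are
// printed semicolon-separated.
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static double msSince(std::chrono::steady_clock::time_point t0)
{
    return std::chrono::duration<double, std::milli>(
        std::chrono::steady_clock::now() - t0).count();
}

int main(int argc, char** argv)
{
    size_t mb    = (argc > 1) ? atoi(argv[1]) : 1;   // amount of memory in MByte
    size_t bytes = mb * 1024 * 1024;
    char*  host  = (char*)malloc(bytes);
    char*  dev   = 0;

    auto t0 = std::chrono::steady_clock::now();
    cudaFree(0);                                      // forces context creation
    double tContext = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMalloc((void**)&dev, bytes);
    double tMalloc = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    double tUpload = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    double tDownload = msSince(t0);

    t0 = std::chrono::steady_clock::now();
    cudaFree(dev);
    double tFree = msSince(t0);

    // one line per run, ready for a spreadsheet
    printf("%zu;%f;%f;%f;%f;%f\n", mb, tContext, tMalloc, tUpload, tDownload, tFree);
    free(host);
    return 0;
}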
Knowing the execution time on the CPU, the constant overhead, and the transfer times for a given amount of data, we can calculate the maximum speedup achievable with CUDA.
To do this we use Amdahl's law.
Assume the algorithm takes 200 ms on the CPU and operates on 32 MB of data. We do this calculation only once and cannot reuse the CUDA context due to design restrictions in the existing code (every calculation creates a new thread). So our overhead will be about 100 ms for the CUDA setup: 50 ms constant overhead and 50 ms transfer time.
In such a case the speedup (200 ms / 100 ms) will be at best a factor of ~2x, and even that assumes the parallel computation on the GPU takes no time at all.
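As a quick sanity check of these numbers, the bound can be written as a tiny helper (the function name and the simple overhead model of constant setup plus transfer time are mine, following the reasoning above):

// Best possible speedup if the GPU computation itself took zero time:
// the CPU time divided by the unavoidable setup and transfer overhead.
#include <cstdio>

double maxSpeedup(double cpuMs, double setupMs, double transferMs)
{
    return cpuMs / (setupMs + transferMs);
}

int main()
{
    printf("%.1f\n", maxSpeedup(200.0, 50.0, 50.0));    // 32 MB example above  -> ~2.0x
    printf("%.1f\n", maxSpeedup(1000.0, 50.0, 100.0));  // 110 MB example below -> ~6.7x
    return 0;
}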
If the algorithm took 1000 ms and the amount of data were 110 MB (about 150 ms of total CUDA overhead), the maximum speedup would be ~6.7x in the best case. (I'm aware that I could use pinned memory to accelerate the data transfer. To do that I would need to allocate pinned memory and first copy the data from non-pinned to pinned memory, which costs additional time and extra memory on the host.)
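For illustration, the pinned-memory variant would look roughly like this; the helper name is made up and I did not measure this path:

// Sketch of the pinned-memory upload mentioned above: a page-locked staging
// buffer is allocated, the pageable data is copied into it on the host, and
// the device transfer then runs at the higher pinned-memory bandwidth.
// The staging buffer costs extra host memory and the extra memcpy costs time.
#include <cstring>
#include <cuda_runtime.h>

void uploadViaPinned(const void* pageableSrc, void* deviceDst, size_t bytes)
{
    void* pinned = 0;
    cudaMallocHost(&pinned, bytes);        // page-locked host memory
    memcpy(pinned, pageableSrc, bytes);    // extra host-to-host copy
    cudaMemcpy(deviceDst, pinned, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(pinned);
}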
For larger amounts of data the transfer time becomes dominant and the constant overhead is negligible. In this case it is important that the ratio (computation done on the data) / (data size) is high enough to achieve a good speedup on the GPU.
There are many applications around that can make good use of GPU computing, but most consumer applications don't meet these premises of large, compute-intensive data sets, because the software must run on existing hardware. Quite often there are many small jobs (only milliseconds each) to be done, and CUDA is not well suited for such cases.
I used a machine with a Quadro FX 4800 and a Xeon 3450 to run these benchmarks, with CUDA 2.3 on Windows XP x64.