CUDA setup times (create context, malloc, destroy context), some measurements included

CUDA setup time is the time needed to initialize the CUDA context on the GPU, allocate device memory, and release the CUDA context again.

The setup time is especially important for small problems, such as simple image filters. The measurements show that it is crucial to keep the CUDA context alive so that this overhead is not paid for every new CUDA computation.

These times are quite constant, but peaks can also occur at random.

Times in milliseconds:

CreateContext        15.8
GetDeviceProperties   0
Malloc               29.5
Memset                0
ThreadSynchronize     0 (without waiting for any real synchronization)
Free                  0.3
ThreadExit            4.2

The x-axis is the amount of memory in MByte, the y-axis shows the time in milliseconds.

You get ~50 ms of constant overhead due to the initialization of CUDA for a single kernel run. The biggest fraction of this time is caused by cudaMalloc, which takes about 30 ms. A further source of overhead is the data transfer to and from the device. This overhead scales linearly with the amount of data transferred.

I wrote a simple benchmark and timed every CUDA call. Additionally, I made the amount of memory adjustable by a command-line parameter. A batch script runs the benchmark with different parameters (1 - 128) and the times are written to a semicolon-separated file.

Knowing the execution time on the CPU, the constant overhead, and the transfer times for a given amount of data on the GPU, we can calculate the maximum speedup achievable with CUDA.

To do this we use Amdahl’s law.

Assume the algorithm takes 200 ms on the CPU and works on 32 MB of data. We do this calculation only once and cannot reuse the CUDA context due to design restrictions in the existing code (every calculation creates a new thread). So our overhead will be about 100 ms for the CUDA setup: 50 ms constant overhead and 50 ms transfer time.

In such a case the speedup (200 ms / 100 ms) will be only a factor of ~2x in the optimal case. That bound, however, assumes that the parallel computation on the GPU itself does not need any time.

If the algorithm took 1000 ms and the amount of data were 110 MB (150 ms CUDA overhead), the maximum speedup would be ~6.7 in the best case. (I’m aware that I could use pinned memory to accelerate the data transfer. To do that I would need to allocate pinned memory and afterwards copy the data from non-pinned to pinned memory, which would need additional time and extra memory space on the host.)
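The two estimates above can be reproduced with a few lines of C. This is only a sketch of the arithmetic: the function name gpu_speedup and its parameters are made up for illustration, and the overhead values are the rough figures quoted in the post, not new measurements.

```c
#include <assert.h>

/* Speedup of a GPU run over a CPU run of cpu_ms milliseconds.
   overhead_ms is the CUDA setup plus transfer time, kernel_ms the time of
   the actual GPU computation. Passing kernel_ms = 0 gives the upper bound
   used in the text. */
double gpu_speedup(double cpu_ms, double overhead_ms, double kernel_ms)
{
    assert(cpu_ms > 0.0 && overhead_ms >= 0.0 && kernel_ms >= 0.0);
    assert(overhead_ms + kernel_ms > 0.0);
    return cpu_ms / (overhead_ms + kernel_ms);
}
```

gpu_speedup(200.0, 100.0, 0.0) gives the factor of 2 from the first example, and gpu_speedup(1000.0, 150.0, 0.0) gives about 6.7. Any non-zero kernel time lowers both values.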

For bigger amounts of data the transfer time becomes dominant and the constant overhead is negligible. In this case it is important that the ratio (calculations done on data)/(data size) is high enough to achieve a good speedup on the GPU.

There are many applications around that can make good use of GPU computing, but most consumer applications do not meet this premise of big, compute-intensive data, because the software must run on the existing hardware. Quite often you have many small jobs (only milliseconds each) that have to be done, and CUDA is not well suited to such cases.

I used a machine with a Quadro FX 4800 and a Xeon 3450 to do these benchmarks, with CUDA 2.3 on Windows XP x64.

Have you tried doing another cudaMalloc after the first one? Does each one still take 30 ms?

Now I did a benchmark of successive cudaMalloc calls.

Only the first one takes that long. The following calls take less than 1 ms (except for peak values); 16 MByte are allocated in each call. This is true even for bigger chunks like 200 MB.

MB; Time in milliseconds

0; 41.2998077
16; 0.0800781
32; 0.3796290
48; 0.5645742
64; 0.1327184
80; 0.1371319
96; 12.7466021
112; 0.0783594
128; 0.0746495
...
1312; 0.0788387
1328; 0.0780378
1344; 0.0781197
1360; 0.0782663
1376; 0.0790692
1392; 0.0831405

Is the malloc the first CUDA command you run? If so, then there is the driver initialization and kernel loading. From your tests I gather that these happen for every new process. I wonder if there is a way to keep it ready. If CUDA has one live context on the GPU and you then add another, for example, will it take this long to fire things up?

I am pretty sure those costs are all per context. Which means you really want to either keep a persistent thread holding a live context to do all GPU work, or use the context migration API. The analysis is interesting, but I can’t agree with the conclusions.

No, it is not the first CUDA command. The first one is cudaGetDeviceCount. This command takes ~15 milliseconds if it is the first one. Successive calls take less than 0.1 ms. That is why I called the measurement “create context”.

The first cudaMalloc call takes ~30 milliseconds. If I use cudaMalloc as the first call I get ~43 milliseconds, which is about the time to create the context plus the first cudaMalloc call.

I have not done any tests yet, but I assume that it takes that long for every new thread that is created.

These values are only valid for Windows XP. On Windows Vista / 7 I expect an even longer setup time. Kernel calls take longer there in any case.

I can put the code online, if someone is interested.

On Linux, if you run “nvidia-smi --loop-continuously” in the background, deviceQuery/bandwidthTest/etc. will start significantly faster.

You may want to try that if there is a similar utility on Windows.

In which aspects do you disagree?

tmurray reported that the context migration API is also costly, but I don’t know how much.

The whole thing is premised on the assumption that it is necessary to establish a new context every time you want to do something with CUDA. I don’t believe that is a realistic or representative usage model, that’s all. Most of the fixed costs, including memory allocation, can be amortized to near 0 over the life of an application which only ever establishes one context. Which is probably the majority of cases.
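To put a number on the amortization argument, here is a small sketch in C. The function name amortized_overhead_ms is hypothetical, and the inputs used in the note below (about 50 ms of one-time setup, about 0.08 ms for each later cudaMalloc) are simply the rough values reported earlier in the thread.

```c
#include <assert.h>

/* Average setup overhead per job, in milliseconds, when a single context
   is created once (fixed_ms) and then serves `jobs` jobs that each pay
   only a small recurring cost (per_job_ms). */
double amortized_overhead_ms(double fixed_ms, double per_job_ms, unsigned jobs)
{
    assert(jobs > 0);
    return fixed_ms / (double)jobs + per_job_ms;
}
```

With one job the overhead is about 50.08 ms; spread over 1000 jobs in the same context it drops to about 0.13 ms per job.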

cudaGetDeviceCount doesn’t create a context. The first cudaMalloc will create a context. If you want to force context creation before a cudaMalloc, use cudaFree(0).

CapJo, thanks for reporting the results! These kinds of details always help design.

Could you post the framework code you use for the measurement? It’s fine if it’s crude.

I did my own context create, cudaMalloc, and kernel launch speed tests about a year ago and found speeds much faster for malloc than your 30ms, but of course I don’t have the specific data any longer so I can’t give a real comparison. But now I’m curious again!

Do TWO cudaMallocs then take 60ms?

@tmurray: wow, great hack, I can use that! I often am still organizing data on the CPU before copying over to the GPU so I have a lull time where I know the device to use but don’t know the mallocs I need yet. I was actually considering adding a dummy cudaMalloc of 1 byte to create a context while the CPU was still preparing.

But why does the first cudaGetDeviceCount take about 15 ms and the second less than 0.1 ms, if it does not create a context?

Only the first cudaMalloc takes that long; the second takes far less than a millisecond.

Here is my code. It runs only on Windows, since it uses high performance counters for the measurements (not necessary for this measurement).

I give no guarantees for correctness.

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// parenthesized so the macro expands safely inside larger expressions
#define MEGABYTE (1024 * 1024)

// allocate 1536 MB memory on the host
char h_data[1536 * MEGABYTE];

// convert a pair of performance counter readings to milliseconds
static double elapsedMs(LARGE_INTEGER start, LARGE_INTEGER end, LARGE_INTEGER ticksPerSecond)
{
	return 1000.0*(double)(end.QuadPart - start.QuadPart)/(double)ticksPerSecond.QuadPart;
}

// write the CUDA error string to the results file if a call failed
static void logCudaError(FILE* out, cudaError_t status)
{
	if(status != cudaSuccess){fprintf(out, "%s\n", cudaGetErrorString(status));}
}

int main(int argc, char** argv)
{
	//cuda data
	cudaError_t cuda_status;
	int cudaDeviceCount;
	cudaDeviceProp cudaDeviceInfo;
	char* d_data;

	// other stuff
	int megabytes;
	size_t data_volume;
	FILE* benchmarkResults;

	// time measurement variables
	LARGE_INTEGER start_ticks, end_ticks, ticksPerSecond;
	double TcudaCreateContext, TcudaGetDeviceCount, TcudaGetDeviceProperties, TcudaThreadSynchronize,
		TcudaMalloc, TcudaMemcpyHostToDevice, TcudaMemcpyDeviceToHost, TcudaMemset, TcudaFree,
		TcudaThreadExit, completeTime;

	// the chunk size in MByte is a required argument and must fit into h_data
	if(argc < 2)
	{
		fprintf(stderr, "usage: %s <chunk size in MByte>\n", argv[0]);
		return 1;
	}
	megabytes = atoi(argv[1]);
	if(megabytes < 1 || (size_t)megabytes > sizeof(h_data) / MEGABYTE)
	{
		fprintf(stderr, "chunk size must be between 1 and %d MByte\n", (int)(sizeof(h_data) / MEGABYTE));
		return 1;
	}
	data_volume = (size_t)megabytes * MEGABYTE;

	benchmarkResults = fopen("CUDA_benchmark_results.csv","a");
	if(benchmarkResults == NULL)
	{
		perror("CUDA_benchmark_results.csv");
		return 1;
	}

	QueryPerformanceFrequency(&ticksPerSecond);

	// first CUDA call of the process (reported as "create context")
	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaGetDeviceCount(&cudaDeviceCount);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaCreateContext = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	// the same call again, now without the first-call cost
	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaGetDeviceCount(&cudaDeviceCount);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaGetDeviceCount = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaGetDeviceProperties (&cudaDeviceInfo, 0);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaGetDeviceProperties = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	// first cudaMalloc of the process
	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaMalloc ((void**)&d_data, data_volume);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaMalloc = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaMemset ((void*)d_data, 0, data_volume);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaMemset = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaMemcpy ((void*)d_data, (void*)h_data, data_volume, cudaMemcpyHostToDevice);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaMemcpyHostToDevice = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	// CUDA 2.3 API; later toolkits deprecate cudaThreadSynchronize and
	// cudaThreadExit in favour of cudaDeviceSynchronize and cudaDeviceReset
	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaThreadSynchronize();
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaThreadSynchronize = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaMemcpy ((void*)h_data, (void*)d_data, data_volume, cudaMemcpyDeviceToHost);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaMemcpyDeviceToHost = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaFree ((void*)d_data);
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaFree = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	QueryPerformanceCounter(&start_ticks);
	cuda_status = cudaThreadExit();
	QueryPerformanceCounter(&end_ticks);
	logCudaError(benchmarkResults, cuda_status);
	TcudaThreadExit = elapsedMs(start_ticks, end_ticks, ticksPerSecond);

	completeTime = TcudaCreateContext +
		TcudaGetDeviceCount +
		TcudaGetDeviceProperties +
		TcudaMalloc +
		TcudaMemset +
		TcudaMemcpyHostToDevice +
		TcudaThreadSynchronize +
		TcudaMemcpyDeviceToHost +
		TcudaFree +
		TcudaThreadExit;

	// one semicolon separated line per run: chunk size, the ten timings, total
	fprintf(benchmarkResults, "%d; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f; %.7f\n",
		megabytes,
		TcudaCreateContext,
		TcudaGetDeviceCount,
		TcudaGetDeviceProperties,
		TcudaMalloc,
		TcudaMemset,
		TcudaMemcpyHostToDevice,
		TcudaThreadSynchronize,
		TcudaMemcpyDeviceToHost,
		TcudaFree,
		TcudaThreadExit,
		completeTime);

	fclose(benchmarkResults);
	return 0;
}

I’ve found repeated malloc() and free() calls to be slow (and unpredictably VERY slow on occasion, as your data also shows). It’s better to allocate as much GPU memory as you’ll ever need and then manage it yourself with nasty pointer tricks. Thanks for the graphs.

You know how the driver API has an explicit cuInit call but the runtime API doesn’t? That cost has to be paid by the first CUDA call regardless of whether it’s creating a context or not.

I know that, but I assumed that the first cuda call creates the context.

The first call does the initialization (cuInit) and the function call itself. (~15 ms)

cudaMalloc does the context creation. (~ 30 ms)

Managing the memory yourself is somewhat “dirty”. You have to make sure that your data is 64-byte aligned, and if this value changes in the future, you have to check and modify your code.

I also had this idea in mind, but future-proof and clean code was more important to me. Your solution does avoid mallocs, though, which might cost additional time.
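For illustration, one way to keep the alignment concern in a single place is a tiny bump allocator that hands out aligned offsets into one big block. This is only a sketch: the names Pool, pool_alloc and POOL_ALIGNMENT are invented here, the 64-byte value is the one mentioned above, and it assumes the base pointer of the big block (for example the result of one large cudaMalloc) is itself at least that well aligned.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed alignment in bytes (must be a power of two). If the required
   value changes in the future, only this constant has to be touched. */
#define POOL_ALIGNMENT ((size_t)64)

typedef struct {
    size_t capacity; /* size of the one big allocation in bytes */
    size_t used;     /* bytes handed out so far, a multiple of POOL_ALIGNMENT */
} Pool;

/* Round n up to the next multiple of POOL_ALIGNMENT. */
size_t pool_align_up(size_t n)
{
    return (n + POOL_ALIGNMENT - 1) & ~(POOL_ALIGNMENT - 1);
}

void pool_init(Pool *p, size_t capacity)
{
    assert(p != NULL);
    p->capacity = capacity;
    p->used = 0;
}

/* Reserve `bytes` and return the aligned offset into the big block,
   or (size_t)-1 if the request does not fit. */
size_t pool_alloc(Pool *p, size_t bytes)
{
    size_t start = p->used;
    size_t size = pool_align_up(bytes);
    if (size < bytes || size > p->capacity - start) /* overflow or no room */
        return (size_t)-1;
    p->used = start + size;
    return start;
}

/* Release all sub-allocations at once; the big block stays allocated. */
void pool_reset(Pool *p)
{
    p->used = 0;
}
```

A sub-buffer would then be addressed as base pointer plus the returned offset. Individual frees are deliberately not supported, which keeps the bookkeeping trivial but only suits workloads that release everything together.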

I think the conclusion is that you need to do a parallel init. Let me explain: if you want to run a process on just one photo, just opening Photoshop or any other real program will take you a good bunch of seconds. If you want to use a GPU in that program, just initialize the environment and use it when you need it. In most real-world situations you don’t need to do a tiny task only once with some tiny executable. Those are really the only situations where I agree that the time penalties of initializing the GPU are too costly. For the most part people don’t run things for 0.8 sec, and if they do they really, really don’t care if it takes 0.8 or 0.2 sec …

I met the same issue on C2050. Thanks for the analysis!

Great checkup.
Could someone reproduce this? Unfortunately the code isn’t functional on my machine :-(
I get a fatal error every time in “_debugger_hook_dummy = 0;” @ dbghook.c

I haven’t tested the code since I wrote it more than one year ago.

In its current state the code only works on Windows, due to the
Windows-specific time measurements “QueryPerformanceCounter(&start_ticks)”, and
it consumes 1.5 GByte of memory (char h_data[1536 * MEGABYTE];).

Decrease the amount of memory if you do not have enough on your machine, and note
that you have to specify one argument when you start it (the chunk size
in MByte that is allocated on the GPU).
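If someone wants to try the benchmark on another platform, the Windows-specific timing could be swapped for something portable. The following is a rough sketch using C11 timespec_get; the helper names now_ts and elapsed_ms are made up, and TIME_UTC is a wall clock rather than a monotonic counter, so its resolution and stability are not guaranteed to match QueryPerformanceCounter.

```c
#include <assert.h>
#include <time.h>

/* Current wall-clock time as a struct timespec (C11 timespec_get). */
struct timespec now_ts(void)
{
    struct timespec ts;
    int base = timespec_get(&ts, TIME_UTC);
    assert(base == TIME_UTC); /* timespec_get returns 0 on failure */
    (void)base;
    return ts;
}

/* Difference between two readings in milliseconds. */
double elapsed_ms(struct timespec start, struct timespec end)
{
    return 1000.0 * (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1.0e6;
}
```

Each measured call would then be wrapped between two now_ts() readings, in the same way the original code wraps it between two QueryPerformanceCounter calls.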

It works, thanks!
I somehow missed the argv.