Time differences


I’m developing a system using 3 threads to manage device allocations, host <-> device transfers and kernel executions for a set of tasks (the goal is to share a GPU between a set of tasks launched in a VM).

I’m using the vector addition example (from the SDK) to test the system. I created a function AddKer that is placed in a shared library and used by the system when a task needs it.

void AddKer(const float* A, const float* B, float* C)
{
	int N = 500000;

	// Invoke kernel
	int threadsPerBlock = 256;
	int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

	VecAdd<<<blocksPerGrid, threadsPerBlock>>>(A, B, C, N);
}


When the system launches the VecAdd kernel, it takes 100 milliseconds to execute.

void *libCont;
void (*funct)(void *, void *, void *);

if (!(libCont = dlopen("./vectorAdd.so", RTLD_LAZY))) {
	printf("Error dlopen : %s\n", dlerror());
	exit(EXIT_FAILURE);
}

if (!(funct = (void (*)(void *, void *, void *)) dlsym(libCont, "AddKer"))) {
	printf("Error dlsym : %s\n", dlerror());
	exit(EXIT_FAILURE);
}

funct(d_A, d_B, d_C);

When I launch the same kernel with data of the same size in a standalone program, it takes 82 microseconds. I’m trying to understand the cause of this difference, but I can’t find it. I already tried:

  • to use the shared library in the standalone program, but it doesn’t take any longer there

  • to increment the elements of d_A and d_B so they are not constant and no implicit optimization can kick in, but nothing changes

  • as my system manages the device memory (it allocates all the device memory when it’s launched and then hands pointers to available memory blocks to the tasks that need them), I thought that maybe cudaMalloc returns pointers to specific addresses so that access is optimized. I tried calling cudaMalloc for the three vectors just before the kernel invocation to benefit from this potential implicit optimization, but nothing changes.

Does somebody have an idea of why the execution time of a CUDA kernel can differ, please? I don’t understand what difference between the two use cases could explain the time gap.

Thanks for your help.

I think I found where the problem is. The time is consumed by the call to cudaThreadSynchronize() just before the kernel launch. This may be due to the multithreaded management of the GPU: as the GPU is managed by several threads, the synchronization takes more time. It’s a really big disadvantage that discourages using multithreading on a single GPU :s

Has somebody already had that kind of experience?

Best regards.