Kernel dimension influences cudaMemcpy?

I am running some performance tests, using a kernel with different grid/block sizes.
I’ve re-implemented the cublasSaxpy function with my own custom kernel, to see if I can get the same speeds as the original function from the CUBLAS library. (SAXPY multiplies a vector by a scalar alpha and adds the result to another vector: y = alpha*x + y.)
I’m using two input vectors of 16777216 elements. I measure 3 stages of the execution, using the timing functions from the cutil library (cutCreateTimer, cutStartTimer, cutStopTimer).
First I measure the cudaMemcpy of the 2 vectors from host to device.
Then the actual kernel run.
Finally I measure the cudaMemcpy of the result vector from device to host.

When I create a custom kernel with 32768 blocks * 512 threads (so that each thread will calculate exactly one vector element), I get the same timing results as the original cublas function (where I measure the cublasSetVector, cublasSaxpy and cublasGetVector).
Then I modified my kernel so that it can handle N elements per thread. In this way I can create a smaller grid with less blocks and threads, just to see how this influences the execution speed.
What I see is that the actual kernel run itself takes about the same time on each run as the cublasSaxpy function (about 0.02 ms). The copying of the data from host to device also takes the same time (about 95 ms for the two vectors of 16777216 elements).
However, the timing of the cudaMemcpy of the result vector back from device to host takes longer when the grid size is decreased so that each thread has to calculate more elements!
(E.g. this takes about 45 ms for the 32768 × 512 grid, but 93 ms for a 1024 × 128 grid where each thread calculates 128 vector elements.)
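For reference, a kernel of this kind might look like the following sketch — the kernel name, argument names, and the consecutive-elements-per-thread layout are my assumptions, not the actual code from the post:

```cuda
// Sketch of a SAXPY kernel where each thread handles elemsPerThread
// consecutive elements of the vectors (assumed layout).
__global__ void mySaxpy(int n, float alpha, const float *x, float *y,
                        int elemsPerThread)
{
    int start = (blockIdx.x * blockDim.x + threadIdx.x) * elemsPerThread;
    for (int i = start; i < start + elemsPerThread && i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}

// Launch examples matching the configurations above:
// mySaxpy<<<32768, 512>>>(16777216, 2.0f, d_x, d_y, 1);    // 1 element/thread
// mySaxpy<<<1024, 128>>>(16777216, 2.0f, d_x, d_y, 128);   // 128 elements/thread
```

Both launches cover all 16777216 elements: 32768 × 512 × 1 = 1024 × 128 × 128 = 16777216.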
What I find strange is that the actual kernel execution is almost unaffected by the grid configuration, while the cudaMemcpy after the kernel execution takes longer.
I put the cutStartTimer just before the cudaMemcpy and the cutStopTimer just after. So I wonder why the memory transfer from device to host is influenced by the dimensions of the grid?
Is there an explanation for this?
How reliable are the timing functions from the cutil library?


You do use cudaThreadSynchronize() before cutStartTimer and cutStopTimer?

Seems not, if the extra time goes into the memory copy function, which synchronizes implicitly.

OK, that’s it. I’ve added an extra call to cudaThreadSynchronize() and now the memcpy speeds are the same. But if the extra time goes into the call to cudaThreadSynchronize(), can you also explain why this call takes longer after a kernel invocation with a smaller grid in which each thread calculates more elements?

OK, I think I now understand what’s happening. The actual kernel call only starts the kernel on the GPU, but returns immediately. The call to cudaThreadSynchronize() waits until the kernel has finished. So if I want to measure the time that the actual kernel run takes I have to measure the cudaThreadSynchronize() call and not the kernel call.
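So the timing pattern would look something like this sketch (the timer usage follows the cutil functions mentioned above; the kernel name and launch configuration are placeholders):

```cuda
// Sketch: timing an asynchronous kernel launch with cutil timers.
// The launch returns immediately; cudaThreadSynchronize() blocks until
// the kernel has actually finished, so the timer must wrap both calls.
unsigned int timer = 0;
cutCreateTimer(&timer);

cutStartTimer(timer);
myKernel<<<gridDim, blockDim>>>(/* args */);   // returns immediately
cudaThreadSynchronize();                       // wait for the GPU to finish
cutStopTimer(timer);

printf("kernel time: %f ms\n", cutGetTimerValue(timer));
```

Without the cudaThreadSynchronize() call, the timer only measures the launch overhead, and the remaining kernel time gets charged to whatever synchronizing call comes next (here, the cudaMemcpy).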

And probably with a smaller grid the kernel runs less efficiently, so the actual kernel run takes longer.

Is this the way it works?

Is there also a way in CUDA to install a callback function so that I can start the kernel, continue my work on the CPU, and handle the kernel results in the callback function?



Essentially correct: kernel invocations are asynchronous. To time the kernel you should start the timer before calling the kernel and stop it after cudaThreadSynchronize() returns. You could also use the built-in CUDA profiler.
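Another option is to time with CUDA events, which record timestamps on the GPU itself and so avoid the host-side synchronization pitfall entirely. A minimal sketch (kernel name and launch configuration are placeholders):

```cuda
// Sketch: timing a kernel with CUDA events instead of CPU timers.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<gridDim, blockDim>>>(/* args */);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // block until the stop event has been recorded

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```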

A way I can think of is using multiple threads: prepare and launch the kernel from one thread and do stuff in another thread while the thread responsible for the kernel waits for it to finish. Make sure you always handle GPU stuff in the same thread. As mentioned somewhere on this forum, different CPU threads have different memory contexts on the GPU.
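Within a single thread you can also poll instead of blocking. This sketch uses cudaEventQuery, which returns cudaErrorNotReady while the GPU work preceding the event is still running; no real callback API is assumed here, and doSomeCpuWork / handleKernelResults are placeholder functions:

```cuda
// Sketch: overlapping CPU work with a running kernel by polling an event.
cudaEvent_t done;
cudaEventCreate(&done);

myKernel<<<gridDim, blockDim>>>(/* args */);
cudaEventRecord(done, 0);   // becomes "ready" once the kernel finishes

// Do CPU work while polling; cudaEventQuery does not block.
while (cudaEventQuery(done) == cudaErrorNotReady) {
    doSomeCpuWork();        // placeholder for useful host-side work
}

handleKernelResults();      // safe: the kernel has finished by now
cudaEventDestroy(done);
```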