I am running some performance tests, using a kernel with different grid/block sizes.
I’ve re-implemented my own version of the cublasSaxpy function using a custom kernel, to see if I can match the speed of the original function from the CUBLAS library. (This function multiplies a vector by a scalar alpha and adds the result to another vector: y = alpha * x + y.)
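Roughly, the kernel looks like this (a simplified sketch, not my exact code; variable and kernel names are illustrative):

```cuda
// Sketch of a saxpy kernel: one thread per vector element.
// Computes y[i] = alpha * x[i] + y[i].
__global__ void saxpy_kernel(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = alpha * x[i] + y[i];
}

// Launched with 32768 blocks of 512 threads:
// 32768 * 512 = 16777216 threads, one per element.
// saxpy_kernel<<<32768, 512>>>(16777216, alpha, d_x, d_y);
```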
I’m using two input vectors of 16777216 elements. I measure three stages of the execution, using the timing functions from the cutil library (cutCreateTimer, cutStartTimer, cutStopTimer).
First I measure the cudaMemcpy of the 2 vectors from host to device.
Then the actual kernel run.
Finally I measure the cudaMemcpy of the result vector from device to host.
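Each stage is timed with the same pattern; for the device-to-host copy it looks roughly like this (a sketch with error checking omitted; h_y, d_y and n are placeholder names):

```cuda
// Sketch of timing one stage with the cutil timer API.
unsigned int timer = 0;
cutCreateTimer(&timer);

cutStartTimer(timer);
cudaMemcpy(h_y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
cutStopTimer(timer);

printf("D2H copy: %f ms\n", cutGetTimerValue(timer));
cutDeleteTimer(timer);
```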
When I launch my custom kernel with 32768 blocks * 512 threads (so that each thread calculates exactly one vector element), I get the same timing results as with the original CUBLAS function (where I measure the cublasSetVector, cublasSaxpy and cublasGetVector calls).
Then I modified my kernel so that it can handle N elements per thread. This way I can create a smaller grid with fewer blocks and threads, just to see how this influences the execution speed.
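The N-elements-per-thread variant is essentially this (again a sketch; my real code may map threads to elements slightly differently):

```cuda
// Sketch: each thread processes `perThread` consecutive elements,
// so the grid can be made smaller by the same factor.
__global__ void saxpy_kernel_n(int n, int perThread, float alpha,
                               const float *x, float *y)
{
    int tid   = blockIdx.x * blockDim.x + threadIdx.x;
    int start = tid * perThread;
    for (int k = 0; k < perThread; ++k) {
        int i = start + k;
        if (i < n)
            y[i] = alpha * x[i] + y[i];
    }
}

// E.g. a 1024 x 128 grid gives 1024 * 128 = 131072 threads, and
// 16777216 / 131072 = 128 elements per thread:
// saxpy_kernel_n<<<1024, 128>>>(16777216, 128, alpha, d_x, d_y);
```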
What I see is that the actual kernel run itself takes about the same time for every configuration, comparable to the cublasSaxpy call (about 0.02 ms). Copying the data from host to device also takes the same time in every run (about 95 ms for the two vectors of 16777216 elements).
However, the timing of the cudaMemcpy of the result vector back from device to host gets longer as the grid size is decreased, i.e. as each thread has to calculate more elements!
(E.g. this copy takes about 45 ms for the 32768 * 512 grid, but 93 ms for a 1024 * 128 grid, where each thread calculates 128 vector elements.)
What I find strange is that the actual kernel execution is hardly influenced by the grid configuration, yet the cudaMemcpy after the kernel execution takes longer.
I put the cutStartTimer call just before the cudaMemcpy and the cutStopTimer call just after it. So why is the memory transfer from device to host influenced by the dimensions of the grid?
Is there an explanation for this?
How reliable are the timing functions from the cutil library?