Reducing overhead Is there a way to minimize CUDA overhead

Vince33 · June 16, 2008, 6:53pm

I know this topic has been discussed before but i’d like to have some more explanation…

The time taken to memcpy one byte, execute a blank kernel and read back one byte is very big. The portion of code I timed is the following:

cuMemcpyHtoDAsync(d_mem, h_mem, 1, stream);

cuLaunchGridAsync(BlankKernel, 1, 1, stream);

cuMemcpyDtoHAsync((void*)h_mem, d_mem, 1, stream);

cuCtxSynchronize();

Using the timer from cutil.h, I get an execution time on the CPU of about 30 micro seconds. I’m running a TESLA C870 on rhel5.

For my application, I need to run a matrix multiplication under 50 us so 30 us of overhead is really bad… If I just time the two memcpy, it takes about 25 us, which is quite slow to only write and read one byte via PCIe. There has to be a way to do this faster.

Is there a way to go at a even lower level than the driver API? Can Nvidia provide information on how to access the GPU memory? Perhaps I could write my own PCIe->GPU driver… Or is there a way for another device (like a FPGA) to write and read memory on the GPU?

Any help would be greatly appreciated.

kristleifur · June 16, 2008, 8:51pm

Hope this isn’t a naive suggestion, but it’s worth mentioning:
Have you tried running a thousand iterations to eliminate startup overhead?

Just throw a for loop to 1000 around the code, to see what happens.

Vince33 · June 16, 2008, 10:26pm

Yea, that’s exactly what I’m doing, I time the average over 1000 iterations and I discard the first one which contains initialization overhead.

Thanks for the reply

James_Malcolm1 · August 9, 2008, 2:40pm

If you’re trying to transfer such small amounts of memory, you might be able to get it out to the card faster by pushing it directly onto the call stack and then accessing it via the extern shared segment. In my experiments, the extern shared segment seems to always be aligned to the next 16byte boundary after the kernel parameters.

I don’t know any such trick for getting it back faster than cudaMemcpy. Anyone else?

See if the following code works for you. It should print out ‘ffff0000 ffff0001 ffff0002…’.

#include <stdio.h>

#include <cuda.h>

__global__ void kernel(int *mem)

{

    extern __shared__ int extra[];

    

    for (int i = 0; i < 10; i++)

        mem[i] = extra[i];

}

int main(void)

{

    int n = 10, bytes = n * sizeof(int);

   int *h_data, *d_data;

   cudaMalloc((void **)&d_data, bytes);

    h_data = (int *)malloc(bytes);

   int extra[10];

    for (int i = 0; i < n; i++)

        extra[i] = i + 0xffff0000;

    cudaConfigureCall(1, n, 1<<10); // 1 kbyte for good measure

    cudaSetupArgument(&d_data,  4,  0);

    cudaSetupArgument(extra,   40, 16);

    cudaLaunch(kernel);

   cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

   for(int i = 0; i < n; i++)

        printf("%08x ", h_data[i]);

    printf("\n");

   return 0;

}

Topic		Replies	Views
Slow memory transfers CUDA Programming and Performance	7	1994	May 23, 2011
cuMemcpy Overhead cuMemcpy call has unacceptable overhead CUDA Programming and Performance	6	2991	July 31, 2008
cudaMemcpy() Best approach when you need to call it many times? CUDA Programming and Performance	8	25086	March 8, 2010
cudaMemcpy execution time CUDA Programming and Performance	5	6806	June 17, 2010
Performance in basic algorithm Why isn't faster? CUDA Programming and Performance	4	1667	January 9, 2009
How to copy small data from GPU to CPU many times efficiently? CUDA Programming and Performance	1	1242	December 11, 2014
cudaMemcpy takes 30% of my project time. CUDA Programming and Performance	5	4400	July 20, 2009
CUDA trouble CUDA Programming and Performance	3	977	March 19, 2013
Overhead using cudaMemcpyAsync CUDA Programming and Performance	5	3204	September 1, 2009
Using GPU<->CPU polling to reduce overhead CUDA Programming and Performance	12	10415	November 21, 2007

Reducing overhead Is there a way to minimize CUDA overhead

Related topics