Reducing overhead: is there a way to minimize CUDA overhead?

I know this topic has been discussed before, but I'd like some more explanation…

The time taken to copy one byte to the device, execute a blank kernel, and read one byte back is surprisingly large. The portion of code I timed is the following:

cuMemcpyHtoDAsync(d_mem, h_mem, 1, stream);
cuLaunchGridAsync(BlankKernel, 1, 1, stream);
cuMemcpyDtoHAsync((void*)h_mem, d_mem, 1, stream);

Using the timer from cutil.h, I measure a CPU-side execution time of about 30 microseconds. I'm running a Tesla C870 on RHEL 5.

For my application, I need to run a matrix multiplication in under 50 us, so 30 us of overhead is really bad… If I time just the two memcpys, they take about 25 us, which is quite slow for writing and reading a single byte over PCIe. There has to be a way to do this faster.

Is there a way to go to an even lower level than the driver API? Can NVIDIA provide information on how to access GPU memory directly? Perhaps I could write my own PCIe-to-GPU driver… Or is there a way for another device (like an FPGA) to read and write memory on the GPU?

Any help would be greatly appreciated.

Hope this isn’t a naive suggestion, but it’s worth mentioning:
Have you tried running a thousand iterations to eliminate startup overhead?

Just wrap a for loop of 1000 iterations around the code and see what happens.

Yeah, that's exactly what I'm doing: I time the average over 1000 iterations and discard the first one, which contains initialization overhead.

Thanks for the reply

If you're trying to transfer such small amounts of memory, you might be able to get it out to the card faster by pushing it directly onto the kernel's call stack and then accessing it via the extern shared segment. In my experiments, the extern shared segment always seems to be aligned to the next 16-byte boundary after the kernel parameters.

I don’t know any such trick for getting it back faster than cudaMemcpy. Anyone else?

See if the following code works for you. It should print out ‘ffff0000 ffff0001 ffff0002 …’.

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

__global__ void kernel(int *mem)
{
    // The extern shared segment starts at the next 16-byte boundary
    // after the declared kernel parameters, so the extra payload pushed
    // at offset 16 below shows up here.
    extern __shared__ int extra[];

    for (int i = 0; i < 10; i++)
        mem[i] = extra[i];
}

int main(void)
{
    int n = 10, bytes = n * sizeof(int);
    int *h_data, *d_data;

    cudaMalloc((void **)&d_data, bytes);
    h_data = (int *)malloc(bytes);

    int extra[10];
    for (int i = 0; i < n; i++)
        extra[i] = i + 0xffff0000;

    cudaConfigureCall(1, n, 1<<10);     // 1 kbyte of shared memory for good measure
    cudaSetupArgument(&d_data,  4,  0); // the declared int* parameter (32-bit pointer assumed)
    cudaSetupArgument(extra,   40, 16); // extra payload pushed past the parameters
    cudaLaunch(kernel);                 // launch the configured kernel

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
        printf("%08x ", h_data[i]);
    printf("\n");

    free(h_data);
    cudaFree(d_data);
    return 0;
}