Questions about runtime for cudaMalloc and cudaMemcpy


I’m currently working on a multi-GPU CUDA program. I have up to 4 devices (and therefore 4 host threads) running the same kernel on different data.

My first step after launching the host threads is to allocate some device memory using multiple cudaMalloc calls. After that I copy data to the device using cudaMemcpy.

I measured the time for the cudaMalloc calls and the time for the cudaMemcpy calls.

I have two questions about this:


Why are the cudaMalloc calls significantly slower than the cudaMemcpy calls? With only one device, the cudaMalloc calls take about 15 times longer than the cudaMemcpy calls. (With multiple devices it gets even worse.)

If I work with multiple devices, the cudaMalloc times get worse. I measured the following times for the cudaMalloc calls:

1 GPU: 0.36240400 seconds

2 GPUs: 0.70018800 seconds

4 GPUs: 1.16176900 seconds

So my question is: are the cudaMalloc calls synchronized across multiple host threads, or what else could explain these times?

Hope to get some answers. Greetings from Germany. Michel

I am going to guess that you are using Windows (maybe Vista)? What I am guessing you are seeing is the overhead of establishing a context on each GPU (it looks like about 300 ms per GPU). I would also guess that subsequent mallocs will be much, much faster; it is just the first operation on each context that is slow.

But all of this is just a wild guess. If it is any help, that doesn’t happen in any of the Linux versions of CUDA I have tried.
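
One way to check the guess above is to time context creation separately from the first real cudaMalloc and the cudaMemcpy. The following is only a rough sketch, not the original poster's code: the names `seconds` and `worker` and the 64 MiB buffer size are made up for illustration, and it assumes a reasonably recent CUDA toolkit with C++11 support (so `std::thread` and `cudaDeviceSynchronize` are available). It relies on the common idiom of calling `cudaFree(0)` to force the context to be set up before anything is measured.

```cuda
#include <cuda_runtime.h>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical helper: wall-clock seconds spent inside a callable.
template <typename F>
static double seconds(F&& f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Hypothetical per-device worker: times context setup, cudaMalloc and
// cudaMemcpy as three separate phases.
static void worker(int dev, size_t bytes) {
    std::vector<char> host(bytes, 1);
    void* d = nullptr;
    cudaError_t err = cudaSuccess;

    // Phase 1: select the device and force context creation.
    double t_init = seconds([&] {
        err = cudaSetDevice(dev);
        if (err == cudaSuccess) err = cudaFree(0);
    });
    if (err != cudaSuccess) {
        printf("GPU %d: init failed: %s\n", dev, cudaGetErrorString(err));
        return;
    }

    // Phase 2: first real allocation, with the context already in place.
    double t_malloc = seconds([&] { err = cudaMalloc(&d, bytes); });
    if (err != cudaSuccess) {
        printf("GPU %d: cudaMalloc failed: %s\n", dev, cudaGetErrorString(err));
        return;
    }

    // Phase 3: host-to-device copy; synchronize so the copy has finished
    // before the timer stops.
    double t_copy = seconds([&] {
        err = cudaMemcpy(d, host.data(), bytes, cudaMemcpyHostToDevice);
        if (err == cudaSuccess) err = cudaDeviceSynchronize();
    });
    if (err != cudaSuccess) {
        printf("GPU %d: cudaMemcpy failed: %s\n", dev, cudaGetErrorString(err));
    } else {
        printf("GPU %d: init %.6f s, malloc %.6f s, memcpy %.6f s\n",
               dev, t_init, t_malloc, t_copy);
    }
    cudaFree(d);
}

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess || n == 0) {
        printf("no CUDA device found\n");
        return 0;
    }
    // One host thread per device, as in the original setup.
    std::vector<std::thread> threads;
    for (int dev = 0; dev < n; ++dev)
        threads.emplace_back(worker, dev, size_t(64) << 20);  // 64 MiB each
    for (auto& t : threads) t.join();
    return 0;
}
```

If the guess is right, most of the time should show up in the "init" column and the malloc itself should be comparatively cheap. If the init times also grow with the number of GPUs, that would suggest the context setup is being serialized across host threads, but the sketch alone cannot prove why.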