I’m currently working on a multi-GPU CUDA program. I have up to 4 devices (and therefore 4 host threads) running the same kernel on different data.
My first step after launching the host threads is to allocate some device memory using multiple cudaMalloc calls. After that, I copy data to the device using cudaMemcpy.
I measured the time for the cudaMalloc calls and the time for the cudaMemcpy calls.
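A minimal sketch of the kind of measurement described above, for readers who want to reproduce it. It is not the original code: the worker name timeAllocAndCopy, the 64 MiB buffer size, and the use of std::thread and std::chrono for the host threads and timers are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical per-device worker: times cudaMalloc and cudaMemcpy separately.
static void timeAllocAndCopy(int device, size_t bytes) {
    using Clock = std::chrono::steady_clock;
    std::vector<char> host(bytes, 1);  // pageable host buffer to copy from

    // Bind this host thread to one GPU.
    if (cudaSetDevice(device) != cudaSuccess) {
        std::printf("GPU %d: cudaSetDevice failed\n", device);
        return;
    }

    // Depending on the CUDA version, the first runtime call that needs a
    // context may also pay for one-time initialization, so that cost can
    // end up inside this interval.
    const auto t0 = Clock::now();
    void* dev = nullptr;
    cudaError_t err = cudaMalloc(&dev, bytes);
    const auto t1 = Clock::now();
    if (err != cudaSuccess) {
        std::printf("GPU %d: cudaMalloc failed: %s\n", device, cudaGetErrorString(err));
        return;
    }

    // A plain cudaMemcpy from pageable memory blocks the host until it is done.
    err = cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    const auto t2 = Clock::now();
    if (err != cudaSuccess) {
        std::printf("GPU %d: cudaMemcpy failed: %s\n", device, cudaGetErrorString(err));
    } else {
        std::printf("GPU %d: cudaMalloc %.6f s, cudaMemcpy %.6f s\n", device,
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
    }
    cudaFree(dev);
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA devices found\n");
        return 1;
    }
    if (count > 4) count = 4;  // the setup described above uses up to 4 devices

    const size_t bytes = 64u << 20;  // assumed buffer size: 64 MiB per device
    std::vector<std::thread> workers;
    for (int d = 0; d < count; ++d) {
        workers.emplace_back(timeAllocAndCopy, d, bytes);  // one host thread per GPU
    }
    for (auto& w : workers) {
        w.join();
    }
    return 0;
}
```

Timing each call per thread, as here, makes it easier to see whether the extra time with more devices comes from every thread slowing down or from the threads waiting on each other.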
I have two questions about this:
Why are the cudaMalloc calls significantly slower than the cudaMemcpy calls? With only one device, the cudaMalloc calls take about 15 times longer than the cudaMemcpy calls. (With multiple devices it gets even worse.)
With multiple devices, the cudaMalloc times get worse. I measured the following times for the cudaMalloc calls:
1 GPU: 0.36240400 seconds
2 GPUs: 0.70018800 seconds
4 GPUs: 1.16176900 seconds
So my question is: are the cudaMalloc calls synchronized across multiple host threads, or what else could be the reason for these times?
Hope to get some answers. Greetings from Germany. Michel