Multi-GPU and cudaMemcpyAsync

Greetings,

I am using 2 GPUs with pthreads. I am trying to use cudaMemcpyAsync from host to device on both GPUs (with different CPU data) via CUDA streams, but this doesn’t seem to work. The code works fine when I replace cudaMemcpyAsync with cudaMemcpy.

Can we use cudaMemcpyAsync with multiple GPUs? If so, what might be causing my problem? If not, why can’t we use asynchronous copies with more than one GPU? Thanks.

Uh, how are you managing contexts?

Something akin to this (forgive any mistakes in syntax):

for (int i = 0; i < TOTAL_GPU_NUM; i++)
    pthread_create(…, func, …);

func()
{
    cudaSetDevice(tid);

    cudaStream_t stream1;
    cudaStream_t stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    if (tid == 0) // GPU 0
    {
        …
        some_gpu_function_call(..., stream1);
        cudaMemcpyAsync(..., stream2);
    }
    else
    {
        …
        some_gpu_function_call(..., stream1);
        cudaMemcpyAsync(..., stream2);
    }
}

In the above code, if I replace cudaMemcpyAsync with cudaMemcpy, everything works well.


You’re not managing contexts, then. CUDART supports one device per thread; you can’t do what you’re trying to do.

Hi,

I am pretty confused by this. In another program that uses just one GPU, I managed to use cudaMemcpyAsync successfully…

e.g.

pthread_create(…, func1, …);
pthread_create(…, func2, …);

func1()
{
    cudaStream_t stream1;
    cudaStream_t stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    some_gpu_function_call(…, stream1);
    cudaMemcpyAsync(…, stream2);
}

func2()
{
    // some other stuff
}

Why does something like this work, but not the other code? Do I need to create more CPU threads to get cudaMemcpyAsync working on multiple GPUs?

Thanks.


Sorry, I totally missed the pthread_create. You’re probably not using portable pinned memory for your host-side allocations; if you add that flag to cudaHostAlloc, that should make cudaMemcpyAsync work fine from all child threads.

I have identified my problem. I had been using cudaMallocHost for my host-side allocations. That worked fine for a single GPU but not for multiple GPUs. Allocating the CPU memory with cudaHostAlloc and the cudaHostAllocPortable flag did the trick. Thanks.
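For anyone who lands on this thread later, here is a minimal sketch of the working pattern: one pthread per GPU, with the host buffer pinned as portable so every CUDA context can use it for async copies. The worker/buffer names are placeholders and error checking is abbreviated.

```c
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdlib.h>

#define N (1 << 20)
#define MAX_GPUS 2

/* One worker thread per GPU, as in the code above. */
static void *worker(void *arg)
{
    int dev = (int)(size_t)arg;
    cudaSetDevice(dev);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *d_buf;
    cudaMalloc(&d_buf, N * sizeof(float));

    /* cudaHostAllocPortable pins the pages for ALL CUDA contexts,
       so cudaMemcpyAsync works from any host thread on any device.
       Plain cudaMallocHost pinned the memory for the allocating
       context only (on the CUDA versions discussed here), which is
       why the async copy failed on the second GPU. */
    float *h_buf;
    cudaHostAlloc((void **)&h_buf, N * sizeof(float),
                  cudaHostAllocPortable);

    cudaMemcpyAsync(d_buf, h_buf, N * sizeof(float),
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
    return NULL;
}

int main(void)
{
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    if (ngpus > MAX_GPUS)
        ngpus = MAX_GPUS;

    pthread_t tids[MAX_GPUS];
    for (int i = 0; i < ngpus; i++)
        pthread_create(&tids[i], NULL, worker, (void *)(size_t)i);
    for (int i = 0; i < ngpus; i++)
        pthread_join(tids[i], NULL);
    return 0;
}
```

Note this reflects the era of this thread, where each host thread owned one device's context; on modern CUDA a single thread can switch devices freely with cudaSetDevice.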

I have identified my problem. I had been using cudaMallocHost for my host-side allocations. This worked fine for single GPU but not for multi-GPU. Allocating CPU memory with cudaHostAlloc and using the cudaHostAllocPortable flag did the trick. Thanks.