I am using 2 GPUs with pthreads, one CPU thread per GPU. I am trying to use cudaMemcpyAsync to copy from host to device on both GPUs (each with different CPU data) via CUDA streams, but this doesn’t seem to work. The code works fine when I replace cudaMemcpyAsync with cudaMemcpy.
Can cudaMemcpyAsync be used with multiple GPUs? If so, what might be causing my problem? If not, why can’t asynchronous copies be used with more than one GPU? Thanks.
I am pretty confused by this. In another program that uses just one GPU, I managed to use cudaMemcpyAsync successfully…
e.g.
pthread_create(…, func1, …);
pthread_create(…, func2, …);
func1()
{
…
cudaStream_t stream1;
cudaStream_t stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
some_gpu_function_call(…, stream1);
cudaMemcpyAsync(…, stream2);
…
}
func2()
{
some other stuff
}
Why does something like this work but not the multi-GPU version? Do I need to create more CPU threads to get cudaMemcpyAsync working on multiple GPUs?
Sorry, I totally missed the pthread_create. You’re probably not using portable pinned memory for your host-side allocations. If you pass the cudaHostAllocPortable flag to cudaHostAlloc, cudaMemcpyAsync should work fine from all child threads.
I have identified my problem. I had been using cudaMallocHost for my host-side allocations. This worked fine with a single GPU but not with multiple GPUs. Allocating the CPU memory with cudaHostAlloc and the cudaHostAllocPortable flag did the trick. Thanks.
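For anyone who finds this thread later, here is a minimal sketch of what the fix can look like. It is not the original poster's code: the buffer size N, the WorkerArgs struct, the worker function and the CHECK macro are all made up for illustration, and it assumes two GPUs with one pthread each. The only part taken from the thread is allocating the host buffers with cudaHostAlloc and cudaHostAllocPortable instead of cudaMallocHost.

```cuda
#include <cuda_runtime.h>
#include <pthread.h>
#include <stdio.h>

#define N (1 << 20)  /* elements per GPU; arbitrary size for illustration */

/* Hypothetical per-thread argument: device index and the host buffer to upload. */
typedef struct {
    int device;
    float *host_buf;
} WorkerArgs;

/* Report a failed CUDA call and leave the worker (no cleanup, this is a sketch). */
#define CHECK(call)                                               \
    do {                                                          \
        cudaError_t err_ = (call);                                \
        if (err_ != cudaSuccess) {                                \
            fprintf(stderr, "%s failed: %s\n", #call,             \
                    cudaGetErrorString(err_));                    \
            return NULL;                                          \
        }                                                         \
    } while (0)

/* One CPU thread per GPU: select the device, then copy asynchronously in a stream. */
static void *worker(void *p)
{
    WorkerArgs *args = (WorkerArgs *)p;
    float *dev_buf = NULL;
    cudaStream_t stream;

    CHECK(cudaSetDevice(args->device));
    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc((void **)&dev_buf, N * sizeof(float)));

    /* host_buf was allocated in main() as portable pinned memory. */
    CHECK(cudaMemcpyAsync(dev_buf, args->host_buf, N * sizeof(float),
                          cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamSynchronize(stream));  /* wait for the copy to finish */

    CHECK(cudaFree(dev_buf));
    CHECK(cudaStreamDestroy(stream));
    return NULL;
}

int main(void)
{
    enum { NUM_GPUS = 2 };  /* assumption: two GPUs, as in the question */
    pthread_t threads[NUM_GPUS];
    WorkerArgs args[NUM_GPUS];

    for (int i = 0; i < NUM_GPUS; ++i) {
        args[i].device = i;
        /* Portable pinned memory: treated as pinned by every CUDA context,
           not only the one that happened to allocate it. */
        cudaError_t err = cudaHostAlloc((void **)&args[i].host_buf,
                                        N * sizeof(float),
                                        cudaHostAllocPortable);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        for (int j = 0; j < N; ++j)
            args[i].host_buf[j] = (float)j;  /* dummy data */
    }

    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_create(&threads[i], NULL, worker, &args[i]);
    for (int i = 0; i < NUM_GPUS; ++i)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NUM_GPUS; ++i)
        cudaFreeHost(args[i].host_buf);
    return 0;
}
```

Something like nvcc example.cu -lpthread should build it. The allocation happens in the main thread while the copies are issued from the child threads, which is the situation where the portable flag made the difference in this thread.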
I have identified my problem. I had been using cudaMallocHost for my host-side allocations. This worked fine for single GPU but not for multi-GPU. Allocating CPU memory with cudaHostAlloc and using the cudaHostAllocPortable flag did the trick. Thanks.