I have a multi-GPU application that uses a thread pool and a memory pool. Each thread controls kernels on one of the GPUs, and the memory comes from a common pool. The application works fine as long as I don't try to do zero-copy.
If I replace the memory pool with one allocated using cudaHostAlloc, and replace the cudaMemcpy and cudaMalloc calls with cudaHostGetDevicePointer, then cudaHostGetDevicePointer fails with invalid argument (cudaErrorInvalidValue, to be precise).
Before calling cudaHostAlloc, I call cudaSetDevice and cudaSetDeviceFlags(cudaDeviceMapHost).
cudaHostAlloc was invoked with the flags cudaHostAllocMapped | cudaHostAllocPortable.
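A minimal sketch of the per-thread setup order described above (error handling trimmed; the helper name and one-thread-per-device assumption are mine, not from the original post):

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Allocate a pinned, mapped, portable pool and fetch this device's
 * alias for it. Assumes the calling thread owns exactly one device. */
void *alloc_mapped_pool(int device, size_t bytes, void **dev_ptr)
{
    void *host_ptr = NULL;

    /* cudaSetDeviceFlags must run before the context is created,
     * i.e. before the first runtime call that touches this device. */
    cudaSetDevice(device);
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned + mapped + portable: usable from all CUDA contexts. */
    cudaHostAlloc(&host_ptr, bytes,
                  cudaHostAllocMapped | cudaHostAllocPortable);

    /* Device-side pointer to the same memory (zero-copy). */
    cudaHostGetDevicePointer(dev_ptr, host_ptr, 0);
    return host_ptr;
}
```

If cudaHostGetDevicePointer still fails here, the flag ordering is the first thing to check: a context created before cudaSetDeviceFlags will not have host mapping enabled.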
I wonder whether zero-copy works in this situation, or whether I have to create a memory pool per thread.
I am using 2x Tesla C1060 and one Quadro FX 370. OS: openSUSE 11.0, kernel 126.96.36.199-0.1, NVIDIA driver 185.18.08-beta, CUDA toolkit 2.2.
Very interesting, as we will be building a multi-GPU app like this in the near future: one host memory region that is read-only and accessed by all GPUs, plus N read-write regions each allocated to a single GPU.
I don't think the Quadro FX 370 supports zero-copy. Check the canMapHostMemory member of the cudaDeviceProp structure returned by cudaGetDeviceProperties().
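A quick sketch of that check across all devices (just a diagnostic, not part of the original app):

```c
#include <cuda_runtime.h>
#include <stdio.h>

/* Print whether each installed device can map pinned host memory
 * into its address space (the prerequisite for zero-copy). */
int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d (%s): canMapHostMemory = %d\n",
               dev, prop.name, prop.canMapHostMemory);
    }
    return 0;
}
```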
cudaHostGetDevicePointer() should succeed for the C1060s. The Quadro is the likely failure point: a portable allocation may end up mapped into some devices' address spaces and not others, and cudaHostGetDevicePointer() fails on any device where the host memory is not mapped.
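One way to keep a single shared pool in that mixed setup is to branch per device: use the zero-copy pointer where canMapHostMemory is set, and fall back to an explicit device buffer plus cudaMemcpy on devices that cannot map host memory (e.g. the FX 370). The helper below is a hypothetical sketch of that idea, not code from the thread:

```c
#include <cuda_runtime.h>
#include <stddef.h>

/* Obtain a usable device pointer for a pinned, portable host buffer.
 * Sets *needs_copy when the caller must keep the device copy in sync
 * with cudaMemcpy instead of relying on zero-copy. */
cudaError_t get_device_view(int device, void *host_ptr, size_t bytes,
                            void **dev_ptr, int *needs_copy)
{
    struct cudaDeviceProp prop;
    cudaError_t err;

    cudaSetDevice(device);
    cudaGetDeviceProperties(&prop, device);

    if (prop.canMapHostMemory) {
        *needs_copy = 0;  /* zero-copy path */
        return cudaHostGetDevicePointer(dev_ptr, host_ptr, 0);
    }

    *needs_copy = 1;      /* staging path for non-mapping devices */
    err = cudaMalloc(dev_ptr, bytes);
    if (err != cudaSuccess)
        return err;
    /* The pinned source still gives fast, async-capable copies. */
    return cudaMemcpy(*dev_ptr, host_ptr, bytes,
                      cudaMemcpyHostToDevice);
}
```

This keeps one common memory pool while avoiding the per-thread pools the original poster was hoping not to need.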