I’m working on a multi-GPU simulation that starts one pthread per GPU, and I need to pass some data (~28 KB) across 2-3 GPUs. The main() thread doesn’t manage any GPU. I have one PC with three GTX 280s and another with a GTX 280 and a GTX 480.
The first and simplest thing I did was to copy the data to a host buffer and then copy it back to wherever I need it.
Then I tried allocating the host buffer as pinned memory with cudaHostAlloc() and the flags (cudaHostAllocPortable | cudaHostAllocWriteCombined). That gave a small speedup (~5%) in the memcpys.
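For reference, the pinned staging approach described above might look roughly like the sketch below. It runs on a single device for brevity; the CHECK macro, the names h_staging and d_stripe, and the 28 KB size are assumptions made for illustration, not the actual simulation code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

int main() {
    const size_t kStripeBytes = 28 * 1024;  // assumed size of the exchanged stripe

    // Pinned staging buffer: Portable makes it usable from every context,
    // WriteCombined suits a buffer that the CPU never reads.
    void *h_staging = NULL;
    CHECK(cudaHostAlloc(&h_staging, kStripeBytes,
                        cudaHostAllocPortable | cudaHostAllocWriteCombined));

    void *d_stripe = NULL;
    CHECK(cudaMalloc(&d_stripe, kStripeBytes));
    CHECK(cudaMemset(d_stripe, 0, kStripeBytes));

    // Producer side: device memory -> pinned host buffer.
    CHECK(cudaMemcpy(h_staging, d_stripe, kStripeBytes, cudaMemcpyDeviceToHost));
    // Consumer side: pinned host buffer -> device memory. In the real setup
    // this copy would be issued by the thread that owns the other GPU.
    CHECK(cudaMemcpy(d_stripe, h_staging, kStripeBytes, cudaMemcpyHostToDevice));

    CHECK(cudaFree(d_stripe));
    CHECK(cudaFreeHost(h_staging));
    printf("staged copy done\n");
    return 0;
}
```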
Next, I wanted to see whether zero-copy buffers would speed up the transfers. Note that the CPU needs neither to read nor to write the buffer; the GPUs are the producers and consumers of the data (it’s the overlapping stripe between the partitions of my domain).
So what I tried is:
Check for deviceprops.canMapHostMemory
cudaSetDeviceFlags(cudaDeviceMapHost) at the beginning of each thread (no error is returned)
Allocation: cudaHostAlloc() with flags (cudaHostAllocMapped | cudaHostAllocPortable | cudaHostAllocWriteCombined), once
cudaHostGetDevicePointer() on the host buffer, once for each GPU, to get the device pointers of the mapped area
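The four steps above, reduced to a single device and a single thread, can be sketched as follows. The kernel fill, the CHECK macro and the buffer size are assumptions made for illustration; the call order (set device, query properties, set flags, allocate, get device pointer) follows the usual zero-copy pattern, and I have not verified it against every runtime version.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Hypothetical producer kernel: writes straight into mapped host memory.
__global__ void fill(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = (float)i;
}

int main() {
    const int kCount = 28 * 1024 / sizeof(float);  // assumed stripe size

    CHECK(cudaSetDevice(0));

    // Step 1: check that the device can map host memory.
    cudaDeviceProp props;
    CHECK(cudaGetDeviceProperties(&props, 0));
    if (!props.canMapHostMemory) {
        fprintf(stderr, "device 0 cannot map host memory\n");
        return EXIT_FAILURE;
    }

    // Step 2: enable mapping before any call that allocates or launches.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    // Step 3: allocate the mapped, portable, write-combined host buffer.
    float *h_buf = NULL;
    CHECK(cudaHostAlloc((void **)&h_buf, kCount * sizeof(float),
                        cudaHostAllocMapped | cudaHostAllocPortable |
                        cudaHostAllocWriteCombined));

    // Step 4: get the device-side alias of the host buffer (flags must be 0).
    float *d_alias = NULL;
    CHECK(cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0));

    // The kernel uses d_alias like any device pointer; no memcpy is issued.
    fill<<<(kCount + 255) / 256, 256>>>(d_alias, kCount);
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());  // writes are complete once this returns

    CHECK(cudaFreeHost(h_buf));
    printf("zero-copy write done\n");
    return 0;
}
```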
My first structure was: only the first thread allocates the pinned host memory; there is a barrier; after the barrier, each thread gets its device pointer. Unfortunately, it didn’t work: I got an “unknown error” when calling cudaHostAlloc().
The “pinned memory APIs” white paper states that portable pinned memory works both for contexts that predate the allocation and for contexts created after it, so I was fairly sure the problem was not caused by allocating before the other contexts existed. Still, I tried allocating the buffer only in the last thread, after all contexts had been created, and I got the same error.
Another attempt was to have the main thread allocate the pinned buffer after all threads had been launched, with a barrier (pthread signal and wait) so that the threads wait for the allocation before trying to get the device pointer. This failed too, but with an “invalid argument” returned by cudaHostGetDevicePointer().
Finally, I just moved the allocation into the main thread, before any of the threads were created. This worked!
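To make the working arrangement concrete, here is a rough sketch: main() allocates the mapped buffer first, then starts one pthread per GPU, and each thread asks for its own device pointer. WorkerArgs, MAX_GPUS and CHECK are hypothetical names, and tolerating cudaErrorSetOnActiveProcess is my assumption about runtimes that refuse to change flags on an already initialized device, so take it as an illustration rather than the exact code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <pthread.h>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s failed: %s\n", #call,                \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

#define MAX_GPUS 8  // hypothetical upper bound on worker threads

struct WorkerArgs {
    int   device;    // GPU handled by this thread
    void *h_shared;  // mapped buffer allocated by main()
};

static void *worker(void *p) {
    WorkerArgs *args = (WorkerArgs *)p;
    CHECK(cudaSetDevice(args->device));

    // Enable mapping for this thread's device. Some runtime versions report
    // cudaErrorSetOnActiveProcess if the device is already initialized;
    // this sketch treats that case as non-fatal (an assumption).
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
    if (err == cudaErrorSetOnActiveProcess) {
        cudaGetLastError();  // clear the recorded error
    } else {
        CHECK(err);
    }

    // Each GPU asks for its own device-side alias of the same host buffer.
    void *d_alias = NULL;
    CHECK(cudaHostGetDevicePointer(&d_alias, args->h_shared, 0));
    printf("GPU %d sees the shared buffer at %p\n", args->device, d_alias);
    return NULL;
}

int main() {
    const size_t kStripeBytes = 28 * 1024;  // assumed stripe size

    int gpu_count = 0;
    CHECK(cudaGetDeviceCount(&gpu_count));
    if (gpu_count > MAX_GPUS) gpu_count = MAX_GPUS;

    // Allocate in main(), before any worker thread exists.
    void *h_shared = NULL;
    CHECK(cudaHostAlloc(&h_shared, kStripeBytes,
                        cudaHostAllocMapped | cudaHostAllocPortable |
                        cudaHostAllocWriteCombined));

    pthread_t  threads[MAX_GPUS];
    WorkerArgs args[MAX_GPUS];
    for (int i = 0; i < gpu_count; ++i) {
        args[i].device = i;
        args[i].h_shared = h_shared;
        if (pthread_create(&threads[i], NULL, worker, &args[i]) != 0) {
            fprintf(stderr, "pthread_create failed for GPU %d\n", i);
            return EXIT_FAILURE;
        }
    }
    for (int i = 0; i < gpu_count; ++i) pthread_join(threads[i], NULL);

    CHECK(cudaFreeHost(h_shared));
    return 0;
}
```

This would be built with something like nvcc and linked against pthreads; whether it behaves the same on every CUDA version is exactly the open question of this post.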
Now I get weird data in the simulation. Maybe it’s my own mistake in indexing, or a data coherence / write conflict problem (although the buffer is split in half to act as a double buffer, with barriers to avoid conflicts). But I don’t understand some discrepancies with respect to the documentation.
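To be clear about the double buffer, the indexing is meant to work along these lines. This is a simplified sketch with hypothetical helper names (write_offset, read_offset) and an assumed 14 KB half size, not the actual simulation code: producers write one half during a step while consumers read the half written in the previous step.

```cuda
#include <cstdio>
#include <cstddef>

// Byte offset of the half that producers write during `step`.
__host__ __device__ inline size_t write_offset(int step, size_t half_bytes) {
    return (size_t)(step & 1) * half_bytes;
}

// Byte offset of the half that consumers read during `step`:
// the half that was written during the previous step.
__host__ __device__ inline size_t read_offset(int step, size_t half_bytes) {
    return (size_t)((step + 1) & 1) * half_bytes;
}

int main() {
    const size_t kHalfBytes = 14 * 1024;  // assumed: half of the ~28 KB buffer
    for (int step = 0; step < 4; ++step) {
        printf("step %d: write at %lu, read at %lu\n", step,
               (unsigned long)write_offset(step, kHalfBytes),
               (unsigned long)read_offset(step, kHalfBytes));
    }
    return 0;
}
```

The scheme assumes that every thread synchronizes its device before reaching the barrier between steps, so that a half is never read while a kernel on another GPU may still be writing to it.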
First, the documentation says that I can allocate mapped portable memory before or after any context is created, and this was not true in my case.
Second, I got an “unknown error” until I moved the allocation into the main thread, the only one that doesn’t handle any GPU!
Am I doing something wrong? Or did I misread the documentation? And, more generally, is this the right way to pass data between GPUs?