Prefetch to multiple GPUs?

I have an application that uses unified memory for a large read-only reference array that will be used on multiple GPUs. Can I use cudaMemPrefetchAsync to prefetch the data to the multiple GPUs before I start my realtime loop? i.e. will each GPU get a copy of the data, or will only the last GPU I prefetch to get it?

Quoting the API doc for cudaMemPrefetchAsync:

By default, any mappings to the previous location of the migrated pages are removed and mappings for the new location are only setup on dstDevice. The exact behavior however also depends on the settings applied to this memory range via cudaMemAdvise as described below:

  • If cudaMemAdviseSetReadMostly was set on any subset of this memory range, then that subset will create a read-only copy of the pages on dstDevice.

Since you already have an implementation, it should be simple to compare the metrics for both cases.
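Putting the two calls together, a minimal sketch might look like the following. It assumes the reference array is allocated with cudaMallocManaged and populated on the host; the size and names here are placeholders, and error checking is omitted for brevity:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

int main() {
    const size_t N = 1 << 24;          // placeholder size
    float *ref = nullptr;
    cudaMallocManaged(&ref, N * sizeof(float));
    for (size_t i = 0; i < N; ++i) ref[i] = 0.0f;  // populate on host

    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Mark the range read-mostly so each prefetch creates a read-only
    // copy on the target GPU instead of migrating the pages away from
    // wherever they currently reside. The device argument is ignored
    // for cudaMemAdviseSetReadMostly.
    cudaMemAdvise(ref, N * sizeof(float), cudaMemAdviseSetReadMostly, 0);

    // Prefetch a copy to every GPU before entering the realtime loop.
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaMemPrefetchAsync(ref, N * sizeof(float), dev, /*stream=*/0);
    }
    cudaDeviceSynchronize();  // ensure prefetches finish before the loop

    // ... realtime loop launching read-only kernels on each GPU ...

    cudaFree(ref);
    return 0;
}
```

Note that any write to the range after cudaMemAdviseSetReadMostly invalidates the duplicated copies, so the advice fits the read-only use case described above.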

Thanks, I should have spotted that section in the guide. That should do exactly what I need!