Multiple processes polling one GPU

As the application I’m working on is currently set up, every MPI process on an SMP node needs to send requests to a single GPU.

In this app, each MPI process selects a section of a global worklist based on its rank and processes it. However, letting all 4 local MPI processes poll the one C1060 appears to be much more expensive than letting a single core do the polling (by giving one core all the work and leaving the others idle).
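To make the two setups concrete, here is a rough Python sketch of the two ways of handing out the worklist. This is not the actual app code: `block_slice`, `gpu_worklist`, and the `funnel` flag are made-up names, and it assumes consecutive ranks sit on the same node and that ranks fill nodes evenly.

```python
def block_slice(n_items, n_workers, worker):
    """Return (start, stop) of the contiguous block owned by `worker`."""
    base, extra = divmod(n_items, n_workers)
    # The first `extra` workers each take one additional item.
    start = worker * base + min(worker, extra)
    stop = start + base + (1 if worker < extra else 0)
    return start, stop


def gpu_worklist(n_items, n_ranks, rank, ranks_per_node=4, funnel=False):
    """Slice of the global worklist that `rank` submits to the GPU.

    funnel=False: every rank takes its own block, so all local ranks
                  talk to the node's GPU.
    funnel=True:  only the first rank on each node takes work (the whole
                  node's share); the other local ranks get an empty slice.
    """
    if not funnel:
        return block_slice(n_items, n_ranks, rank)

    # Assumption: ranks are placed node by node and fill nodes evenly.
    n_nodes = n_ranks // ranks_per_node
    node, local_rank = divmod(rank, ranks_per_node)
    if local_rank != 0:
        return (0, 0)  # idle rank: nothing to send to the GPU
    return block_slice(n_items, n_nodes, node)


if __name__ == "__main__":
    # One node, 4 ranks, 100 work items.
    for r in range(4):
        print(r, gpu_worklist(100, 4, r), gpu_worklist(100, 4, r, funnel=True))
```

With `funnel=False` each of the 4 ranks owns a quarter of the list; with `funnel=True` rank 0 owns everything and ranks 1 to 3 idle, which is the configuration that runs faster for me.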

I’ve been warned against having several processes share one GPU, and I see quite a bit of discussion on the forum from other people who have seen the same slowdown. But what does the context switching actually cost? Changing the worklists around in this app shouldn’t be a problem (they’ll have to change anyway), so we aren’t desperate for a fix, but is this something that might go away in the future?

Ben