Hi all!
As far as I know, the current version of CUDA only lets the user request to malloc either “normal” (pageable) memory or page-locked (pinned) memory. Memory pinning is the usual technique for any user-level approach that needs high-performance memory copying to/from a device: it is therefore used not only by GPUs, but also by high-speed networking cards.
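For what it's worth, here is a minimal sketch of the two allocation paths I mean (plain malloc vs. cudaMallocHost); error checking is omitted for brevity:

```c
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t size = 32 * 1024 * 1024;
    float *pageable, *pinned, *dev;

    cudaMalloc((void **)&dev, size);

    /* Pageable host memory: the runtime cannot DMA from it directly,
       so the copy has to go through an internal staging buffer
       (or pin/unpin the pages on the fly). */
    pageable = (float *)malloc(size);
    cudaMemcpy(dev, pageable, size, cudaMemcpyHostToDevice);

    /* Pinned (page-locked) host memory: allocated by the runtime
       itself, directly DMA-able, so copies are faster. */
    cudaMallocHost((void **)&pinned, size);
    cudaMemcpy(dev, pinned, size, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dev);
    return 0;
}
```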
It is nowadays extremely trendy to cluster nodes with GPUs inside them. I could not experiment with it myself yet, but I strongly suspect that we might have a bad time with memory-locking issues when several sub-systems play the same trick on the same data.
So here are the actual questions I was asking myself:
- Does anyone know whether, in the end, for any single memory transfer, the CUDA runtime will pin/unpin memory on the fly so that it can be DMAed?
- Is there any means by which I could control memory locking myself (or defer it to another sub-system), and some interface by which I could tell CUDA about that property? (See the hypothetical sketch after these questions.)
I guess this seems unlikely, as it would certainly mean that the programmer (or, more likely, some runtime) needs to pass some memory-translation description to the CUDA runtime…
- Has anyone ever observed how CUDA and high-performance networks interact in that regard? I remember a specific case where two networks could not be combined in a heterogeneous setup because both locked memory in incompatible ways. Is there really no known case of such bad interaction?
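To make the second question concrete, here is a purely hypothetical sketch of the kind of interface I am wishing for. Nothing like these two entry points exists in CUDA as far as I know; the strawman bodies just use POSIX mlock/munlock to make the idea tangible (a real version would also have to hand the pinned page list to the driver for DMA):

```c
#include <sys/mman.h>
#include <cuda_runtime.h>

/* HYPOTHETICAL entry points -- invented names, not part of any CUDA
   release I know of; mlock/munlock stand in for real registration. */
static int myPinHostMemory(void *ptr, size_t size)   { return mlock(ptr, size); }
static int myUnpinHostMemory(void *ptr, size_t size) { return munlock(ptr, size); }

void transfer(float *buf, size_t size, float *dev)
{
    /* The buffer was allocated elsewhere (maybe already registered
       by the network stack); I just declare it pinned... */
    myPinHostMemory(buf, size);

    /* ...and I would want CUDA to trust that and DMA directly,
       instead of staging or re-pinning behind my back. */
    cudaMemcpy(dev, buf, size, cudaMemcpyHostToDevice);

    myUnpinHostMemory(buf, size);
}
```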
In the end, this smallish issue may become a nasty performance bottleneck: people have been struggling to reduce the overhead of memory registration for networks, and it would be a little sad if we ended up locking memory twice when we could apply the very same techniques to CUDA that we did for networking (I'm thinking of registration caches, for instance; see the sketch below).
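For reference, a minimal sketch of the registration-cache idea as networking stacks use it, again assuming mlock as the registration primitive and a fixed-size table (a real cache would also have to notice when pages are freed or remapped):

```c
#include <stddef.h>
#include <sys/mman.h>

#define CACHE_SLOTS 64

struct reg_entry { void *ptr; size_t size; };
static struct reg_entry cache[CACHE_SLOTS];

/* Registration cache: pin a buffer only the first time we see it,
   and reuse the existing registration on subsequent transfers. */
int pin_cached(void *ptr, size_t size)
{
    int i, free_slot = -1;

    for (i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].ptr == ptr && cache[i].size >= size)
            return 0;                 /* cache hit: already pinned */
        if (cache[i].ptr == NULL && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                    /* cache full: caller must pin itself */

    if (mlock(ptr, size) != 0)        /* cache miss: pin and remember */
        return -1;
    cache[free_slot].ptr  = ptr;
    cache[free_slot].size = size;
    return 0;
}
```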
Sorry for that lengthy post, but if anyone has some insights about these questions, I'd be really glad to hear them!