fine control of memory pinning in CUDA

Hi all !

As far as i know, the current version of CUDA only lets the user request to malloc either “normal” memory or locked memory. Memory pinning is usually the common downside of user-level approach that requires high-performance memory-copying to/from a device : therefore it is not only used by GPU, but also by high-speed networking cards.

It is nowadays extremelly trendy to see people clustering nodes with GPUs inside them. Icould not experience it by myself yet, but i strongly suspect that we might have a bad time with memory-locking issues when several sub-systems do the same trick on the same data.

So here are the actual questions i was asking myself :

 - Does anyone knows if in the end, for any single memory transfer, the CUDA runtime will pin/unpin memory on the fly so that it can be DMAed ? 

 - Is there any mean by wish i could control memory locking by myself, or deferring to another sub-system, and get some interface by which  i could specify that property to CUDA ?

I guess this seems unlikely as it would certainly mean that the programmer (or more likely some runtime) needs to pass some memory translation description to the CUDA runtime…

 - Has anyone ever observed how CUDA and high-performance networks are interacting to that regards ? I can remember of a precise example where various networks could not be combined for an heterogeneous because both would lock memory in some incompatible way. There is no such known case of bad interaction ?

In the end, this smallish issue may become some nasty performance bottleneck : people have been strugling to reduce the overhead of memory registration for networks, it’s a little sad if we end up locking memory twice, while we could apply the very same techniques on CUDA that we did for networking (i’m thinking of registration cache for instance).

Sorry for that lengthy post,but if anyone has some insights about the question, i’d be really glad to hear about it !

Regards,
Cédric

if memory is not allocated as pinned memory, CUDA will set up a pinned buffer, copy the memory there, the device will DMA from that pinned buffer.

I somehow expected something like that, but thanks for that info !

Given that it is not always feasible to use cuda functions to allocate memory, that’s really bad for our cpu usage (which is a little silly when it comes to DMAing 100MB of memory). Of course this may save some syscalls but given that they do not cost anything on modern processors, i guess avoiding that copy for large enough buffers does make full sense, let us hope this may be done in the future ;)

Why is it not always feasible to allocate memory with cudamallochost? Probably because other devices want to allocate with their special functions also?

BTW, I did not find this out, wumpus posted this info. It is smart to search for his posts on the forum, combined they form a low-level-detail FAQ. :thumbup:

Ok i’ll look for those posts, you might have understood i’m rather interested in the rather low-level side :)

Indeed as you pointed out, i’m using CUDA as one solution among other, when it comes to GPGPUs it is safe to keep the heterogeneous approach not rely on a specific constructor method. Of course i could make everything possible to use those functions, but it would seriously impact the programming model as it may become incompatible with an heterogeneous platform.

I can also think of weird situations where you are not using “simple” memory regions that you allocate using a malloc, for instance when you are mapping some file or even some shared memory. That’s not that rare situations.

As a personnal point of view, i do not think this is CUDA duty to allocate memory on the host, and it would be much more reasonnable/handy to simply set properties on a memory region/buffer. But i acknowledge this lets/opens many hard problems to be solved to make sure that the memory mapping remains valid…

Thanks for your suggestions and your time

Cédric

Hey, I am with you. I am going to have to shuffle my CUDA code into a very large in-house realtime framework… Luckily some control-parts are not ready yet ;)

There is a wishlist thread where you could post this (maybe it is already there) What I read about 2.0 so far gives me the feeling that NVIDIA is looking at that list.

Maybe after the 2.0 release is out it is smart to start another wishlist thread to give NVIDIA a single place to look at for ideas for enhancements.

In at least one MPI/Infiniband library combo, the IB RDMA mechanism could fail when operating on a pinned buffer allocated by cudaMallocHost(). So in short, yes, you can have trouble. I can’t really say more than that as I haven’t done much testing on this myself yet, but I know this exact situation has cropped up in at least one case so far. I should know more in a week or two after I do more testing myself. The easiest workaround for this situation is to avoid using the pinned memory in one of the multiple APIs one is using (e.g. in CUDA, in MPI/InfiniBand).

Cheers,

John Stone

Thanks for that info, at least this shows i was not completely paranoid !

Of course to avoid bad interaction we may avoid having both subsystem to pin; but given how crucial this is for both bandwith and cpu usage … (noting that many people do not hesistate to waste cpu with their IB network, i will not flame, but that’s not how i conceive HPC :)).

Depending on the structure of one’s parallel code, it may be necessary to reformat, gather, or otherwise copy data from incoming IB RDMA buffers into a different buffer that gets sent to the GPU anyway. In these cases (our application NAMD is one such example) one doesn’t have to worry about the IB layer and CUDA fighting over VM mappings for the same page of memory. In our case, we have to do copies and reformatting of data before it gets sent to the GPU, something that’s unavoidable for us, as we have to gather together enough incoming data to be able to keep the GPU busy. For some other codes this may not be the case, and I could imagine that in such cases the need for using unpinned allocations or extra copies could be a performance hit, but I suspect that for most things that benefit greatly from the GPU, this would primarily be an issue on larger runs where the communication interconnect is the major performance bottleneck. We’ll have to see how these things develop. I’ll post my results when I do more testing on this stuff.

Cheers,
John Stone

I hope this feature very much. In the cuda plugin for matlab, it seems possible to pin a matlab array during the mex call via virtuallock function, but you cannot tell cuda that it has been pinned.

Ok i guess that it now seems that there is some signs that on the one hand some issues may have appeared already, but that the obvious workaround using memory copies / not pinning in either cuda or some other subsystem (eg. the ib driver) should help getting rid of that, and that it may be ok for small setups.

On the other hand it seems that a find control of pinning would be desirable, at least for HPC where you can’t afford to waste that much (or at least we should struggle to get rid of that “useless” overhead). Of course i’m still convinced it is something we need at least on the long term.

Also, i guess it may be worth noting some of the downside of such a proposition (which may explain why we do not have it yet). Pinning makes two main things,

  • First something like a get_user_page or so, which actually makes sure the page won’t be swapped out or so, there is no big deal about that.

  • Second, it stores the translation of all the pages, which starts to really make the problem hard : if the user wants to specify that some other driver already pinned the memory, what should it do ? I guess that provided we do not have access to the way cuda remembers such memory translations, it is hopeless that we will ever have a way to supply some translation. This should not be that big a performance tradeoff to have the translation computed once again.

While CUDA is the only user of the memory it allocated with its interface, there could be some cases where the user could change the mapping, possibly by mistake (eg. map a file A1 on buffer B, pin B, tell cuda that B was pinned, unmap A1 and map file A2 instead, at the same location). This is really a problem as it’s almost impossible to detect such memory mapping change from the cuda user-apps driver, since it has no clue of those map,unmap,free,malloc etc… things. There is certainly several possible approaches : “use it at your own risk”, or detect those changes by using glibc hooks for instance, but this may interfere with some applications that do not appreciate thoses hooks.

The question is wether using such a “risky” interface for memory registration would threaten only the user process or the entire system. High speed networking drivers seems to be using glibc hooks as it is almost the only possible solution (unless we have some notifiers directly inside the kernel), but those usually do not care about security issues.

Personnaly i’m ok with that provided i always targeted HPC problems, but i know this cannot make a consensus. So in the end, if that fine control of memory pinning feature is made available, it is pretty sure that it should be especially denoted as something “dangerous” to be manipulated with care.

Just my 2cents,

Cédric

I’ve been on the road for a few days and I’m just now catching up…
FYI, the issue with pinned memory has come up on two different clusters with Infiniband now (different MPI libraries: OpenMPI, and MVAPICH2). In the case of OpenMPI you can use a flag to disable memory pinning. I don’t know the flag, but it’s conceivable that the same is possible for MVAPICH2 And of course, you can disable the use of pinned memory in your CUDA code.

Cheers,
John Stone

The flag for OpenMPI is “–mca btl_openib_flags 1”.
You can pass it as an argument in mpirun.