Pinned memory is wonderful. For large data sets it reduces the data transfer from being the only significant bottleneck to merely a big bottleneck. :-)
That’s all well and good for a “lab” demonstration or benchmark, where we can assume the data is optimally arranged for GPU computing. What happens when we want to integrate this with production code and the application’s data is stored in, for example, a std::vector?
The gap between application memory and pinned memory is a general problem. Obviously we can allocate some pinned memory, copy the data into it, and transfer to the GPU from the pinned buffer. Conveniently, we can get a float* from a vector, so at least the host-side copy is efficient in this case. But the copy is still “lost time”, and we may not have spare memory to hold a second copy of the data.
A general solution to the general problem is to modify the application to (optionally) use pinned memory for data that we’ll need to move to the GPU. Whether that makes sense is a whole discussion in itself. The question here is: is it even feasible if the application uses std::vector?
It looks like (in theory) the STL should allow this through a custom “allocator” class, but writing one also looks complex and tricky to get right. Has anyone tried it?
Yes, there’s a pinned memory allocator in Thrust. Try using host_vector with pinned_allocator. I haven’t tried the allocator with std::vector, but I imagine it will work.
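For anyone who wants to see what a hand-rolled allocator for std::vector might look like, here is a minimal sketch, assuming C++11 or later (where std::allocator_traits fills in most of the boilerplate). The names PinnedAllocator, pinned_vector, demo_sum and the PINNED_USE_CUDA macro are made up for illustration; they are not part of CUDA or Thrust. With the macro defined, the allocator calls cudaMallocHost and cudaFreeHost; without it, it falls back to malloc and free so the sketch still compiles on a machine with no CUDA toolkit. I have not benchmarked this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

#ifdef PINNED_USE_CUDA   // hypothetical macro: define it when building against the CUDA runtime
#include <cuda_runtime.h>
#endif

// Minimal C++11 allocator sketch (hypothetical name). With PINNED_USE_CUDA it
// hands out page-locked host memory; otherwise it falls back to malloc/free.
template <typename T>
struct PinnedAllocator {
    using value_type = T;

    PinnedAllocator() noexcept = default;
    template <typename U>
    PinnedAllocator(const PinnedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n == 0) return nullptr;
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        void* p = nullptr;
#ifdef PINNED_USE_CUDA
        if (cudaMallocHost(&p, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
#else
        p = std::malloc(n * sizeof(T));   // fallback: ordinary pageable memory
        if (p == nullptr) throw std::bad_alloc();
#endif
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
#ifdef PINNED_USE_CUDA
        cudaFreeHost(p);
#else
        std::free(p);
#endif
    }
};

// The allocator is stateless, so any two instances are interchangeable.
template <typename T, typename U>
bool operator==(const PinnedAllocator<T>&, const PinnedAllocator<U>&) noexcept { return true; }
template <typename T, typename U>
bool operator!=(const PinnedAllocator<T>&, const PinnedAllocator<U>&) noexcept { return false; }

// Convenience alias (hypothetical name).
template <typename T>
using pinned_vector = std::vector<T, PinnedAllocator<T>>;

// Usage sketch: build a pinned_vector, read it through a raw pointer, sum it.
inline float demo_sum(std::size_t n) {
    pinned_vector<float> v(n, 1.0f);   // n elements, each 1.0f
    float* raw = v.data();             // same raw-pointer access as std::vector
    assert(n == 0 || raw != nullptr);
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.size(); ++i) sum += raw[i];
    return sum;
}
```

Two caveats on this design. pinned_vector<T> is a different type from std::vector<T>, so existing function signatures that take std::vector<float>& would have to change or become templates. Also, every reallocation on growth goes through the pinned allocator again, so calling reserve() up front is probably worthwhile.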