Advantages/Disadvantages of using pinned memory

I read this blog:

http://devblogs.nvidia.com/parallelforall/how-optimize-data-transfers-cuda-cc/

but still have some basic questions about the tradeoff between using pinned/pageable memory.

After trying an application that allocates about 1 GB of host memory and 0.5 GB of device memory, I see a huge difference in device-to-host and host-to-device copy performance between using (host-side) malloc and cudaMallocHost: roughly a 10x difference host-to-device and a 4x difference device-to-host.

When using pinned memory I am getting up to 13.4 GB/s for host-to-device and device-to-host copies, which is great, but I am unclear on how that type of memory designation will affect resources compared to using pageable memory.

In other words, if I do opt to use pinned memory, will it reduce any capacity on the device side? Since I am using a GTX 780 Ti for this app, I really cannot afford to negatively impact its computational or memory resources. I will be using at least 66% of the device memory, not counting the memory used by Windows 7.

This quote talks about the tradeoffs on the host side (which I am not concerned about, since there will be plenty of CPU RAM and no other programs running besides basic Windows OS services):

“You should not over-allocate pinned memory. Doing so can reduce overall system performance because it reduces the amount of physical memory available to the operating system and other programs.”

but what about any effects in CUDA land?

The ramifications of pinning are almost entirely on the host side. The only device-side impacts I can think of are:

  • faster H2D and D2H transfers
  • enabling concurrency between H2D/D2H transfers, host activity, and device kernels (see the sketch after this list)
  • on systems where UVA is in effect (e.g. 64-bit), pinned allocations are automatically "mapped", meaning they take up GPU address space; but newer GPUs like your GTX 780 Ti have a 40-bit address space, so it shouldn't matter unless you are pinning ~512GB of memory or more
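
To illustrate the second bullet, here is a minimal sketch of the overlap-enabling pattern. The trivial kernel and the sizes are my own illustrative choices, not anything from this thread:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n)   // trivial kernel, for illustration only
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // pinned host buffer
    cudaMalloc(&d, N * sizeof(float));
    for (int i = 0; i < N; i++) h[i] = (float)i;

    cudaStream_t s;
    cudaStreamCreate(&s);
    // because h is pinned, these copies are asynchronous with respect to the
    // host thread, so they can overlap host work and activity in other streams
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, s);
    scale<<<(N + 255) / 256, 256, 0, s>>>(d, N);
    cudaMemcpyAsync(h, d, N * sizeof(float), cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);

    printf("h[1] = %f\n", h[1]);   // expect 2.0
    cudaStreamDestroy(s);
    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}

With a pageable buffer, the same cudaMemcpyAsync calls generally cannot overlap, because the runtime has to stage the data through an internal pinned buffer.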

There shouldn’t be any “collateral” impacts on device performance or behavior.

The only real drawbacks to pinning memory are the reduction in physical RAM available to the host demand-paging system, and the time it takes to pin the memory (which is significantly longer than an ordinary malloc of the same size).

I am confused about the pinned memory/pageable memory tradeoff.

At some point, you have to copy content from pageable memory into the pinned memory you allocated. Isn't that exactly what cudaMemcpy does when it transfers pageable memory to an internal staging buffer and then to device DRAM? Doesn't the host-side copy also reduce performance, even though the data transfer itself is faster?

Why? Instantiate your data in pinned memory in the first place.

Furthermore, pinned memory is required for overlap of copy and compute. In other situations it may or may not provide a (performance) benefit. You should benchmark a comparison to see if it is better to use pinned memory or not.

If you have a very simple situation, like the vectorAdd sample code, pinned memory will probably not provide much if any benefit: the OS time cost to pin the memory offsets the speedup in transfer, since you only use the buffers once in that case. But when there is buffer reuse, pinned memory can often be a win, performance-wise.
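
For what it's worth, a rough comparative harness along these lines is one way to run that benchmark (the buffer size and rep count are arbitrary choices for illustration; timing the cudaMallocHost call itself would additionally expose the pinning cost):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// average H2D bandwidth in GB/s over `reps` copies from `src`
float h2d_gbps(void *dst, void *src, size_t bytes, int reps)
{
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int r = 0; r < reps; r++)
        cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return (float)(bytes / 1.0e9 * reps / (ms / 1.0e3));
}

int main()
{
    const size_t bytes = 256u << 20;   // 256 MB, arbitrary
    const int reps = 10;               // buffer reuse amortizes the pinning cost
    void *d, *pageable, *pinned;
    cudaMalloc(&d, bytes);
    pageable = malloc(bytes);
    cudaMallocHost(&pinned, bytes);
    memset(pageable, 0, bytes);        // touch the pages before timing
    memset(pinned, 0, bytes);
    printf("pageable: %.1f GB/s\n", h2d_gbps(d, pageable, bytes, reps));
    printf("pinned:   %.1f GB/s\n", h2d_gbps(d, pinned, bytes, reps));
    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(d);
    return 0;
}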

https://devblogs.nvidia.com/how-optimize-data-transfers-cuda-cc/

My experience is very limited, so maybe I just haven't stumbled upon the right examples, but it seems like all the ones I see cudaMallocHost some chunk of pinned memory and then memcpy the needed data into it. I am not sure how to instantiate my data directly at allocation time.

const int ds = 10;
int *data, *d_data;
cudaMalloc(&d_data, ds*sizeof(data[0]));                         // device destination buffer
cudaHostAlloc(&data, ds*sizeof(data[0]), cudaHostAllocDefault);  // pinned host buffer
for (int i = 0; i < ds; i++) data[i] = i;                        // populate the pinned buffer directly
cudaMemcpy(d_data, data, ds*sizeof(data[0]), cudaMemcpyHostToDevice);

no buffer copying

Please think about the implication of this carefully before responding "but my data is already in a buffer".

It had to get into that buffer somehow. And that buffer had to be allocated somehow.

At the point of allocation, allocate whatever starting buffer you have as a pinned buffer, and from there on out use it exactly as you would have used your ordinary buffer.
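
For example (a sketch under my own assumptions; input.bin and the sizes are placeholders, not anything from your application), if your data originally arrived via a file read into a malloc'ed buffer, the only changes are the allocation and free calls:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    const size_t n = 1 << 20;   // element count, placeholder
    float *buf, *d_buf;
    cudaMallocHost(&buf, n * sizeof(float));   // was: buf = (float *)malloc(n * sizeof(float));
    cudaMalloc(&d_buf, n * sizeof(float));

    FILE *f = fopen("input.bin", "rb");        // hypothetical data source
    if (f) {
        fread(buf, sizeof(float), n, f);       // data lands directly in pinned memory
        fclose(f);
    }
    cudaMemcpy(d_buf, buf, n * sizeof(float), cudaMemcpyHostToDevice);  // one copy, no host-to-host memcpy

    cudaFree(d_buf);
    cudaFreeHost(buf);   // was: free(buf);
    return 0;
}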

I also acknowledge up front that this may not fit every case. I'm not suggesting that pinned buffers are a panacea, a universal drop-in replacement with no issues that is automatically faster. I'm not claiming that. I'm simply trying to point out one or two possible methods by which you might be able to dispense with buffer copying.

If it doesn’t fit your use case, it doesn’t fit. Again, I’m not suggesting pinned memory is a universal win. But it can be a win in some cases. It’s OK if you don’t believe me. It’s usually better for people to develop comparative test cases and benchmark things themselves. This gives the most direct answer possible.

If you’ve already concluded that in your application, the only way to use a pinned buffer is by copying data from an already-existing buffer into it, then it may not be a win. I’m not claiming or arguing that.

I am not trying to argue that you are suggesting this is a universal solution, nor do I feel I understand enough to conclude it is not for my application. I am simply trying to get a more holistic understanding. I felt like I was missing something, because this was repeatedly suggested everywhere, yet I couldn't find any posts addressing my specific concerns.

In your example, you use a loop to introduce content into your variable, data. I am not sure how the total performance differs between (a) allocating pinned memory + running the loop to populate it + copying to the device, versus (b) populating pageable memory and calling cudaMemcpy, which copies the pageable memory to a pinned staging buffer and then to the device?

Is what you are doing different from the example I attached, besides using cudaHostAlloc instead of cudaMallocHost? How do they differ in terms of performance?