CUDA 4.0 cudaHostAlloc

Hi folks,

here’s my problem:

tracedata = new char[(size_t)NumberOfTraces * NUMOFPOINTS]; // (char*)malloc(NumberOfTraces*NUMOFPOINTS*100);

if ((error = cudaHostAlloc((void**)&tracedata, (size_t)NumberOfTraces * NUMOFPOINTS,
                           cudaHostAllocPortable | cudaHostAllocWriteCombined)) != cudaSuccess)
{
	matrixsize = (NumberOfTraces / 1024) * (NUMOFPOINTS / 1024);

	printf("Error 2: Cannot reserve the approx. %.0f MB of memory\n", (double)matrixsize);
	printf("when trying to reserve memory for %lu traces\n", (unsigned long)NumberOfTraces);
	printf("with %lu data points. Reduce your data size?\n", (unsigned long)NUMOFPOINTS);
	printf("Cuda says: %s\n", cudaGetErrorString(error));

	MessageBox(0, "...just to read the outputs", "Pause here...", 0);
	exit(2); // Error 2: Out of Mem
}


generates an “out of memory” error.

The system has 48 GB of host RAM (around 30GB required for this array). Rebooting didn’t help (my first idea was heavy memory fragmentation).

OS is Win7 x64, CUDA 4.0 is used.

It just seems as if Cuda doesn’t allow allocation of such a large array using cudaHostAlloc.

Any ideas what to try?

My plan was

  1. “Load all data to Hostmem”

  2. Then distribute smaller chunks to 4 Teslas and all CPUs

in other words: get rid of HDD accesses to reload stuff.

But step 1 requires a large array, and step 2 requires pinned memory for reasonable speed.
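For what it’s worth, the two-step plan could be sketched roughly like this: keep the huge array pageable, and use a modest pinned staging buffer per GPU for the async transfers. This is only a hedged sketch under made-up assumptions — `NUM_GPUS`, `CHUNK_SIZE`, and the one-chunk-per-GPU split are placeholders, not values from the original code:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical sizes -- placeholders, not the poster's actual values.
#define NUM_GPUS   4
#define CHUNK_SIZE ((size_t)256 * 1024 * 1024)  // 256 MB per transfer

int main(void)
{
    // Step 1: keep the bulk of the data in ordinary pageable host memory,
    // avoiding the need to pin the whole ~30 GB array.
    char *alldata = (char*)malloc((size_t)NUM_GPUS * CHUNK_SIZE);
    if (!alldata) { fprintf(stderr, "host malloc failed\n"); return 1; }

    // Step 2: a modest pinned staging buffer per GPU, so cudaMemcpyAsync
    // can still run at full speed and overlap with CPU work.
    for (int dev = 0; dev < NUM_GPUS; ++dev) {
        cudaSetDevice(dev);

        char *staging, *d_buf;
        cudaHostAlloc((void**)&staging, CHUNK_SIZE, cudaHostAllocPortable);
        cudaMalloc((void**)&d_buf, CHUNK_SIZE);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Copy this GPU's chunk into the pinned buffer, then ship it async.
        memcpy(staging, alldata + (size_t)dev * CHUNK_SIZE, CHUNK_SIZE);
        cudaMemcpyAsync(d_buf, staging, CHUNK_SIZE,
                        cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);
        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(staging);
    }
    free(alldata);
    return 0;
}
```

In a real pipeline the `memcpy` into the staging buffer and the `cudaMemcpyAsync` out of it would be double-buffered so they overlap; this sketch just shows the basic shape.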



Ok, by now I realized that searching for “pinned memory” restrictions is much more informative than searching for “large array” + CUDA buzzwords.

So I have to reformulate my question:
What can I do to make this as fast as possible?
I have 4 6GB Teslas to fill.

But (if I understood correctly) without pinned memory I can neither copy asynchronously nor transfer at full 16x speed.
Any pain relief available? Are there ways to increase the amount of pinned memory available?

Are you using the TCC driver?

Yes. At least I think so. Maybe =)

I’m using the devdriver_4.0_winvista-win7_64_270.32_general.exe published with CUDA4 RC.

Furthermore I’ve set AdapterType to 2 for all 4 Teslas.

I’m not 100% sure if this already counts as “TCC driver” or if it is just another “close, but no banana” setting.

I’m using the CUDA4RC driver, as I’m not sure about compatibility of other drivers (from the normal website) with CUDA4.

hm, what if you don’t allocate as write combined? (generally you should never be doing that anyway)

I will test that and report back. I’m not sure whether I manage to test this over the weekend.

Do you mean I should never use WC, or that I should never use memory that is not WC?

I decided for WC as I will fill the MEM by CPUs only a few times and then will consume by GPUs only.
So it’s more or less an input buffer.
Honestly I’m not sure whether to make it mapped and write combined or to make it non-WC only pinned.
Option 3 would be “to use CUDA 4.0 interface” and register it and so on. But some intuition tells me that the hand-tuned options should be better suited for the actual expected access patterns than the “universal” choice implemented using some hidden mojo.
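For reference, the “CUDA 4.0 interface” option mentioned above would presumably be `cudaHostRegister`: allocate normally, then pin the pages afterwards. A minimal sketch, assuming the portable flag is wanted for all four Teslas (the 1 GB size is just an example, not from the thread):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

int main(void)
{
    // "Option 3" sketch: allocate with plain malloc, then pin the pages
    // afterwards via the CUDA 4.0 cudaHostRegister API.
    size_t bytes = (size_t)1 << 30;  // 1 GB, arbitrary example size
    char *buf = (char*)malloc(bytes);
    if (!buf) return 1;

    // cudaHostRegisterPortable makes the pinned region usable from all
    // CUDA contexts, analogous to cudaHostAllocPortable.
    cudaError_t err = cudaHostRegister(buf, bytes, cudaHostRegisterPortable);
    if (err != cudaSuccess) { free(buf); return 2; }

    /* ... buf now behaves as pinned memory for cudaMemcpyAsync ... */

    cudaHostUnregister(buf);
    free(buf);
    return 0;
}
```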

Thanks for your help TM,

with non-WC memory I don’t get the “out of mem” error anymore and it works fine.
So for now I will go this way.
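Concretely, that means dropping the write-combined flag from the allocation in the first post, roughly like this (same `tracedata`, `NumberOfTraces`, and `NUMOFPOINTS` as in the original snippet):

```cuda
// Same allocation as before, but without cudaHostAllocWriteCombined --
// this is the variant that succeeded for the large array.
cudaError_t error = cudaHostAlloc((void**)&tracedata,
                                  (size_t)NumberOfTraces * NUMOFPOINTS,
                                  cudaHostAllocPortable);
```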


I’m facing the same problem as you. I want to allocate pinned memory as 4 × 512 MB so that I can use functions like cudaMemcpyAsync, but the fourth cudaHostAlloc call fails.

I’ve searched a lot on Google but still can’t find a solution. Is it a bug in NVIDIA’s CUDA? Have you fixed it?

Sorry, I hadn’t seen your reply earlier:
Yes, for me everything was solved once I removed the evil write combining.

I eventually found out that my GPU (Quadro FX 4800) can allocate at most about 1.5 GB of pinned memory; allocating more returns a CUDA error.