I hope someone can help me with this issue. I've implemented a simple device application that modifies a float array through a device pointer. The array is allocated with cudaHostAlloc (mapped memory), and I then access it through the host pointer to check the results (I do synchronize between the CPU and GPU). The problem is that it works fine with small sizes, but once more than about 30 MB is allocated it no longer works correctly. I don't understand what I'm missing here.
Does it have to do with the block and grid dimensions, or the thread count?
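In case it helps, here is a stripped-down sketch of what I am doing (not my exact code; the kernel, the 256-thread block size, and the array length are just placeholders):

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: scales every element in place through the mapped pointer.
__global__ void scaleKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 8 * 1024 * 1024;             // ~32 MB of floats, roughly the size where it stops working for me
    const size_t bytes = n * sizeof(float);

    cudaSetDeviceFlags(cudaDeviceMapHost);     // must be called before the CUDA context is created

    float *hPtr = nullptr;
    cudaHostAlloc((void **)&hPtr, bytes, cudaHostAllocMapped);

    float *dPtr = nullptr;
    cudaHostGetDevicePointer((void **)&dPtr, hPtr, 0);

    for (int i = 0; i < n; ++i)
        hPtr[i] = 1.0f;

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    scaleKernel<<<blocks, threads>>>(dPtr, n);
    cudaDeviceSynchronize();                   // make sure the GPU writes are visible on the host

    printf("first = %f, last = %f\n", hPtr[0], hPtr[n - 1]);
    cudaFreeHost(hPtr);
    return 0;
}
```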
Thank you for your reply, but I still don't quite get it. Doesn't the GPU access main memory directly in the case of pinned memory? And since we usually have more main memory than device memory, shouldn't we be able to make larger allocations with pinned memory than with pageable memory?
Pinned memory is effectively just a contiguous host memory reservation made by the GPU driver, which the operating system is prevented from paging in and out of virtual memory or moving about in the physical address space (for fragmentation management, garbage collection, etc.). The largest free contiguous block available in the address space is usually considerably smaller than the sum of free memory because of fragmentation, so I would expect the total amount of allocatable pageable memory to always be larger than the amount of pinned memory. The CUDA zero-copy functionality adds what is effectively DMA to pinned memory, so that not only is the memory pinned by the driver, but the GPU can also write directly to it over the PCI-e bus without the need for explicit host-side copy functions.
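If you want to see how much pinned memory your system will actually give you, you can probe it by asking cudaHostAlloc for progressively larger blocks and checking the return code. A rough sketch (the 64 MB step and the 16 GB cap are arbitrary):

```
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Probe pinned allocations in 64 MB steps until the driver refuses one.
    const size_t step = 64ULL << 20;
    for (size_t bytes = step; bytes <= (16ULL << 30); bytes += step)
    {
        void *p = nullptr;
        cudaError_t err = cudaHostAlloc(&p, bytes, cudaHostAllocDefault);
        if (err != cudaSuccess)
        {
            printf("pinned allocation failed at %zu MB: %s\n",
                   bytes >> 20, cudaGetErrorString(err));
            break;
        }
        printf("pinned allocation of %zu MB succeeded\n", bytes >> 20);
        cudaFreeHost(p);
    }
    return 0;
}
```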
I think you’re confusing pinned memory with zero-copy. Zero-copy memory is mainly beneficial for integrated graphics cards with no dedicated memory, since they work in the same memory pool (system memory) as the CPU. In that case, using zero-copy memory avoids an extra memcpy.
Pinned memory is simply page-locked system memory. If your storage is not allocated in pinned memory and you upload it to the GPU, the driver will first copy it into a pinned staging buffer so that it can start a DMA transfer to the GPU.
If you allocate your storage directly in pinned memory, this extra copy is not needed; that’s why it’s faster. But as the manual states, page-locked memory is a scarce resource, so although you may have many gigabytes of system memory, the amount available for page-locking can be significantly less.
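To make the two paths concrete, here is a rough sketch of the same upload done from pageable and from pinned memory (size and names are arbitrary, and I've left out timing; the pinned copy is the one that skips the staging step):

```
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 64ULL << 20;          // 64 MB, arbitrary
    float *dDst = nullptr;
    cudaMalloc((void **)&dDst, bytes);

    // Pageable path: the driver first stages the data into an internal
    // pinned buffer, then DMAs it to the GPU.
    float *pageable = (float *)malloc(bytes);
    cudaMemcpy(dDst, pageable, bytes, cudaMemcpyHostToDevice);

    // Pinned path: the buffer is already page-locked, so the driver can
    // start the DMA transfer directly (and the copy can be made async).
    float *pinned = nullptr;
    cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault);
    cudaMemcpy(dDst, pinned, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(pinned);
    free(pageable);
    cudaFree(dDst);
    return 0;
}
```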