I know this question gets asked frequently on the Developer Forums, but I can't solve my problem without asking it myself with a detailed explanation.
I have written a program that tries to use the full memory of one GPU of my Titan Z, and the overhead of data transfers accounts for more than 90% of the total processing time. I'm trying to offload a buffer of ~32 million ints to GPU memory, but I can't reach a transfer bandwidth higher than 3 GB/s.
It's important to mention that my code calls cudaMallocHost a single time, allocating a host buffer as large as the GPU's memory. When I used pinned memory I achieved more than ~7 GB/s in the data transfer, but my computation on the GPU became more expensive afterwards, so the total time (computation time + data transfer) is larger than when I use pageable memory.
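For reference, here is a minimal benchmark sketch of the pageable-vs-pinned comparison I am describing, timing a single host-to-device copy of the ~32M-int buffer with CUDA events (buffer size and naming are illustrative, not my actual code):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const size_t N = 32u * 1024 * 1024;      // ~32M ints, ~128 MB
    const size_t bytes = N * sizeof(int);

    int *h_pageable = (int *)malloc(bytes);  // pageable host buffer
    int *h_pinned = nullptr;
    cudaMallocHost(&h_pinned, bytes);        // pinned (page-locked) host buffer

    int *d_buf = nullptr;
    cudaMalloc(&d_buf, bytes);               // device buffer

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms = 0.0f;

    // Pageable H2D copy: the driver stages through an internal pinned buffer,
    // which is why bandwidth tops out well below the PCIe limit.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pageable: %.2f GB/s\n", bytes / ms / 1e6);

    // Pinned H2D copy: the DMA engine reads the locked pages directly.
    cudaEventRecord(start);
    cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("pinned:   %.2f GB/s\n", bytes / ms / 1e6);

    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    free(h_pageable);
    return 0;
}
```

Note that pinned memory only changes how the host buffer is allocated; it should not by itself make kernels slower, so the computation slowdown I see may come from something else in my code.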
Is there another way to improve bandwidth without increasing the computation time on the GPU? I've tried using CUDA streams to overlap the transfer of array chunks, but my bandwidth remained the same.
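This is roughly the chunked-streams pattern I tried. One thing I learned while testing: cudaMemcpyAsync silently falls back to a synchronous copy when the host buffer is pageable, so streams only help if the buffer is pinned, and even then they don't raise the raw PCIe bandwidth; they hide transfer time behind kernel execution. A sketch under those assumptions (the kernel is a placeholder for the real computation):

```cuda
#include <cuda_runtime.h>

__global__ void process(int *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;                 // placeholder computation
}

int main() {
    const size_t N = 32u * 1024 * 1024;      // ~32M ints
    const int nStreams = 4;
    const size_t chunk = N / nStreams;
    const size_t chunkBytes = chunk * sizeof(int);

    int *h_buf = nullptr;
    cudaMallocHost(&h_buf, N * sizeof(int)); // pinned: required for real async copies
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, N * sizeof(int));

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    // Each stream copies its chunk and runs the kernel on it, so the
    // transfer of chunk k+1 overlaps with the computation on chunk k.
    for (int s = 0; s < nStreams; ++s) {
        size_t off = s * chunk;
        cudaMemcpyAsync(d_buf + off, h_buf + off, chunkBytes,
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_buf + off, chunk);
        cudaMemcpyAsync(h_buf + off, d_buf + off, chunkBytes,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}
```

If my earlier streams test used a pageable buffer, that would explain why the bandwidth stayed the same.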
My hardware specifications are the following:
6 GB (single card used)
GDDR5 384-bit (single card)
Maximum BW: 336 GB/sec
PCI-Express 3.0 x16 (diagnosed in GPU-Z)
PCI-Express 3.0 x16 (diagnosed in CPU-Z)
My Host Memory:
16 GB DDR3 (2x Dual Channel)
Thanks for everything!