I have a GTX280 with PCI-E bus.
I intended to hide the bi-directional data transfer time between host and device using two streams.
Here is the test code:
// With the launch below, src0 is a mapped host buffer and dst0 a device buffer (HtoD),
// while src1 is a device buffer and dst1 a mapped host buffer (DtoH), so one kernel
// drives both PCIe directions at once.
__global__ void BiDirectionalMemcpy(uint *dst0, uint *src0, uint *dst1, uint *src1, uint n)
{
    for (uint i = threadIdx.x; i < n; i += blockDim.x)
    {
        dst0[i] = src0[i];
        dst1[i] = src1[i];
    }
}

int main()
{
    cudaSetDeviceFlags(cudaDeviceMapHost);
    // in_d/out_d are device buffers; in_mapped/out_mapped are device pointers
    // to mapped (zero-copy) host buffers. The allocations are shown below.
    BiDirectionalMemcpy<<<1, 512>>>(in_d, in_mapped, out_mapped, out_d, n); // should use more than 1 block, but this is just an example
}
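The allocation code isn’t shown above; filled in, main() looks roughly like this, i.e. the standard mapped-memory (zero-copy) setup, with 32 MiB per direction and error checking left out:

int main()
{
    // Mapping host memory into the device address space must be enabled
    // before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    uint n = (32 << 20) / sizeof(uint);   // 32 MiB per direction
    uint *in_h, *out_h;                   // page-locked, mapped host buffers
    uint *in_mapped, *out_mapped;         // device-side aliases of the mapped buffers
    uint *in_d, *out_d;                   // ordinary device buffers

    cudaHostAlloc((void **)&in_h,  n * sizeof(uint), cudaHostAllocMapped);
    cudaHostAlloc((void **)&out_h, n * sizeof(uint), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&in_mapped,  in_h,  0);
    cudaHostGetDevicePointer((void **)&out_mapped, out_h, 0);
    cudaMalloc((void **)&in_d,  n * sizeof(uint));
    cudaMalloc((void **)&out_d, n * sizeof(uint));

    // HtoD (in_mapped -> in_d) and DtoH (out_d -> out_mapped) in one kernel.
    BiDirectionalMemcpy<<<1, 512>>>(in_d, in_mapped, out_mapped, out_d, n);
    cudaThreadSynchronize();              // the launch is async; wait for the copies
    return 0;
}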
Mapped memory will be slower than device memory, but it obviously won’t be any slower than using cudaMemcpy() on host RAM.
No, there’s only one GPU, and if the GPU doesn’t allow it, the number of threads won’t make a difference.
So both cudaMemcpyAsync() & BiDirectionalMemcpy() overlap the transfers, despite what Tim said. I’m using a Tesla C1060. Here, PCIe is much more of a bottleneck than memory bandwidth, but in general a memory bandwidth bottleneck will prevent complete overlap.
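For reference, the cudaMemcpyAsync() side of the comparison is just the usual two-stream pattern, something like this (in_h/out_h pinned as before; nbytes = 32 MiB per direction):

// Two-stream attempt at overlapping HtoD and DtoH copies. Both host buffers
// must be page-locked (cudaHostAlloc / cudaMallocHost); otherwise the async
// copies degrade to synchronous behaviour.
size_t nbytes = 32 << 20;              // 32 MiB per direction
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

cudaMemcpyAsync(in_d,  in_h,  nbytes, cudaMemcpyHostToDevice, s0);  // HtoD in stream 0
cudaMemcpyAsync(out_h, out_d, nbytes, cudaMemcpyDeviceToHost, s1);  // DtoH in stream 1

cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);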
I found I needed multiple trials in order for the cudaMemcpy speed to ramp up, probably because calling cudaMemcpy for the first time does extra initialization. BTW, your BiDirectionalMemcpy() is incorrect. Also, you should check for errors in CUDA code (easiest is with CUDA_SAFE_CALL), especially when you’re allocating such big pinned arrays!
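Something like this is what I mean: a hand-rolled check in place of cutil’s CUDA_SAFE_CALL, an untimed warm-up copy, and then an average over several timed trials (the size and trial count here are just for illustration):

#include <cstdio>
#include <cstdlib>

// Hand-rolled equivalent of CUDA_SAFE_CALL: print the error and bail out.
#define CHECK(call) do {                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            printf("CUDA error %s at %s:%d\n",                        \
                   cudaGetErrorString(err_), __FILE__, __LINE__);     \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main()
{
    size_t nbytes = 32 << 20;              // 32 MiB, illustrative

    // Big pinned allocations are exactly where things start failing,
    // so check the return codes.
    unsigned int *h_buf, *d_buf;
    CHECK(cudaMallocHost((void **)&h_buf, nbytes));
    CHECK(cudaMalloc((void **)&d_buf, nbytes));

    // The first copy pays one-time initialization costs, so do an
    // untimed warm-up before measuring anything.
    CHECK(cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    const int trials = 10;
    CHECK(cudaEventRecord(start, 0));
    for (int t = 0; t < trials; ++t)
        CHECK(cudaMemcpy(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("avg HtoD: %.2f ms, %.0f MiB/s\n", ms / trials,
           (nbytes / (1024.0 * 1024.0)) / ((ms / trials) / 1000.0));

    CHECK(cudaFreeHost(h_buf));
    CHECK(cudaFree(d_buf));
    return 0;
}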
Don’t worry if you don’t get overlap. I’ve asked Tim and he insists current hardware doesn’t support it. Earlier you reported cudaMemcpyAsync didn’t overlap, which probably means the hardware doesn’t support it. Maybe NVIDIA thinks overlapped transfers are for high end applications and people who can afford $2000 Tesla boards.
I’ll test it on my GTX 260 core216 at home and see if that’s the case.
This is exactly what I stated here, and you ignored it.
Please READ THE DETAILS when answering a non-trivial question and avoid misleading replies. This isn’t the first time you’ve completely missed my question. I asked here earlier whether transferring directly from a PCIe device to GPU RAM is possible, and you said no, period. Later I found out it is clearly doable on Quadro boards.
Can GT200 overlap cuMemcpyDtoHAsync and cuMemcpyHtoDAsync in different streams? No. Your calculations are way off, because the maximum measured PCIe speed (for Gen2) is roughly 6 GB/s. At that point, the fastest your transfer could run is 10.67 ms. Which is to say, hey, no overlap!
Can GF100 overlap cuMemcpyDtoHAsync and cuMemcpyHtoDAsync in different streams? Yes, that’s a major architectural improvement.
Can GT200 overlap DtoH and HtoD transfers via zero-copy? Yes, it can.
I know exactly how the hardware is implemented, how the driver is implemented, and have looked at PCIe effects in plenty of systems. I even released a bidirectional bandwidth test a couple of weeks ago.
Whoops, my mistake. Apparently cudaMemcpyAsync() doesn’t overlap. The amount of overlap for BiDirectionalMemcpy() is also much smaller than I found earlier. The mistake was that the code had a 2-way memory bank conflict, which probably made the earlier results less meaningful. After rewriting it, here are the times:
32 MiB transfer per direction
Tavg_host2device = 6.9 ms => 4600 MiB/s
Tavg_both = 9.5 ms => 6700 MiB/s (aggregate)
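To spell out where those bandwidth numbers come from (same 32 MiB per direction):
32 MiB / 6.9 ms ≈ 4600 MiB/s one way
(32 + 32) MiB / 9.5 ms ≈ 6700 MiB/s aggregate
The aggregate rate is well above the one-way rate but short of 2 × 4600 MiB/s, so the two directions overlap, just not completely.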
Previously, I thought incorrectly that the PCIe 2.0 unidirectional speed was 250 MB/s per lane (about 4 GB/s for x16), since deviceBandwidthTest reports 3800 MiB/s for non-pinned memory copies, which I took to be the maximum PCIe x16 speed.