Kernel faster than device-to-device memcpy()?

I have a de-interlacing kernel that splits a given array into 3 individual arrays.
E.g. original array: [a1, a2, a3, b1, b2, b3, c1, c2, c3,…(N*3)] --> final array: [a1, b1, c1,…(N), a2, b2, c2,…(N), a3, b3, c3,…(N)]
Nothing too fancy…
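
For readers without the attachment, here is a minimal sketch of what such a 3-way split could look like. It is not the attached kernel: the names `deinterlace3` and `deinterlace3_check`, the `float` element type, and the one-thread-per-output-element mapping are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical 3-way de-interlace: in[3*i + c] -> out[c*n + i].
// One thread per output element: consecutive threads write consecutive
// addresses, while their reads are strided by 3 elements.
__global__ void deinterlace3(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // output index in [0, 3n)
    if (idx < 3 * n) {
        int c = idx / n;   // which of the 3 output arrays
        int i = idx % n;   // position inside that array
        out[idx] = in[3 * i + c];
    }
}

// Illustrative host helper: runs the kernel once and verifies the layout.
// Returns false on any CUDA error or mismatch.
bool deinterlace3_check(int n)
{
    std::vector<float> h_in(3 * n), h_out(3 * n);
    for (int k = 0; k < 3 * n; ++k) h_in[k] = static_cast<float>(k);

    float *d_in = nullptr, *d_out = nullptr;
    size_t bytes = 3 * static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return false; }
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (3 * n + threads - 1) / threads;
    deinterlace3<<<blocks, threads>>>(d_in, d_out, n);

    // The blocking copy back also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int c = 0; c < 3; ++c)
        for (int i = 0; i < n; ++i)
            if (h_out[c * n + i] != h_in[3 * i + c]) return false;
    return true;
}
```

Calling `deinterlace3_check(1000)` from `main` should return `true` if the layout is as described above. A tuned version might stage data through shared memory so that both reads and writes are contiguous; this sketch does not attempt that.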

But for some particular data sizes, the performance of the kernel (in terms of GB/s) is consistently better than the device-to-device memcpy.
In these particular cases the performance of the memcpy drops by around 10% (to ~61 GB/s in the case of the Quadro).

Isn’t the memcpy supposed to give the maximum possible throughput (for that particular data size)?

Further, for these particular cases I have also tried a 2-way split, and that matches the memcpy performance for the same-sized data.
The only reason I can see why the 2-way split kernel would be slower than the 3-way split is warp serialization in the 2-way case?
Are there warp serialization issues with the memcpy (for these sizes)?

It would be great if anyone else could confirm this and shed some light on it.
The code is attached below.

Thanks in advance…

Edit: the graphs have been removed since they weren’t completely right. It’s not a linear interpolation between these points; it’s more of a saw-tooth curve, with these points forming the kinks.
deinterlace_kernel.txt (1.4 KB)
lace_main.txt (3.65 KB)

I don’t think you’re supposed to put the cudaThreadSynchronize() calls within the for loops themselves.
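
To illustrate the suggestion, here is a rough sketch of timing a batch of device-to-device copies with a single wait after the loop instead of a synchronization on every iteration. It uses CUDA events rather than `cudaThreadSynchronize()` (which later CUDA releases deprecated in favour of `cudaDeviceSynchronize()`); the helper name `time_d2d_memcpy_ms` and its parameters are made up for this example and are not taken from the attached code.

```cuda
#include <cuda_runtime.h>

// Average time in milliseconds of `reps` device-to-device copies.
// Returns a negative value on error.
float time_d2d_memcpy_ms(void* dst, const void* src, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    cudaEventRecord(start, 0);
    for (int r = 0; r < reps; ++r) {
        // No synchronization inside the loop; the copies queue up on the device.
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // single wait after all copies have been issued

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (err == cudaSuccess) ? ms / reps : -1.0f;
}
```

The same start/stop pattern can wrap a loop of kernel launches, so the kernel and the memcpy get measured the same way.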


Hi Nico, I don’t think that really matters…

Here are the particular data sizes (and results) for which I observed this anomaly, i.e. memcpy (device to device) slowing down by ~10%

(on the Quadro FX 5600)


N          Data-Size (bytes)   3-way split (GB/s)   memcpy (GB/s)   2-way split (GB/s)
65536      1572864             44.952               41.898          40.227
131072     3145728             52.473               50.536          48.711
262144     6291456             59.782               56.159          53.782
524288     12582912            62.561               56.953          57.299
1048576    25165824            65.164               60.482          58.756
2097152    50331648            65.417               61.240          59.659
4194304    100663296           66.677               61.622          60.038
8388608    201326592           66.077               61.277          60.092
16777216   402653184           66.557               61.256          60.172
33554432   805306368           66.462               61.301          60.4
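
For anyone reproducing these figures, the GB/s numbers are just bytes moved divided by elapsed time. A small sketch of that arithmetic follows; the helper name `effective_gbps` is hypothetical, and the use of decimal gigabytes (1 GB = 1e9 bytes) is an assumption. Whether the byte count should include reads only or reads plus writes is left to the caller, since the posts here don't spell out what the attached code does.

```cuda
#include <cstddef>

// Effective bandwidth in GB/s, assuming 1 GB = 1e9 bytes,
// for `bytes` moved in `ms` milliseconds. Returns 0 for non-positive times.
inline double effective_gbps(std::size_t bytes, double ms)
{
    if (ms <= 0.0) return 0.0;
    double gigabytes = static_cast<double>(bytes) / 1.0e9;
    double seconds = ms / 1.0e3;
    return gigabytes / seconds;
}
```

As a purely hypothetical example, 100663296 bytes moved in 1.5 ms would work out to roughly 67.1 GB/s.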


these values of N can be specified in the attached code (

I ran your code multiple times for different values, and on my system (Ubuntu 9.04 64-bit, GeForce 9800 GX2)
the results are always consistent around these values:

De-intrlacing into 3 arrays(0.1007Gb) Kernel- 49.686(Gb/s) Memcpy- 54.400(Gb/s)

where the last value matches the bandwidth limit reported by the ./bandwidthTest executable.
On this system, the kernel path is never faster than the memcpy path, regardless of the data size.



Maybe this is something seen only on particular devices?

Or something is wrong with my card/code!

I will take a look at it again…