Faster than device-to-device memcpy()?

I have a de-interlacing kernel that splits a given array into 3 individual arrays.
E.g. original array: [a1, a2, a3, b1, b2, b3, c1, c2, c3, … (N*3)] → final array: [a1, b1, c1, … (N), a2, b2, c2, … (N), a3, b3, c3, … (N)]
Nothing too fancy…
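
(The attached deinterlace_kernel.txt isn't shown inline, so for reference, here is a minimal sketch of what a 3-way split kernel along these lines might look like — the names and the float element type are assumptions, not the attached code:)

[codebox]
// Hypothetical sketch of a 3-way split kernel (not the attached deinterlace_kernel.txt).
// Reads interleaved triplets [a_i, b_i, c_i] and writes each component into its own
// contiguous N-element section of the output buffer.
__global__ void deinterlace3(const float *in, float *out, int N)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        out[i]         = in[3 * i];      // a_i
        out[N + i]     = in[3 * i + 1];  // b_i
        out[2 * N + i] = in[3 * i + 2];  // c_i
    }
}
[/codebox]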

But for some particular data sizes, the performance of the kernel (in terms of GB/s) is consistently better than the device-to-device memcpy;
in these particular cases the memcpy performance drops by around 10% (to ~61 GB/s in the case of the Quadro).

Isn’t the memcpy supposed to give the maximum possible throughput (for that particular data size)?

Further, for these particular cases I have also tried out a 2-way split, and that matches the memcpy performance for the same-sized data.
The only reason I can see why the 2-way split kernel would be slower than the 3-way split is warp serialization in the 2-way case.
Are there warp serialization issues with the memcpy for these sizes?

It would be great if anyone else could confirm this and shed some light on it…
The code is attached below…

Thanks in advance…

Edit: the graphs have been removed since they weren’t completely right; it’s not a linear interpolation between these points, it’s more of a saw-tooth kind of curve with these points forming the kinks…
deinterlace_kernel.txt (1.4 KB)
lace_main.txt (3.65 KB)

I don’t think you’re supposed to put the cudaThreadSynchronize() calls within the for loops themselves.
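
I.e. something like this pattern — just a sketch with placeholder names, using the hypothetical deinterlace3 kernel sketched above rather than the attached lace_main.cu:

[codebox]
// Sketch of the timing pattern being suggested (placeholder names, not lace_main.cu):
// launch the kernel repeatedly, then synchronize once before stopping the timer,
// instead of calling cudaThreadSynchronize() inside each loop iteration.
float timeSplitKernel(const float *d_in, float *d_out, int N, int numIters)
{
    dim3 block(256);
    dim3 grid((N + block.x - 1) / block.x);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < numIters; ++i)
        deinterlace3<<<grid, block>>>(d_in, d_out, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);        // single synchronization after the whole loop

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return elapsedMs / numIters;       // average time per launch in ms
}
[/codebox]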

N.

Hi Nico, I don’t think that really matters…

Here are the particular data sizes (and results) for which I observed this anomaly, i.e. the memcpy (device to device) slowing down by ~10%

(on the Quadro FX 5600)

[codebox]
N          Data-Size (bytes)   3-way split (GB/s)   memcpy (GB/s)   2-way split (GB/s)
65536      1572864             44.952               41.898          40.227
131072     3145728             52.473               50.536          48.711
262144     6291456             59.782               56.159          53.782
524288     12582912            62.561               56.953          57.299
1048576    25165824            65.164               60.482          58.756
2097152    50331648            65.417               61.240          59.659
4194304    100663296           66.677               61.622          60.038
8388608    201326592           66.077               61.277          60.092
16777216   402653184           66.557               61.256          60.172
33554432   805306368           66.462               61.301          60.4
[/codebox]

These values of N can be specified in the attached code (lace_main.cu).
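
For what it’s worth, the Data-Size column looks consistent with counting N*3 float elements once for the read and once for the write (2 * 3 * 65536 * 4 = 1,572,864 bytes); here is a sketch of that byte accounting — just an assumption about what lace_main.cu does, not taken from it:

[codebox]
#include <cstdio>

// Assumed byte accounting (an assumption, not quoted from lace_main.cu):
// N triplets of floats, each element read once and written once by the
// split kernel (and likewise by the device-to-device memcpy).
static double effectiveGBps(size_t N, double elapsedMs)
{
    double bytes = 2.0 * 3.0 * N * sizeof(float);   // e.g. 1,572,864 for N = 65536
    return bytes / (elapsedMs * 1.0e6);             // bytes / seconds / 1e9
}

int main()
{
    // 0.035 ms is a hypothetical elapsed time, just to show the arithmetic
    printf("%.0f bytes -> %.3f GB/s\n",
           2.0 * 3.0 * 65536 * sizeof(float),
           effectiveGBps(65536, 0.035));
    return 0;
}
[/codebox]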

I ran your code multiple times for different values, and on my system (Ubuntu 9.04 64-bit, GeForce 9800 GX2)
the results are always consistent around these values:

De-interlacing into 3 arrays (0.1007 GB): Kernel - 49.686 GB/s, Memcpy - 54.400 GB/s

where the last value corresponds to the bandwidth limit reported by the ./bandwidthTest executable.
On this system, the kernel path is never faster than the memcpy path, regardless of the data size.

N.

Hmm… interesting.

Maybe this is something seen only on particular devices?

Or something wrong with my card/code!

I will take a look at it again…