I have a de-interlacing kernel that splits a given array into 3 individual arrays
E.g. original array: [a1, a2, a3, b1, b2, b3, c1, c2, c3,…(N*3)] → final array: [a1, b1, c1,…(N), a2, b2, c2,…(N), a3, b3, c3,…(N)]
nothing too fancy…
But, only for some particular data sizes, the performance of the kernel (in terms of GB/s) is consistently better than the device-to-device memcpy
in these particular cases the performance of the memcpy drops by around 10% ?? (~61 GB/sec in the case of the Quadro)
isn’t the memcpy supposed to be the maximum possible throughput (for that particular data size) ??
further, for these particular cases…I have also tried out the 2-way split, and that matches the memcpy performance for the same-sized data…
the only reason I can see why the 2-way split kernel would be slower than the 3-way split is warp serialization in the 2-way case ??
are there warp-serialization issues with the memcpy (for these sizes) ??
it would be great if anyone else could confirm this and shed some light on it…
the code is attached below…
thanks in advance…
edit: the graphs have been removed since they weren’t completely right; it’s not a linear interpolation between these points, it’s more of a saw-tooth kind of curve with these points forming the kinks…
deinterlace_kernel.txt (1.4 KB)
lace_main.txt (3.65 KB)