CPU vs GPU

I need to decide whether to port a simple fp32 vector Euclidean distance calculation from the CPU to the GPU. Assume that I can preload one of the vectors onto the GPU but have to load the second at runtime. At what number of elements is the GPU likely to be faster than the CPU? (I am using an RTX 2080 Ti as the GPU and a 3 GHz Intel machine as the CPU.)

float sum = 0.0f;
for (int i = 0; i < elements; i++) {
    float diff = *(vector1++) - *(vector2++);
    sum += diff * diff;
}

If all you plan to do is

(1) Copy the source data (vector2) to the GPU
(2) Compute the results on the GPU
(3) Copy the results back to the host

this will not be faster than simply processing on the CPU, for vectors of any length. The reason is that the copy overhead (across a PCIe link with a maximum throughput of about 12.5 GB/sec) will be larger than the compute time on the CPU. FWIW, any reasonable compiler should be able to SIMD-vectorize the code you have shown.

If your pipeline includes other processing steps that already occur on the GPU, that could be a different scenario.

For context: What typical vector lengths do you expect to handle?
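As an aside on the SIMD point: here is one way to write the loop so the compiler can auto-vectorize it (the function name and signature are my own, a sketch rather than anything from your code). The `restrict` qualifiers promise the arrays do not alias, and plain indexing keeps the trip count obvious; with GCC or Clang you typically need `-O3` plus `-ffast-math` (or similar relaxed-FP flags) so the compiler is allowed to reorder the floating-point accumulation:

```c
#include <stddef.h>

/* Squared Euclidean distance between two fp32 vectors, written to be
 * auto-vectorization friendly: no aliasing, no pointer bumping. */
float squared_distance(const float *restrict vector1,
                       const float *restrict vector2,
                       size_t elements) {
    float sum = 0.0f;
    for (size_t i = 0; i < elements; i++) {
        float diff = vector1[i] - vector2[i];
        sum += diff * diff;
    }
    return sum;
}
```

You can confirm vectorization with `-fopt-info-vec` (GCC) or `-Rpass=loop-vectorize` (Clang).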