I need to decide whether to port a simple fp32 vector Euclidean distance calculation from the CPU to the GPU. Assume that I can preload one of the vectors onto the GPU and have to load the second at runtime. At roughly what number of elements is the GPU likely to become faster than the CPU? (I am using an RTX 2080 Ti as the GPU and a 3 GHz Intel machine as the CPU.) The CPU loop is essentially:
    for (i = 0; i < elements; i++)
    {
        diff = *(vector1++) - *(vector2++);
        sum += diff * diff;
    }