Hello, I encountered an interesting behaviour while programming matrix/vector operations in a loop. I read several posts here and on StackOverflow, but none fully answered what I am looking for; some mentioned that cudaMemcpy takes a lot of time due to some checking of data integrity in global memory. I use some very basic parallel algorithms, such as:
__global__ void MxV(float vec[], float mat[], float out[]) {
    // one thread per output element: out[tid] = row tid of mat times vec
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float sum = 0;
    if (tid < n) {
        for (int i = 0; i < n; i++)
            sum += vec[i] * mat[i + (n * tid)];
        out[tid] = sum;
    }
}
__global__ void VxV(float vec1[], float vec2[], float out[]) {
    // element-wise product of two vectors
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        out[tid] = vec1[tid] * vec2[tid];
    }
}
__global__ void VaddV(float a1, float vec1[], float a2, float vec2[], float out[]) {
    // scaled vector addition: out = a1 * vec1 + a2 * vec2
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        out[tid] = a1 * vec1[tid] + a2 * vec2[tid];
    }
}
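For context, the host-side setup looks roughly like this (simplified sketch only: host_A and host_r are placeholder names, and in my real code n and GPU_threads are defined elsewhere; I make n a #define here just so the sketch is self-contained and visible to the kernels above):

#define n 10000              // problem size (placeholder value for the sketch)
#define GPU_threads 256      // threads per block

float *dev_A, *dev_x, *dev_r, *dev_p, *dev_Ap;
cudaMalloc(&dev_A, n * n * sizeof(float));   // dense n x n matrix, one row per thread in MxV
cudaMalloc(&dev_x, n * sizeof(float));
cudaMalloc(&dev_r, n * sizeof(float));
cudaMalloc(&dev_p, n * sizeof(float));
cudaMalloc(&dev_Ap, n * sizeof(float));

// the input data goes to the device once, before the iteration loop starts
cudaMemcpy(dev_A, host_A, n * n * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dev_r, host_r, n * sizeof(float), cudaMemcpyHostToDevice);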
The loop in the image below mostly calls matrix and vector multiplications. The whole loop takes 245 ms on average, and when I ran it line by line I discovered that the first call of cudaMemcpy in every loop iteration takes most of the loop run time (about 95%). The second, identical call of cudaMemcpy takes nearly no time at all.
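To make that observation concrete: "line by line" corresponds to host-side wall-clock time around each call, roughly like this (illustrative sketch only; host_buf and dev_buf are placeholder names):

#include <chrono>
#include <cstdio>

auto t0 = std::chrono::high_resolution_clock::now();
cudaMemcpy(host_buf, dev_buf, n * sizeof(float), cudaMemcpyDeviceToHost);  // the "slow" first copy of the iteration
auto t1 = std::chrono::high_resolution_clock::now();
double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
printf("first cudaMemcpy: %.3f ms\n", ms);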
I am building a small program for iteratively solving linear equations, and speed is very important to me. If possible, I would like to avoid spending most of the time on transfers between GPU memory and RAM, so as an alternative I built the loop entirely without any passing of data:
for (int i = 0; i < 300; i++)
{
    MxV<<<n / GPU_threads + 1, GPU_threads>>>(dev_p, dev_A, dev_Ap);
    VdotV<<<n / GPU_threads + 1, GPU_threads>>>(dev_p, dev_Ap, dev_denom);
    divideScalars<<<n / GPU_threads + 1, 1>>>(dev_nom, dev_denom, dev_alfa);
    assignScalars<<<n / GPU_threads + 1, GPU_threads>>>(dev_nom, dev_denom);
    VaddV2<<<n / GPU_threads + 1, GPU_threads>>>(1.0, dev_x, dev_alfa, dev_p, dev_x);
    VaddV2<<<n / GPU_threads + 1, GPU_threads>>>(1.0, dev_r, dev_alfa, dev_Ap, dev_r);
    VdotV<<<n / GPU_threads + 1, GPU_threads>>>(dev_r, dev_r, dev_nom);
    divideScalars<<<n / GPU_threads + 1, 1>>>(dev_nom, dev_denom, dev_beta);
    VaddV2<<<n / GPU_threads + 1, GPU_threads>>>(-1.0, dev_r, dev_beta, dev_p, dev_p);
}
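(divideScalars and assignScalars are tiny scalar kernels that keep alfa and beta on the device; I left their exact bodies out because they are trivial, but they look roughly like this sketch:)

__global__ void divideScalars(float nom[], float denom[], float out[]) {
    // illustrative sketch: a single scalar division done on the device
    if (threadIdx.x == 0 && blockIdx.x == 0)
        out[0] = nom[0] / denom[0];
}

__global__ void assignScalars(float src[], float dst[]) {
    // illustrative sketch: copy one scalar, e.g. keep the old numerator as the next denominator
    if (threadIdx.x == 0 && blockIdx.x == 0)
        dst[0] = src[0];
}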
What I do not understand is that this second loop is less than 1% faster than the loop where I identified cudaMemcpy as the bottleneck. Is there a reason for that? What am I missing?
Thank you for any hints,
Vladimir