Hello,
I have a question regarding how to transfer a scalar between two kernels in an efficient way.
Let’s say that the first kernel calculates the dot-product of two vectors.
result = dot(vecA, vecB)
__global__ void
MyDot_kernel(float* vecA_d, float* vecB_d, float* result_d, int size)
{
// Some calculation stuff...
// The result from the dot-product is finally stored in Accum[0]
if (threadIdx.x == 0) *result_d = Accum[0];
}
Now the result from MyDot_kernel() resides in global device memory.
The second kernel adds two vectors, where the second vector is multiplied by the result of the dot-product calculated in MyDot_kernel().
vecA = vecB + result*vecC
Now there are different options on how to pass the result from MyDot_kernel() to the second kernel.
The easiest is to give the second kernel the pointer to the result. However, this means that the second kernel has to read global memory whenever it accesses result_d. This kernel might look something like this:
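__global__ void
AddConstMultVec(float* vecA, float* vecB, float* result_d, float* vecC)
{
// y = some corresponding index; here assumed to be the global thread index
// of a 1D launch that covers the vectors exactly (there is no size argument to check against)
int y = blockIdx.x * blockDim.x + threadIdx.x;
vecA[y] = vecB[y] + (*result_d) * vecC[y];
}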
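To make the pointer option concrete, here is a minimal sketch of both kernels launched back to back with the scalar staying in device memory. The names DotSketch, AddScaledSketch and scaledAddViaPointer are hypothetical, the single-block reduction is only a stand-in for the elided body of MyDot_kernel(), and for brevity the dot-product is taken over the same two vectors that are later combined. It assumes non-empty inputs of equal length and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for MyDot_kernel(): single-block dot product.
// Assumes it is launched with exactly 256 threads in one block.
__global__ void DotSketch(const float* a, const float* b, float* result_d, int size)
{
    __shared__ float accum[256];
    float sum = 0.0f;
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        sum += a[i] * b[i];
    accum[threadIdx.x] = sum;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            accum[threadIdx.x] += accum[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) *result_d = accum[0];
}

// Second kernel: each thread reads the scalar from global memory once
// and keeps it in a local variable for the rest of its work.
__global__ void AddScaledSketch(float* vecA, const float* vecB,
                                const float* result_d, const float* vecC, int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    const float s = *result_d;
    if (y < size)
        vecA[y] = vecB[y] + s * vecC[y];
}

// Hypothetical host driver: computes b + dot(b, c) * c without copying
// the scalar back to the host between the two kernels.
std::vector<float> scaledAddViaPointer(const std::vector<float>& b,
                                       const std::vector<float>& c)
{
    const int n = static_cast<int>(b.size());
    const size_t bytes = n * sizeof(float);
    float *a_d, *b_d, *c_d, *result_d;
    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(b_d, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c.data(), bytes, cudaMemcpyHostToDevice);

    // Both launches go to the default stream, so the second kernel
    // starts only after the first one has written *result_d.
    DotSketch<<<1, 256>>>(b_d, c_d, result_d, n);
    AddScaledSketch<<<(n + 255) / 256, 256>>>(a_d, b_d, result_d, c_d, n);

    std::vector<float> a(n);
    cudaMemcpy(a.data(), a_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d); cudaFree(result_d);
    return a;
}
```

In this sketch every thread still issues one global load for the scalar, which is exactly the access I would like to avoid.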
Another option is to copy result_d back to the host and then pass the value as an argument to the second kernel. The advantage of this method is that the second kernel can read the value from shared memory instead of global memory. It is of course not a good option, though, since we first need to copy the result from global memory to the host.
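For comparison, a minimal sketch of this host round-trip. The names AddScaledByValue and launchWithHostCopy are hypothetical. It assumes result_d already holds the value written by the first kernel, that all pointers refer to device memory, and that size is positive. Error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical variant of the second kernel: the scalar arrives by value
// as a kernel argument, so no thread dereferences a pointer to it.
__global__ void AddScaledByValue(float* vecA, const float* vecB,
                                 float scale, const float* vecC, int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < size)
        vecA[y] = vecB[y] + scale * vecC[y];
}

// Hypothetical host helper: fetches the scalar from the device and
// passes it to the second kernel by value.
void launchWithHostCopy(float* vecA_d, const float* vecB_d,
                        const float* result_d, const float* vecC_d, int size)
{
    float result_h = 0.0f;
    // The host blocks here until the 4-byte value has arrived.
    cudaMemcpy(&result_h, result_d, sizeof(float), cudaMemcpyDeviceToHost);

    const int threads = 256;
    const int blocks = (size + threads - 1) / threads;
    AddScaledByValue<<<blocks, threads>>>(vecA_d, vecB_d, result_h, vecC_d, size);
}
```

The blocking cudaMemcpy between the two launches is the cost of this approach: the host has to wait for the first kernel to finish before it can even launch the second one.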
So my question is whether there is a good way to pass a single scalar between kernels without using global memory or memory copies.
Regards,
Björn