Passing a scalar between kernels: which is the most efficient way?


I have a question regarding how to transfer a scalar between two kernels in an efficient way.

Let’s say that the first kernel calculates the dot-product of two vectors.

result = dot(vecA, vecB)

__global__ void
MyDot_kernel(float* vecA_d, float* vecB_d, float* result_d, int size)
{
  // Some calculation stuff...

  // The result from the dot-product is finally stored in Accum[0]
  if (threadIdx.x == 0) *result_d = Accum[0];
}


Now the result from MyDot_kernel() resides in global device memory.
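
For anyone who wants something compilable to experiment with, here is a minimal sketch of how the elided part of MyDot_kernel() could look. It is not the original code: it assumes a single thread block, a power-of-two BLOCK_SIZE, and a shared array named Accum as in the snippet above. The host helper dot_on_device() is hypothetical and only there to exercise the kernel.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two

// Sketch: dot product computed by ONE thread block (assumption of this example).
__global__ void MyDot_kernel(const float* vecA_d, const float* vecB_d,
                             float* result_d, int size)
{
    __shared__ float Accum[BLOCK_SIZE];

    // Each thread sums a strided slice of the element-wise products.
    float partial = 0.0f;
    for (int i = threadIdx.x; i < size; i += blockDim.x)
        partial += vecA_d[i] * vecB_d[i];
    Accum[threadIdx.x] = partial;
    __syncthreads();

    // Tree reduction in shared memory; the total ends up in Accum[0].
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            Accum[threadIdx.x] += Accum[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0) *result_d = Accum[0];
}

// Hypothetical host helper: copies two host vectors to the device,
// runs the kernel, and returns the scalar result.
float dot_on_device(const float* a, const float* b, int size)
{
    float *a_d, *b_d, *result_d;
    float result = 0.0f;
    size_t bytes = size * sizeof(float);

    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    MyDot_kernel<<<1, BLOCK_SIZE>>>(a_d, b_d, result_d, size);

    // Blocking copy; it also waits for the kernel to finish.
    cudaMemcpy(&result, result_d, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(result_d);
    return result;
}
```

A multi-block version would need a second reduction step (or atomics) to combine the per-block sums; that is left out here to keep the sketch short.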

The second kernel adds two vectors, where the second vector is multiplied by the result of the dot-product calculated in MyDot_kernel().

vecA = vecB + result_d*vecC

Now there are different options on how to pass the result from MyDot_kernel() to the second kernel.

The easiest is to pass the pointer to the result to the second kernel; however, this means that the second kernel has to read result_d from global memory. This kernel might look something like this:

__global__ void
AddConstMultVec(float* vecA, float* vecB, float* result_d, float* vecC)
{
   // y = some corresponding index

   vecA[y] = vecB[y] + (*result_d) * vecC[y];
}
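
A slightly fuller sketch of this pointer-passing variant, in case it helps when timing the options. The 1D index computation, the extra size parameter, and the host helper add_via_pointer() are assumptions for illustration; the scalar argument of the helper simply stands in for whatever MyDot_kernel() left in device memory.

```cuda
#include <cuda_runtime.h>

// Each thread reads the scalar from global memory once and keeps it in a register.
// The `size` parameter and the 1D indexing are assumptions of this sketch.
__global__ void AddConstMultVec(float* vecA, const float* vecB,
                                const float* result_d, const float* vecC,
                                int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < size) {
        float result = *result_d;  // one global read per thread
        vecA[y] = vecB[y] + result * vecC[y];
    }
}

// Hypothetical host helper: `scalar` stands in for the dot-product result
// that would already reside in device memory.
void add_via_pointer(float* a, const float* b, const float* c,
                     float scalar, int n)
{
    float *a_d, *b_d, *c_d, *result_d;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(result_d, &scalar, sizeof(float), cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    // No host round trip: the kernel only receives the device pointer.
    AddConstMultVec<<<blocks, threads>>>(a_d, b_d, result_d, c_d, n);

    cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    cudaFree(result_d);
}
```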


Another option is to copy result_d back to the host and then pass it by value as an argument to the second kernel. The advantage of this method is that the second kernel reads the value as a kernel argument, from shared memory instead of global memory. The drawback is that we first need to copy the result from global memory to the host, so this is of course not a good option either.
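
To make the host round trip concrete, a rough sketch follows. The kernel name AddScalarMultVec, its size parameter, and the helper add_via_host_copy() are made up for this example; the scalar argument again stands in for the value MyDot_kernel() would have written to result_d.

```cuda
#include <cuda_runtime.h>

// The scalar arrives by value as a kernel argument, so no thread
// has to dereference a global-memory pointer to get it.
__global__ void AddScalarMultVec(float* vecA, const float* vecB,
                                 float result, const float* vecC, int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < size)
        vecA[y] = vecB[y] + result * vecC[y];
}

// Hypothetical host helper illustrating the device -> host -> kernel argument path.
void add_via_host_copy(float* a, const float* b, const float* c,
                       float scalar, int n)
{
    float *a_d, *b_d, *c_d, *result_d;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c, bytes, cudaMemcpyHostToDevice);
    // Stand-in for the first kernel writing its result to device memory.
    cudaMemcpy(result_d, &scalar, sizeof(float), cudaMemcpyHostToDevice);

    // The extra step this option costs: a blocking 4-byte copy to the host.
    float result_h = 0.0f;
    cudaMemcpy(&result_h, result_d, sizeof(float), cudaMemcpyDeviceToHost);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    AddScalarMultVec<<<blocks, threads>>>(a_d, b_d, result_h, c_d, n);

    cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    cudaFree(result_d);
}
```

Note that the blocking cudaMemcpy() forces the host to wait for the first kernel before it can launch the second one.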

So my question is if there is a good way to pass single scalars between kernels without using global memory or memory copies.



How about copying from global memory to constant memory? This is a device-to-device copy, which would be faster than device-to-host. Constant memory is also cached, making it more efficient to read than global memory.

If the number of scalars is small, you may not see much performance difference between this and passing the value to the kernel as an argument.
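
A minimal sketch of the constant-memory idea, assuming a single float in a __constant__ variable. The names result_c, AddConstMemMultVec, and add_via_constant() are made up for this example, and the scalar argument stands in for the value the dot-product kernel would have left in result_d. The device-to-device copy uses cudaMemcpyToSymbol() with cudaMemcpyDeviceToDevice.

```cuda
#include <cuda_runtime.h>

// Hypothetical constant-memory slot for the scalar.
__constant__ float result_c;

// All threads read the same constant address, which the constant cache broadcasts.
__global__ void AddConstMemMultVec(float* vecA, const float* vecB,
                                   const float* vecC, int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < size)
        vecA[y] = vecB[y] + result_c * vecC[y];
}

// Hypothetical host helper illustrating the global -> constant copy.
void add_via_constant(float* a, const float* b, const float* c,
                      float scalar, int n)
{
    float *a_d, *b_d, *c_d, *result_d;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c, bytes, cudaMemcpyHostToDevice);
    // Stand-in for the first kernel writing its result to device memory.
    cudaMemcpy(result_d, &scalar, sizeof(float), cudaMemcpyHostToDevice);

    // Device-to-device copy from global memory into the constant symbol;
    // the value never passes through host memory.
    cudaMemcpyToSymbol(result_c, result_d, sizeof(float), 0,
                       cudaMemcpyDeviceToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    AddConstMemMultVec<<<blocks, threads>>>(a_d, b_d, c_d, n);

    cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    cudaFree(result_d);
}
```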

You could try a texture. At least in my last test, a global load from the same address in all threads was very slow, but a texture fetch from the same address was very fast. That also avoids a memcpy (~2 µs in the driver, even for a 4-byte one).
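
A rough sketch of the texture route. It uses the texture object API (cudaCreateTextureObject() and tex1Dfetch()), which is an assumption on my part: it needs a device of compute capability 3.0 or higher and may not be the mechanism the test above used. The names AddTexMultVec and add_via_texture() are hypothetical, and the scalar argument stands in for the value already sitting in result_d.

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Every thread fetches element 0 of a one-element linear texture
// that is bound to the scalar's location in global memory.
__global__ void AddTexMultVec(float* vecA, const float* vecB,
                              cudaTextureObject_t resultTex,
                              const float* vecC, int size)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;
    if (y < size) {
        float result = tex1Dfetch<float>(resultTex, 0);
        vecA[y] = vecB[y] + result * vecC[y];
    }
}

// Hypothetical host helper: binds result_d to a texture object, no memcpy of the scalar
// between the producing step and the consuming kernel.
void add_via_texture(float* a, const float* b, const float* c,
                     float scalar, int n)
{
    float *a_d, *b_d, *c_d, *result_d;
    size_t bytes = n * sizeof(float);

    cudaMalloc(&a_d, bytes);
    cudaMalloc(&b_d, bytes);
    cudaMalloc(&c_d, bytes);
    cudaMalloc(&result_d, sizeof(float));
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(c_d, c, bytes, cudaMemcpyHostToDevice);
    // Stand-in for the first kernel writing its result to device memory.
    cudaMemcpy(result_d, &scalar, sizeof(float), cudaMemcpyHostToDevice);

    // Describe the one-float linear buffer backing the texture.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = result_d;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = sizeof(float);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // return the raw float

    cudaTextureObject_t resultTex = 0;
    cudaCreateTextureObject(&resultTex, &resDesc, &texDesc, NULL);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    AddTexMultVec<<<blocks, threads>>>(a_d, b_d, resultTex, c_d, n);

    cudaMemcpy(a, a_d, bytes, cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(resultTex);
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    cudaFree(result_d);
}
```

In a real application the texture object would be created once and reused across launches, since creating it per call adds its own overhead.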

Thanks guys,
I will give both constant memory and texture fetching a try to see if I can spot any performance difference.