Simple question: memcpy inside a kernel?

I think this is a simple question. My kernel needs to copy values from one memory buffer to another (both are integer buffers). I would like to know the most efficient way to do this. If there were some sort of memcpy that I could call inside my kernel, I would be happy.


Hi, you can’t use cudaMemcpy() in the kernel, but you can use cudaMemcpy() with cudaMemcpyDeviceToDevice as the last parameter in your host code. This is the fastest way.
If you absolutely need to do it in your kernel, then you will have to read each variable separately from global memory and write it to the new buffer. I’d vote for cudaMemcpy().
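To make the two options above concrete, here is a minimal sketch. The names copyKernel, copyFromHost and copyWithKernel are made up for illustration, and the block size of 256 is an arbitrary assumption, not a tuned value. Both helpers expect pointers that already refer to device memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each thread reads one int from global memory
// and writes it to the destination buffer.
__global__ void copyKernel(int *dst, const int *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i];
    }
}

// Option 1: device-to-device copy issued from host code.
cudaError_t copyFromHost(int *d_dst, const int *d_src, size_t n)
{
    return cudaMemcpy(d_dst, d_src, n * sizeof(int), cudaMemcpyDeviceToDevice);
}

// Option 2: do the copy in a kernel, one element per thread.
cudaError_t copyWithKernel(int *d_dst, const int *d_src, size_t n)
{
    if (n == 0) return cudaSuccess;
    const unsigned int threads = 256;  // assumed block size
    const unsigned int blocks = (unsigned int)((n + threads - 1) / threads);
    copyKernel<<<blocks, threads>>>(d_dst, d_src, n);
    return cudaGetLastError();  // reports launch errors only
}
```

If the copy is only one step of a larger kernel, the body of copyKernel can simply be folded into that kernel instead of being launched separately.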

Thanks for answering. The problem is that I would like to parallelize these copies so that each kernel copies different parts of one array to another. And I would like to do it in the most efficient way possible.

I cannot understand why there’s no way to copy buffers efficiently inside the kernel. Doing it with for loops is a great pain.

One more question: if I declared my uint array as a uint4 array, I would be able to copy 4 uints in a single iteration, right? So I would need a quarter of the iterations, wouldn’t I?

Because there are countless ways for you to set up your block/grid configuration, and even more ways for the data you want to copy to be laid out. There cannot possibly be a single memcpy function that will efficiently copy memory in all cases.

You could. But then you’d have a quarter as many threads, which could potentially lead to worse scaling on newer GPUs. The usual practice in CUDA is to have each thread read a single data element to make the most of the parallel memory hardware.
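For comparison, the two variants discussed here could look roughly like the sketch below. The names copyUint, copyUint4 and copyVectorized are hypothetical. The uint4 version assumes both pointers are 16-byte aligned (true for pointers returned directly by cudaMalloc), and any elements left over when the count is not a multiple of 4 are handled by the one-element kernel. Which variant is faster depends on the GPU and the problem size, so it is worth timing both.

```cuda
#include <cuda_runtime.h>

// One element per thread: thread i copies src[i] to dst[i].
__global__ void copyUint(unsigned int *dst, const unsigned int *src, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i];
}

// Four elements per thread: each thread moves one uint4 (16 bytes).
// n4 is the number of uint4 elements, not the number of uints.
__global__ void copyUint4(uint4 *dst, const uint4 *src, size_t n4)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n4) dst[i] = src[i];
}

// Hypothetical host helper: uint4 copy for the bulk, scalar copy for the tail.
// Assumes d_dst and d_src are 16-byte aligned device pointers.
cudaError_t copyVectorized(unsigned int *d_dst, const unsigned int *d_src, size_t n)
{
    const unsigned int threads = 256;  // assumed block size
    const size_t n4 = n / 4;           // whole uint4 elements
    const size_t tail = n % 4;         // leftover uints (0 to 3)
    if (n4 > 0) {
        const unsigned int blocks = (unsigned int)((n4 + threads - 1) / threads);
        copyUint4<<<blocks, threads>>>(reinterpret_cast<uint4 *>(d_dst),
                                       reinterpret_cast<const uint4 *>(d_src), n4);
    }
    if (tail > 0) {
        copyUint<<<1, threads>>>(d_dst + 4 * n4, d_src + 4 * n4, tail);
    }
    return cudaGetLastError();  // reports launch errors only
}
```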


I have a similar problem as yours. What’s your solution to your problem? Did you find a more efficient way?




Though Mr_Nuke mentioned that we can use the cudaMemcpy() function in kernels, in the current API this function is documented as a host function. Was it changed recently, or is there another way of doing a memcpy inside the kernel?

Reread his answer: he said the opposite.

You can use memcpy() inside a kernel.
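A minimal sketch of what that could look like, assuming each thread is responsible for one contiguous chunk of the array. The names chunkCopy and launchChunkCopy and the chunk parameter are made up for illustration. Note that a memcpy() call in device code is executed by the single calling thread, so for a plain array copy this layout may well be slower than having each thread copy one element.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Hypothetical kernel: thread t copies elements [t*chunk, t*chunk + chunk)
// with a device-side memcpy(), clamped to the end of the array.
__global__ void chunkCopy(int *dst, const int *src, size_t n, size_t chunk)
{
    size_t t = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    size_t begin = t * chunk;
    if (begin >= n) return;
    size_t count = (n - begin < chunk) ? (n - begin) : chunk;
    memcpy(dst + begin, src + begin, count * sizeof(int));
}

// Hypothetical host helper that launches one thread per chunk.
cudaError_t launchChunkCopy(int *d_dst, const int *d_src, size_t n, size_t chunk)
{
    if (n == 0 || chunk == 0) return cudaSuccess;
    const unsigned int threads = 128;  // assumed block size
    const size_t numChunks = (n + chunk - 1) / chunk;
    const unsigned int blocks = (unsigned int)((numChunks + threads - 1) / threads);
    chunkCopy<<<blocks, threads>>>(d_dst, d_src, n, chunk);
    return cudaGetLastError();  // reports launch errors only
}
```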

For devices that support the device runtime API (covered in the programming guide), you can also use cudaMemcpyAsync() inside a kernel; as far as I know, the device runtime only provides the asynchronous memcpy variants.