I know I can do a device-to-device memcpy from the host program with cudaMemcpy.
But can I do a device-to-device bulk copy inside a kernel? Is there any supported function (like memcpy) that we can use inside a kernel, or is looping the only way to go?
P.S. When I use memcpy inside a kernel, nvcc reports an ACCESS VIOLATION with a "ptxas died" message.
I don’t think there is a function that copies memory the way you want.
Why do you want a loop in your kernel?
Just configure your kernel launch so that there are as many threads as elements, and copy one element per thread.
Perhaps you can run such a copy kernel after another kernel has finished.
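A minimal sketch of such a copy kernel, assuming the elements are floats and a block size of 256 threads. The names copyKernel and copyRoundTrip are made up for illustration, and the host helper is only there to show a launch and check the result:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread copies exactly one element from src to dst.
// Both pointers must refer to device memory.
__global__ void copyKernel(const float* src, float* dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // the last block may have spare threads
        dst[i] = src[i];
    }
}

// Hypothetical host helper (assumes n > 0): uploads n floats, copies them
// device-to-device with copyKernel, and checks the copy against the input.
bool copyRoundTrip(int n)
{
    std::vector<float> host(n), result(n, -1.0f);
    for (int i = 0; i < n; ++i) host[i] = static_cast<float>(i);

    float* dSrc = nullptr;
    float* dDst = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    if (cudaMalloc(&dSrc, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dDst, bytes) != cudaSuccess) { cudaFree(dSrc); return false; }
    cudaMemcpy(dSrc, host.data(), bytes, cudaMemcpyHostToDevice);

    // As many threads as elements, rounded up to whole blocks.
    int threads = 256;           // assumed block size
    int blocks = (n + threads - 1) / threads;
    copyKernel<<<blocks, threads>>>(dSrc, dDst, n);

    // This blocking copy on the default stream runs after the kernel finishes.
    cudaError_t err = cudaMemcpy(result.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    return err == cudaSuccess && result == host;
}
```

Kernels launched into the same stream run in launch order, so a copy kernel like this can simply be launched right after the kernel that produces the data.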
That is exactly what I did. I think that if you are stuck in a position like mine, it shows that you are heading in the wrong direction. So I reorganized my kernel to make it “really parallel”, and as you said, we can launch as many kernels as we like.