I know I can do a device-to-device memcpy from the host program with cudaMemcpy.
But can I do a device-to-device bulk copy inside a kernel? Is there any supported function (like memcpy) that we can use inside a kernel, or is looping the only way to go?
P.S. When I use memcpy inside a kernel, nvcc reports an ACCESS VIOLATION with a "ptxas died" message.
That was exactly what I did. I think that if you are stuck in a position like mine, it shows that you are heading in the wrong direction. So I reorganized my kernel to make it "real parallel", and like you said, we can call as many kernels as we like.
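
For anyone who lands here later, this is roughly what a "real parallel" copy can look like. It is only a minimal sketch, not the kernel from my project: the name `copyKernel`, the `float` element type, the buffer size and the launch configuration are all assumptions made up for illustration. The idea is that instead of one thread calling memcpy over the whole buffer, every thread copies its own elements in a grid-stride loop.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread copies the elements i, i + stride, i + 2*stride, ...
// so the whole grid cooperates on one device-to-device copy.
__global__ void copyKernel(const float *src, float *dst, size_t n)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        dst[i] = src[i];
}

int main()
{
    const size_t n = 1 << 20;               // assumed buffer size
    std::vector<float> host(n), back(n, 0.0f);
    for (size_t i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dSrc = nullptr, *dDst = nullptr;
    cudaMalloc((void **)&dSrc, n * sizeof(float));
    cudaMalloc((void **)&dDst, n * sizeof(float));
    cudaMemcpy(dSrc, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Assumed launch configuration; the grid-stride loop works with any grid size.
    copyKernel<<<256, 256>>>(dSrc, dDst, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Copy the destination buffer back and compare it with the original data.
    cudaMemcpy(back.data(), dDst, n * sizeof(float), cudaMemcpyDeviceToHost);
    bool ok = true;
    for (size_t i = 0; i < n; ++i)
        if (back[i] != host[i]) { ok = false; break; }
    printf(ok ? "copy ok\n" : "copy mismatch\n");

    cudaFree(dSrc);
    cudaFree(dDst);
    return ok ? 0 : 1;
}
```

Consecutive threads touch consecutive addresses, so the accesses should coalesce. If the copy does not need to happen in the middle of a kernel, calling cudaMemcpy with cudaMemcpyDeviceToDevice from the host between two kernel launches is the simpler option.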