I have a simple question. I want to do a fast copy from one array to another on the GPU for my genetic algorithm. The problem is that I get this error during compilation:
memcpy is not supported on the device. You can either call cudaMemcpy with cudaMemcpyDeviceToDevice from host code, or replace the memcpy in your device code with a copy done by threads, each thread copying one element.
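The first option (a device-to-device cudaMemcpy issued from the host) could look roughly like the sketch below. The helper name deviceToDeviceCopyMismatches and the test data are made up for illustration; only the cudaMalloc, cudaMemcpy and cudaFree calls come from the CUDA runtime API.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper (not from the original post): copies `count` ints from
// one device buffer to another and returns the number of mismatching elements
// found after copying the result back to the host, or -1 on a CUDA error.
int deviceToDeviceCopyMismatches(int count) {
    size_t bytes = count * sizeof(int);
    std::vector<int> host(count), result(count, -1);
    for (int i = 0; i < count; ++i) host[i] = i;

    int *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc((void **)&dSrc, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dDst, bytes) != cudaSuccess) {
        cudaFree(dSrc);
        return -1;
    }

    // Upload the test data, then copy it between the two device buffers.
    cudaMemcpy(dSrc, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dDst, dSrc, bytes, cudaMemcpyDeviceToDevice);

    // Download the copy so it can be checked on the host.
    cudaError_t err = cudaMemcpy(result.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
        if (result[i] != host[i]) ++mismatches;
    return mismatches;
}
```

This only works when the copy can be issued from the host between kernel launches. If the copy has to happen in the middle of a kernel, the second option is the one to use.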
I use the following template function for copying anything, assuming its size is a multiple of 4 bytes and both pointers are 4-byte aligned:
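    template <typename T>
    __device__ void memCopy(T *destination, const T *source, int size) {
        // Copy in 4-byte words; size is the number of T elements.
        int *dest = (int *)destination;
        const int *src = (const int *)source;
        int words = size * sizeof(T) / sizeof(int);
        // The threads of the block stride over the words together.
        for (int tid = threadIdx.x; tid < words; tid += blockDim.x)
            dest[tid] = src[tid];
    }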
Although this for loop looks ugly, the function is inlined, so the compiler knows the size of T at compile time and, in most cases, the value of the size parameter as well. When it does, the loop can be simplified or even unrolled!
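As an illustration of how memCopy might be called, here is a minimal sketch in which each block copies one genome of a population. The kernel copyPopulation, the helper copyPopulationMismatches, the float gene type and the block size of 128 are all assumptions made for the example, not part of the original code.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same block-cooperative copy as above, repeated so the sketch is self-contained.
template <typename T>
__device__ void memCopy(T *destination, const T *source, int size) {
    int *dest = (int *)destination;
    const int *src = (const int *)source;
    int words = size * sizeof(T) / sizeof(int);
    for (int tid = threadIdx.x; tid < words; tid += blockDim.x)
        dest[tid] = src[tid];
}

// Hypothetical kernel: block b copies genome b from src to dst.
__global__ void copyPopulation(float *dst, const float *src, int genomeLength) {
    int offset = blockIdx.x * genomeLength;
    memCopy(dst + offset, src + offset, genomeLength);
    // A __syncthreads() would be needed here before any thread reads
    // elements that were written by other threads of the block.
}

// Hypothetical host helper: returns the number of mismatching elements
// after the copy, or -1 on a CUDA error.
int copyPopulationMismatches(int numGenomes, int genomeLength) {
    int total = numGenomes * genomeLength;
    size_t bytes = total * sizeof(float);
    std::vector<float> host(total), result(total, -1.0f);
    for (int i = 0; i < total; ++i) host[i] = 0.5f * i;

    float *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc((void **)&dSrc, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dDst, bytes) != cudaSuccess) {
        cudaFree(dSrc);
        return -1;
    }
    cudaMemcpy(dSrc, host.data(), bytes, cudaMemcpyHostToDevice);

    // One block per genome, 128 threads per block (blockDim.y == blockDim.z == 1).
    copyPopulation<<<numGenomes, 128>>>(dDst, dSrc, genomeLength);

    cudaError_t err = cudaMemcpy(result.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < total; ++i)
        if (result[i] != host[i]) ++mismatches;
    return mismatches;
}
```

Because consecutive threads touch consecutive 4-byte words, the global memory accesses of a warp should coalesce, which is where the speed comes from.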
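I assume blockDim.y == 1 and blockDim.z == 1. Otherwise the copy is still correct, but threads that share the same threadIdx.x repeat the same work, so more tweaking is needed to make it efficient.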
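One possible tweak for 2D or 3D blocks is to flatten the thread index and stride by the total number of threads in the block. The sketch below is only one way to do it. The names memCopyAnyBlock, copyKernel and anyBlockCopyMismatches, and the 8x4x2 block shape, are invented for the example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Variant of memCopy that works for any block shape: every thread gets a
// unique linear index, so no word is copied twice.
template <typename T>
__device__ void memCopyAnyBlock(T *destination, const T *source, int size) {
    int *dest = (int *)destination;
    const int *src = (const int *)source;
    // threadIdx.x varies fastest, matching how threads are grouped into warps.
    int tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int nThreads = blockDim.x * blockDim.y * blockDim.z;
    int words = size * sizeof(T) / sizeof(int);
    for (int i = tid; i < words; i += nThreads)
        dest[i] = src[i];
}

// Hypothetical kernel: a single block copies the whole array.
__global__ void copyKernel(int *dst, const int *src, int count) {
    memCopyAnyBlock(dst, src, count);
}

// Hypothetical host helper: returns the number of mismatching elements
// after the copy, or -1 on a CUDA error.
int anyBlockCopyMismatches(int count) {
    size_t bytes = count * sizeof(int);
    std::vector<int> host(count), result(count, -1);
    for (int i = 0; i < count; ++i) host[i] = 3 * i + 1;

    int *dSrc = nullptr, *dDst = nullptr;
    if (cudaMalloc((void **)&dSrc, bytes) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dDst, bytes) != cudaSuccess) {
        cudaFree(dSrc);
        return -1;
    }
    cudaMemcpy(dSrc, host.data(), bytes, cudaMemcpyHostToDevice);

    // One block of 8 x 4 x 2 = 64 threads.
    copyKernel<<<1, dim3(8, 4, 2)>>>(dDst, dSrc, count);

    cudaError_t err = cudaMemcpy(result.data(), dDst, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dSrc);
    cudaFree(dDst);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < count; ++i)
        if (result[i] != host[i]) ++mismatches;
    return mismatches;
}
```

The extra index arithmetic costs a few integer operations per thread, so for 1D blocks the simpler version is probably still the better choice.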