memcpy on GPU: is it possible?


I just need one simple piece of information. I want to do a FAST copy between two arrays on the GPU for my genetic algorithm. The problem is that I get this error during compilation:

Do I really need to write a for loop?

Here is my code; it's simple to understand!


__device__ void crossover(char* parent1, char* parent2, char* child1, char* child2, float random_number1, float random_number2)
{
	// dependent on the crossover rate
	if (random_number1 < CROSSOVER_RATE) {
		// create a random crossover point
		int crossover_point = (int) (random_number2 * CHROMO_LENGTH);
		// child1 = head of parent1 + tail of parent2
		memcpy(child1, parent1, crossover_point * sizeof(char));
		memcpy(child1 + crossover_point, parent2 + crossover_point, (CHROMO_LENGTH - crossover_point) * sizeof(char));
		// child2 = head of parent2 + tail of parent1
		memcpy(child2, parent2, crossover_point * sizeof(char));
		memcpy(child2 + crossover_point, parent1 + crossover_point, (CHROMO_LENGTH - crossover_point) * sizeof(char));
	}
	else {
		// no crossover: the children are plain copies of the parents
		memcpy(child1, parent1, CHROMO_LENGTH * sizeof(char));
		memcpy(child2, parent2, CHROMO_LENGTH * sizeof(char));
	}
}


memcpy is not supported on the device. You can either call cudaMemcpy with cudaMemcpyDeviceToDevice from the host code, or replace the memcpy calls in your device code with a threaded copy in which each thread copies one element.
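For what it's worth, here is a minimal sketch of those two options side by side. The names copyOnePerThread and copyBothWays are made up for illustration, the 256-thread block size is an arbitrary choice, and n is assumed to be greater than zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Option 2: a kernel in which each thread copies one element.
__global__ void copyOnePerThread(char *dst, const char *src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {                 // guard the last, partially filled block
        dst[i] = src[i];
    }
}

// Hypothetical host helper: copies n bytes through both paths and
// returns true if both destinations match the source. Assumes n > 0.
bool copyBothWays(int n) {
    std::vector<char> host(n), outA(n), outB(n);
    for (int i = 0; i < n; ++i) host[i] = static_cast<char>(i % 2);

    char *src = nullptr, *dstA = nullptr, *dstB = nullptr;
    cudaMalloc((void **)&src, n);
    cudaMalloc((void **)&dstA, n);
    cudaMalloc((void **)&dstB, n);
    cudaMemcpy(src, host.data(), n, cudaMemcpyHostToDevice);

    // Option 1: device-to-device copy issued from host code.
    cudaMemcpy(dstA, src, n, cudaMemcpyDeviceToDevice);

    // Option 2: launch enough 256-thread blocks to cover n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    copyOnePerThread<<<blocks, threads>>>(dstB, src, n);
    bool ok = (cudaGetLastError() == cudaSuccess);

    // cudaMemcpy waits for the kernel to finish before copying back.
    ok = ok && cudaMemcpy(outA.data(), dstA, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(outB.data(), dstB, n, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(src);
    cudaFree(dstA);
    cudaFree(dstB);
    return ok && outA == host && outB == host;
}
```

Option 1 only helps when the copy can be issued between kernel launches; a copy that has to happen in the middle of a device function needs something like option 2.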

Dlm is right.

I use the following template function for copying anything, assuming its size is a multiple of 4 bytes:

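template <typename T>
__device__ void memCopy(T *destination, T *source, int size) {
	int *dest = (int *)destination;
	int *src = (int *)source;
	// the threads of the block share the work, one 4-byte word at a time
	for (int tid = threadIdx.x; tid < size * sizeof(T) / 4; tid += blockDim.x)
		dest[tid] = src[tid];
}
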

Note that although this for loop looks ugly, the function is inlined, so the compiler knows the size of T at compile time and (in most cases) the value of the size parameter as well. When that happens, the for loop may be simplified or even unrolled!
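To tie this back to the crossover in the question, here is a rough sketch of the same strided-loop idea applied byte by byte. It assumes one thread block works on one parent pair, which the original post does not state. It copies chars rather than ints because child1 + crossover_point is generally not 4-byte aligned, so the int cast would not be safe there. The values of CHROMO_LENGTH and CROSSOVER_RATE are placeholders, and crossoverBlock, crossoverKernel and crossoverDemo are made-up names.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define CHROMO_LENGTH 300      // placeholder value, not from the thread
#define CROSSOVER_RATE 0.7f    // placeholder value, not from the thread

// Assumption: all threads of one block cooperate on one parent pair.
__device__ void crossoverBlock(const char *parent1, const char *parent2,
                               char *child1, char *child2,
                               float random_number1, float random_number2) {
    // Without crossover the point sits at the end, so the children
    // come out as plain copies of their parents.
    int point = CHROMO_LENGTH;
    if (random_number1 < CROSSOVER_RATE) {
        point = (int)(random_number2 * CHROMO_LENGTH);
    }
    // Strided loop: thread t handles genes t, t + blockDim.x, ...
    for (int i = threadIdx.x; i < CHROMO_LENGTH; i += blockDim.x) {
        bool head = i < point;
        child1[i] = head ? parent1[i] : parent2[i];
        child2[i] = head ? parent2[i] : parent1[i];
    }
}

__global__ void crossoverKernel(const char *p1, const char *p2,
                                char *c1, char *c2, float r1, float r2) {
    crossoverBlock(p1, p2, c1, c2, r1, r2);
}

// Hypothetical host check: parents are all 'A' and all 'B'.
bool crossoverDemo(float r1, float r2) {
    const int n = CHROMO_LENGTH;
    std::vector<char> a(n, 'A'), b(n, 'B'), o1(n), o2(n);
    char *p1, *p2, *c1, *c2;
    cudaMalloc((void **)&p1, n);
    cudaMalloc((void **)&p2, n);
    cudaMalloc((void **)&c1, n);
    cudaMalloc((void **)&c2, n);
    cudaMemcpy(p1, a.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(p2, b.data(), n, cudaMemcpyHostToDevice);

    crossoverKernel<<<1, 128>>>(p1, p2, c1, c2, r1, r2);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = ok && cudaMemcpy(o1.data(), c1, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    ok = ok && cudaMemcpy(o2.data(), c2, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(p1); cudaFree(p2); cudaFree(c1); cudaFree(c2);

    // Recompute the expected children on the host and compare.
    int point = (r1 < CROSSOVER_RATE) ? (int)(r2 * n) : n;
    for (int i = 0; i < n && ok; ++i) {
        ok = (o1[i] == (i < point ? 'A' : 'B')) &&
             (o2[i] == (i < point ? 'B' : 'A'));
    }
    return ok;
}
```

If other threads in the block read the children afterwards, a __syncthreads() is needed between the copy and those reads.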

I assume blockDim.y == blockDim.z == 1. Otherwise, more tweaking is needed to make it efficient.
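One possible tweak for 2D or 3D blocks, as a sketch only: flatten the thread index and stride by the total number of threads in the block. With the x-only loop, threads that share a threadIdx.x would all copy the same words, which is still correct but redundant. The names memCopyAnyBlock, copyKernel3D and copyWith3DBlock are hypothetical, and the 8x4x2 block shape is an arbitrary choice.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant: works for any block shape by using a
// flattened thread index and the total thread count as the stride.
__device__ void memCopyAnyBlock(int *dest, const int *src, int words) {
    int tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int stride = blockDim.x * blockDim.y * blockDim.z;
    for (int i = tid; i < words; i += stride) {
        dest[i] = src[i];
    }
}

__global__ void copyKernel3D(int *dest, const int *src, int words) {
    memCopyAnyBlock(dest, src, words);
}

// Hypothetical host check using a single 8x4x2 block (64 threads).
bool copyWith3DBlock(int words) {
    std::vector<int> host(words), out(words, -1);
    for (int i = 0; i < words; ++i) host[i] = i * 3 + 1;

    int *src = nullptr, *dst = nullptr;
    size_t bytes = words * sizeof(int);
    cudaMalloc((void **)&src, bytes);
    cudaMalloc((void **)&dst, bytes);
    cudaMemcpy(src, host.data(), bytes, cudaMemcpyHostToDevice);

    copyKernel3D<<<1, dim3(8, 4, 2)>>>(dst, src, words);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = ok && cudaMemcpy(out.data(), dst, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;

    cudaFree(src);
    cudaFree(dst);
    return ok && out == host;
}
```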