memcpy on GPU: Is it possible?

helmoz · May 27, 2009, 8:49pm

Hi,

I just need one simple piece of information. I want to do a FAST copy of two arrays on the GPU for my genetic algorithm. The problem is that I get this error during compilation:

Do I really need to write a for-loop?

Here is my code; it’s simple to understand!

Thanks

__device__ void crossover(char* parent1, char* parent2, char* child1, char* child2, float random_number1, float random_number2)
{
	//dependent on the crossover rate
	if (random_number1 < CROSSOVER_RATE) {
		//create a random crossover point
		int crossover_point = (int) (random_number2 * CHROMO_LENGTH);
		memcpy(child1, parent1, crossover_point * sizeof(char));
		memcpy(child1 + crossover_point, parent2 + crossover_point, (CHROMO_LENGTH - crossover_point) * sizeof(char));
		memcpy(child2, parent2, crossover_point * sizeof(char));
		memcpy(child2 + crossover_point, parent1 + crossover_point, (CHROMO_LENGTH - crossover_point) * sizeof(char));
	}
	else {
		memcpy(child1, parent1, CHROMO_LENGTH * sizeof(char));
		memcpy(child2, parent2, CHROMO_LENGTH * sizeof(char));
	}
}

dlmeetei · May 28, 2009, 3:43am

memcpy is not supported in device code. You can either call cudaMemcpy with the device-to-device copy kind (cudaMemcpyDeviceToDevice) from host code, or replace the memcpy calls in your device code with a copy done by the threads, each thread copying one element.
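As a rough sketch of both options, assuming a plain byte array: the kernel name `copyKernel`, the buffer names, the array length, and the launch configuration are all hypothetical and not taken from the original code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Option 2: one thread per element; each thread copies a single byte.
__global__ void copyKernel(char *dst, const char *src, int n) {
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (i < n)  // guard: the last block may extend past n
		dst[i] = src[i];
}

int main() {
	const int n = 1024;  // assumed array length, for illustration only
	char host[n], back[n];
	for (int i = 0; i < n; ++i) host[i] = (char)(i % 128);

	char *d_src, *d_a, *d_b;
	cudaMalloc(&d_src, n);
	cudaMalloc(&d_a, n);
	cudaMalloc(&d_b, n);
	cudaMemcpy(d_src, host, n, cudaMemcpyHostToDevice);

	// Option 1: device-to-device copy issued from host code.
	cudaMemcpy(d_a, d_src, n, cudaMemcpyDeviceToDevice);

	// Option 2: launch enough 256-thread blocks to cover all n elements.
	int threads = 256;
	int blocks = (n + threads - 1) / threads;
	copyKernel<<<blocks, threads>>>(d_b, d_src, n);

	// This blocking copy on the default stream runs after the kernel finishes.
	cudaMemcpy(back, d_b, n, cudaMemcpyDeviceToHost);
	printf("last byte copied correctly: %s\n",
	       back[n - 1] == host[n - 1] ? "yes" : "no");

	cudaFree(d_src);
	cudaFree(d_a);
	cudaFree(d_b);
	return 0;
}
```

Option 1 only works from the host, so it cannot be called from inside a `__device__` function such as `crossover`; option 2 is the one that can replace the memcpy calls there.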

Sarnath · May 28, 2009, 5:02am

Dlm is right.

Cygnus_X1 · May 28, 2009, 1:24pm

I use the following template function for copying anything, assuming its size is a multiple of 4 bytes:

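template <typename T>
__device__ void memCopy(T *destination, const T *source, int size) {
	// Copy as 32-bit words: both pointers must be 4-byte aligned and
	// size*sizeof(T) must be a multiple of 4.
	int *dest = (int *)destination;
	const int *src = (const int *)source;
	// The threads of the block share the work: each thread starts at its own
	// index and advances by blockDim.x words per iteration.
	for (int tid = threadIdx.x; tid < size * sizeof(T) / 4; tid += blockDim.x)
		dest[tid] = src[tid];
}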

Note that although this for loop looks ugly, the function is inlined, and the compiler knows the size of T at compile time and (in most cases) the value of the size parameter as well. When that happens, the for loop may be simplified or even unrolled!

I assume blockDim.y == blockDim.z == 1. Otherwise more tweaking is needed to make it efficient.
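For blocks with more than one dimension, one possible tweak is to flatten the thread index before striding, so that every thread of the block takes a distinct share of the words. This is only a sketch: `memCopyND`, `copyBlock`, and the block shape are hypothetical, and it keeps the same assumptions as memCopy (4-byte-aligned pointers, total size a multiple of 4 bytes).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical variant of memCopy for 1D, 2D, or 3D thread blocks.
template <typename T>
__device__ void memCopyND(T *destination, const T *source, int size) {
	int *dest = (int *)destination;
	const int *src = (const int *)source;
	// Flatten the (x, y, z) thread index into one linear index.
	int tid = threadIdx.x + blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
	int nthreads = blockDim.x * blockDim.y * blockDim.z;
	int words = (int)(size * sizeof(T) / 4);
	for (int i = tid; i < words; i += nthreads)
		dest[i] = src[i];
}

// Every thread of the block calls the copy, then the block synchronizes.
__global__ void copyBlock(float *dst, const float *src, int n) {
	memCopyND(dst, src, n);
	__syncthreads();  // copied data is now visible to all threads in the block
}

int main() {
	const int n = 256;  // assumed element count, for illustration only
	float h_src[n], h_dst[n];
	for (int i = 0; i < n; ++i) h_src[i] = (float)i;

	float *d_src, *d_dst;
	cudaMalloc(&d_src, n * sizeof(float));
	cudaMalloc(&d_dst, n * sizeof(float));
	cudaMemcpy(d_src, h_src, n * sizeof(float), cudaMemcpyHostToDevice);

	copyBlock<<<1, dim3(8, 4, 2)>>>(d_dst, d_src, n);  // one 3D block of 64 threads

	cudaMemcpy(h_dst, d_dst, n * sizeof(float), cudaMemcpyDeviceToHost);
	bool ok = true;
	for (int i = 0; i < n; ++i) ok = ok && (h_dst[i] == h_src[i]);
	printf("copy matches source: %s\n", ok ? "yes" : "no");

	cudaFree(d_src);
	cudaFree(d_dst);
	return 0;
}
```

Because the threads copy cooperatively, all threads of the block have to reach the call, and a `__syncthreads()` is needed before any thread reads an element that a different thread copied.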