Problems with cudaMemcpy

Hello,

I am parallelizing a self-written PSO (particle swarm optimization) code. Here is
a code fragment that unfortunately does not work.

C_struct_Particle * C_struct_Swarm_optimize(C_struct_Swarm * s) {
	C_struct_Swarm *gpu__s;

	float *f = (float *)malloc(sizeof(*f));	/* size of one float, not of the pointer */
	float *gpu__f;

	static unsigned int gpuBytes = sizeof(C_struct_Particle *)
			+ ((1024 * 1024) * sizeof(C_struct_Particle *));
	

	static unsigned int gpuBytes_f = sizeof(*f);

	CUDA_SAFE_CALL(cudaMalloc(((void * *) (&gpu__s)), gpuBytes));

	printf("After cuda Malloc calculated sizeof = %u\n", gpuBytes);
	printf("After cuda Malloc calculated sizeof(s) =  %zu\n", sizeof(s));
	printf("After cuda Malloc calculated sizeof(*s) =  %zu\n", sizeof(*s));

	CUDA_SAFE_CALL(cudaMalloc(((void * *) (&gpu__f)), gpuBytes_f));

	for (int j = 0; j < 20; j++) {

		int err = cudaMemcpy(gpu__s, s, gpuBytes, cudaMemcpyHostToDevice);
		printf("cudaMemcpy err code = %d\n", err);

		err = cudaMemcpy(gpu__f, f, gpuBytes_f, cudaMemcpyHostToDevice);
		printf("cudaMemcpy for FLOAT +++ err code = %d\n", err);

		C_struct_Swarm_optimize_kernel0<<<dimGrid0, dimBlock0, 0, 0>>>();

	}
	return 0;
}

In the first and second(!) iterations, both cudaMemcpy calls succeed with error code 0. In the remaining 18 iterations they return error code 4. If I comment out the kernel call "C_struct_Swarm_optimize_kernel0<<<dimGrid0, dimBlock0, 0, 0>>>();", both cudaMemcpy calls succeed in every(!) iteration.

I know that this code is incomplete, but I just don’t understand why the errors appear in cudaMemcpy.

Hello,

I do not think the problem is with cudaMemcpy itself; that is just where the error is reported. Kernel launches are asynchronous, so my guess is that the kernel fails and the error only surfaces at the next cudaMemcpy call. I am not sure whether the launch C_struct_Swarm_optimize_kernel0<<<dimGrid0, dimBlock0, 0, 0>>>() itself is the problem, but I would leave out the “, 0, 0” part if you are not using dynamic shared memory or streams. Of course, it might also be that you are using too many threads per block or too many blocks, or that the bug is inside the kernel code.
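
To narrow down where such an error really comes from, it usually helps to check for errors right after the launch rather than at the next copy. The following is only a minimal sketch: dummy_kernel and the check() helper are made-up names standing in for the real kernel and for whatever error macro is in use, and cudaDeviceSynchronize() assumes a reasonably recent toolkit (older ones used cudaThreadSynchronize() for the same purpose).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel from the question.
__global__ void dummy_kernel(float *data) {
    data[threadIdx.x] = 1.0f;
}

// Hypothetical helper: print a readable message instead of a bare error number.
static bool check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main() {
    float *gpu_data = NULL;
    if (!check(cudaMalloc((void **)&gpu_data, 32 * sizeof(float)), "cudaMalloc"))
        return 1;

    dummy_kernel<<<1, 32>>>(gpu_data);

    // Launch-configuration errors (e.g. too many threads per block) show up here.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs show up once the host waits for it.
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(gpu_data);
    return ok ? 0 : 1;
}
```

With a check like this after each launch, a failing kernel is reported at the launch itself and not at an unrelated cudaMemcpy one iteration later. The synchronize call stalls the host, so it is probably best kept for debugging.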

Thanks for answering.

I found the problem: there were some memory management faults within the kernel. I called the normal host malloc() in a function used by the kernel. Of course this can’t work, because the kernel runs on the GPU.

Regards
sw
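
For anyone who runs into the same thing: one common way to avoid calling malloc() from kernel code is to allocate the working memory from the host with cudaMalloc and pass the pointer into the kernel. The sketch below is not the code from this thread; the kernel use_scratch, the buffer sizes, and the per-thread slicing are all assumptions made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread works in its own slice of a buffer that
// the host allocated with cudaMalloc, instead of allocating on the device.
__global__ void use_scratch(float *scratch, int per_thread, float *out) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *mine = scratch + tid * per_thread;  // this thread's private slice
    float sum = 0.0f;
    for (int i = 0; i < per_thread; ++i) {
        mine[i] = (float)i;
        sum += mine[i];
    }
    out[tid] = sum;
}

int main() {
    const int threads = 64, per_thread = 8;  // illustrative sizes only
    float *scratch = NULL, *out = NULL;
    float host_out[threads] = {0};

    // All device memory is allocated up front, from host code.
    if (cudaMalloc((void **)&scratch, threads * per_thread * sizeof(float)) != cudaSuccess ||
        cudaMalloc((void **)&out, threads * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    use_scratch<<<1, threads>>>(scratch, per_thread, out);

    // This blocking copy waits for the kernel, so a kernel failure is reported here.
    cudaError_t err = cudaMemcpy(host_out, out, sizeof(host_out), cudaMemcpyDeviceToHost);
    printf("copy back: %s, out[0] = %f\n", cudaGetErrorString(err), host_out[0]);

    cudaFree(scratch);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

This only fits when the amount of memory each thread needs is known before the launch; if it is not, the buffer has to be sized for the worst case.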