HELP NEEDED! cudamemcpy

prawler3000 · March 14, 2008, 8:20pm

I am having trouble with cudaMemcpy. I have an array that I send to a CUDA kernel, which is supposed to return a modified copy of the same length and type. However, when I copy the data back from the device to the host, I do not get the correct numbers! I am getting a bit frustrated; if anyone has any insight, please let me know.

Thanks, much appreciated!

unsigned int * cpu_in_image = (unsigned int *) malloc(mem_size);
unsigned int * cpu_out_image = (unsigned int *) malloc(mem_size);

......

unsigned int* gpu_in_image;
CUDA_SAFE_CALL(cudaMalloc((void**)&gpu_in_image, mem_size));

//copy CPU memory to GPU memory
CUDA_SAFE_CALL(cudaMemcpy(gpu_in_image, cpu_in_image, mem_size, cudaMemcpyHostToDevice));

//allocate GPU memory for results
unsigned int *gpu_out_image;
CUDA_SAFE_CALL(cudaMalloc((void**)&gpu_out_image, mem_size));

//setup execution parameters
dim3 grid(1, 1, 1);
dim3 threads(num_threads, 1, 1); //Cannot have more than 2^16 Threads. That is, 256*256

//execute kernel
negativeKernel<<<grid, threads, 0>>>(gpu_in_image, gpu_out_image);
CUT_CHECK_ERROR("Kernel execution failed");

//Copy GPU Results Memory to CPU Results Memory
CUDA_SAFE_CALL(cudaMemcpy(cpu_out_image, gpu_out_image, mem_size, cudaMemcpyDeviceToHost)); //GPU OUT IMAGE IS MESSED UP

I then use a for loop to display my contents before and after. They are not the correct results.

The kernel is as follows:

__global__ void
negativeKernel( unsigned int* gpu_in_image, unsigned int* gpu_out_image)
{
	__syncthreads();
	int tid = blockIdx.x * blockDim.x+threadIdx.x;
	gpu_out_image[tid] =  gpu_in_image[tid];
	__syncthreads();
}

I don’t know what to do with this anymore!

DenisR · March 14, 2008, 9:11pm

dim3 threads(num_threads, 1, 1); //Cannot have more than 2^16 Threads. That is, 256*256

What is num_threads? You cannot have more than 512 threads per block, not 2^16. That looks like your problem: the kernel never executed because num_threads is too big, so you are copying uninitialized GPU memory into cpu_out_image.

CUT_CHECK_ERROR("Kernel execution failed"); should catch this, but it is only active in debug mode, so my guess is that you are compiling in release mode.
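To make the point above concrete, here is a minimal sketch of a launch check that does not depend on the build mode. The helper names (fitsInOneBlock, launchChecked, copyKernel) are made up for illustration and are not part of the SDK; the sketch also uses the current runtime API names (cudaGetLastError, cudaDeviceSynchronize), which may differ from what the 2008 toolkit offered, and it assumes device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to demonstrate launch checking.
__global__ void copyKernel(const unsigned int* in, unsigned int* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Hypothetical helper: true if numThreads fits in a single block.
bool fitsInOneBlock(int numThreads, int maxThreadsPerBlock)
{
    return numThreads > 0 && numThreads <= maxThreadsPerBlock;
}

// Hypothetical helper: launch one block and report errors in any build mode.
cudaError_t launchChecked(const unsigned int* d_in, unsigned int* d_out,
                          int numThreads)
{
    // Ask the device (assumed to be device 0) for its per-block thread limit.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) return err;

    if (!fitsInOneBlock(numThreads, prop.maxThreadsPerBlock)) {
        fprintf(stderr, "%d threads requested, device allows %d per block\n",
                numThreads, prop.maxThreadsPerBlock);
        return cudaErrorInvalidConfiguration;
    }

    copyKernel<<<1, numThreads>>>(d_in, d_out);

    // Launch-time errors (for example an invalid configuration) show up here.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel was running show up after a sync.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

With a check like this, a launch with 128*128 threads in one block would print a message and return an error instead of silently leaving the output buffer untouched.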

MisterAnderson42 · March 15, 2008, 1:46pm

Also, as a side note, you don't need the __syncthreads() calls in your kernel. They are only needed when threads in a block cooperate through shared memory.
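To illustrate the difference, here is a rough sketch of one kernel that needs no barrier and one that does. BLOCK_SIZE, mirroredSlot and reverseInBlock are names invented for this example; it assumes each block is launched with exactly BLOCK_SIZE threads and that the array length is a multiple of BLOCK_SIZE.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size for this sketch

// Hypothetical helper: index of the element a thread reads back,
// counted from the other end of the block.
__host__ __device__ inline int mirroredSlot(int t, int n)
{
    return n - 1 - t;
}

// No barrier needed: each thread reads and writes only its own element.
__global__ void copyKernel(const unsigned int* in, unsigned int* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    out[tid] = in[tid];
}

// Barrier needed: each thread reads a shared-memory slot that a
// different thread of the same block wrote.
__global__ void reverseInBlock(const unsigned int* in, unsigned int* out)
{
    __shared__ unsigned int tile[BLOCK_SIZE];

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[tid];

    // Wait until every thread in the block has stored its element.
    __syncthreads();

    out[tid] = tile[mirroredSlot(threadIdx.x, blockDim.x)];
}
```

Without the __syncthreads() in reverseInBlock, a thread could read a slot of tile before the thread responsible for it has written it.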

prawler3000 · March 18, 2008, 3:53pm

Thanks guys, that makes sense. My thread count is probably too big: there are definitely more than 512 threads per block. I am compiling in release mode, following some of the examples that came with the software.

num_threads is my thread count, but at this point, it was 128*128, so I need to cut that down.
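One common way to cut the per-block count down is to spread the same elements over several blocks. The sketch below assumes 256 threads per block and adds an element count parameter for a bounds check; blocksFor, negativeKernelBounded and launchNegative are hypothetical names, not the code from the original post.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements,
// i.e. n / threadsPerBlock rounded up.
inline int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Variant of the copy kernel with a bounds check, so that n does not
// have to be a multiple of the block size.
__global__ void negativeKernelBounded(const unsigned int* in,
                                      unsigned int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = in[tid];
    }
}

// Hypothetical launcher: d_in and d_out are device pointers to n elements.
void launchNegative(const unsigned int* d_in, unsigned int* d_out, int n)
{
    const int threadsPerBlock = 256;  // assumed; must not exceed the device limit
    dim3 threads(threadsPerBlock, 1, 1);
    dim3 grid(blocksFor(n, threadsPerBlock), 1, 1);
    negativeKernelBounded<<<grid, threads>>>(d_in, d_out, n);
}
```

For n = 128*128 = 16384 elements this gives 64 blocks of 256 threads, and the existing index calculation blockIdx.x * blockDim.x + threadIdx.x already covers every element.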

Much appreciated.
