Help needed: cudaMemcpy returns wrong data

I am having trouble with cudaMemcpy. I have an array that I send to a CUDA kernel, which is supposed to return a modified copy of the same length and type. However, when I copy the data from the device back to the host, I do not get the correct numbers! I am getting a bit frustrated. If anyone has any insight, please let me know.

Thanks, much appreciated!

unsigned int * cpu_in_image = (unsigned int *) malloc(mem_size);
unsigned int * cpu_out_image = (unsigned int *) malloc(mem_size);

unsigned int* gpu_in_image;
CUDA_SAFE_CALL(cudaMalloc((void**)&gpu_in_image, mem_size));

//copy CPU memory to GPU memory
CUDA_SAFE_CALL(cudaMemcpy(gpu_in_image, cpu_in_image, mem_size, cudaMemcpyHostToDevice));

//allocate GPU memory for results
unsigned int *gpu_out_image;
CUDA_SAFE_CALL(cudaMalloc((void**)&gpu_out_image, mem_size));

//setup execution parameters
dim3 grid(1, 1, 1);
dim3 threads(num_threads, 1, 1); //Cannot have more than 2^16 Threads. That is, 256*256

//execute kernel
negativeKernel<<<grid, threads, 0>>>(gpu_in_image, gpu_out_image);
CUT_CHECK_ERROR("Kernel execution failed");

//Copy GPU Results Memory to CPU Results Memory
CUDA_SAFE_CALL(cudaMemcpy(cpu_out_image, gpu_out_image, mem_size, cudaMemcpyDeviceToHost)); //GPU OUT IMAGE IS MESSED UP

I then use a for loop to display my contents before and after. They are not the correct results.

The kernel is as follows:

__global__ void
negativeKernel( unsigned int* gpu_in_image, unsigned int* gpu_out_image)
{
	// global index of this thread across all blocks
	int tid = blockIdx.x * blockDim.x + threadIdx.x;
	gpu_out_image[tid] = gpu_in_image[tid];
}

I don’t know what to do with this anymore!

dim3 threads(num_threads, 1, 1); //Cannot have more than 2^16 Threads. That is, 256*256

What is num_threads? You cannot have more than 512 threads per block, not 2^16. That looks like your problem: the kernel never executed because num_threads is too big, so you are copying uninitialized GPU memory from gpu_out_image into cpu_out_image.
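One way to stay under the per-block limit is to spread the work over several blocks instead of one. Below is a minimal sketch of that idea, not the original poster's final code. The kernel name copyKernel, the extra length argument n, and the choice of 256 threads per block are assumptions made for illustration. The 512 limit applies to the hardware discussed here; the real limit depends on the device and can be read from the maxThreadsPerBlock field returned by cudaGetDeviceProperties. Error checking is left out to keep the sketch short.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same element-wise copy as negativeKernel, plus a length argument
// (an assumption, not in the original post) so that surplus threads
// in the last block do not touch memory past the end of the arrays.
__global__ void copyKernel(const unsigned int* in, unsigned int* out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        out[tid] = in[tid];
}

int main()
{
    const int n = 128 * 128;  // element count mentioned in the thread
    const size_t mem_size = n * sizeof(unsigned int);

    unsigned int* cpu_in  = (unsigned int*)malloc(mem_size);
    unsigned int* cpu_out = (unsigned int*)malloc(mem_size);
    for (int i = 0; i < n; ++i)
        cpu_in[i] = (unsigned int)i;

    unsigned int* gpu_in;
    unsigned int* gpu_out;
    cudaMalloc((void**)&gpu_in, mem_size);
    cudaMalloc((void**)&gpu_out, mem_size);
    cudaMemcpy(gpu_in, cpu_in, mem_size, cudaMemcpyHostToDevice);

    // Illustrative block size; it must not exceed the device limit.
    const int threadsPerBlock = 256;
    // Round up so every element is covered by some thread.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    copyKernel<<<blocks, threadsPerBlock>>>(gpu_in, gpu_out, n);

    cudaMemcpy(cpu_out, gpu_out, mem_size, cudaMemcpyDeviceToHost);

    // Count elements that did not survive the round trip.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (cpu_out[i] != cpu_in[i])
            ++mismatches;
    printf("mismatches: %d\n", mismatches);

    cudaFree(gpu_in);
    cudaFree(gpu_out);
    free(cpu_in);
    free(cpu_out);
    return 0;
}
```

With n = 128*128 and 256 threads per block this launches 64 blocks, so the index expression blockIdx.x * blockDim.x + threadIdx.x from the original kernel covers the whole array.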

CUT_CHECK_ERROR("Kernel execution failed"); should catch this, but it is only active in debug mode, so my guess is that you are compiling in release mode.
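If you want a check that does not depend on the build mode, you can ask the runtime directly after the launch. The following is a rough sketch under some assumptions: emptyKernel is a placeholder kernel made up for this example, and cudaDeviceSynchronize is the current name of the synchronization call (older toolkits used cudaThreadSynchronize). It deliberately launches 128*128 threads in a single block, as in the question, so the launch should be rejected.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel (hypothetical); only the launch configuration matters here.
__global__ void emptyKernel() {}

int main()
{
    // Report the per-block thread limit of device 0.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);

    // 128*128 = 16384 threads in one block, as in the question.
    dim3 grid(1, 1, 1);
    dim3 threads(128 * 128, 1, 1);
    emptyKernel<<<grid, threads>>>();

    // A bad launch configuration is reported by cudaGetLastError,
    // regardless of debug or release mode.
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Errors that happen while the kernel runs show up at synchronization.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Kernel ran without errors\n");
    return 0;
}
```

Checking right after the launch would have pointed at the launch itself rather than at the later cudaMemcpy, which only copies back whatever happens to be in the output buffer.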

Also, as a side note, you don't need the __syncthreads() calls in your kernel. They are only needed when threads in a block cooperate through shared memory.
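For contrast, here is a small sketch of a case where the barrier does matter. The kernel reverseBlock and the constant BLOCK_SIZE are made up for this example. Each thread writes one slot of a shared array and then reads a slot written by a different thread, so all writes have to finish before any read starts.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define BLOCK_SIZE 64  // illustrative size, well under any per-block limit

// Reverses BLOCK_SIZE elements in place using shared memory.
__global__ void reverseBlock(int* data)
{
    __shared__ int tile[BLOCK_SIZE];
    int t = threadIdx.x;

    tile[t] = data[t];
    // Every thread must finish writing its slot before any thread
    // reads a slot that another thread wrote.
    __syncthreads();
    data[t] = tile[BLOCK_SIZE - 1 - t];
}

int main()
{
    int host[BLOCK_SIZE];
    for (int i = 0; i < BLOCK_SIZE; ++i)
        host[i] = i;

    int* dev;
    cudaMalloc((void**)&dev, sizeof(host));
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);

    reverseBlock<<<1, BLOCK_SIZE>>>(dev);

    cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    printf("first: %d, last: %d\n", host[0], host[BLOCK_SIZE - 1]);

    cudaFree(dev);
    return 0;
}
```

In the copy kernel from the question each thread only touches its own element of global memory, so there is nothing for a barrier to protect.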

Thanks guys, that makes sense. My thread count might be too big. There are definitely more than 512 threads per block. I am using release mode, following some of the examples that came with the software.

num_threads is my thread count, but at this point, it was 128*128, so I need to cut that down.

Much appreciated.