cudaMemcpy problem

I’m trying to run a kernel that sets each element of an array to the value 7, but I can’t figure out why the result doesn’t change. I suspect the cudaMemcpy is not working. Here is the code:

__global__ void kernel(int * d_A ){

	int idx = blockIdx.x * blockDim.x + threadIdx.x;

	d_A[idx] = 7;

}
int main(int argc, char * argv[])	{


	int r = 10; // vector dimension 


	int *  V = (int*) malloc(sizeof(int)*r); 

	for(int j=0;j<r;j++)

			V[j] = rand();


	printf("\n PREVIEW\n");

	for(int h=0;h<r;h++) 

			printf("\n %d\n", V[h]);


	int *d_A = 0;			


	cudaMalloc((void**) &d_A, r*sizeof(int));



	dim3 dimBlock(10*sizeof(int));


	dim3 dimGrid(ceil(r/(int)10));


	kernel<<<dimGrid, dimBlock>>>(d_A);


	cudaMemcpy(V,d_A,r*sizeof(int), cudaMemcpyDeviceToHost);


	printf("\n Output V:\n");

	for(int h=0;h<r;h++)

			printf("\n %d\n", V[h]);

	// output is the same => V has not been modified by the cudaMemcpy

	return 0;
}

Your kernel performs out-of-bounds array accesses because you launch it with too many threads (you only need 10 threads, not 10*sizeof(int)). The expression for the grid size also looks quite fragile to me: r/(int)10 is already an integer division, so the ceil() around it has nothing left to round up. The common way to express this without any floating-point arithmetic is an integer division that rounds up. Furthermore, your code will fail if the number of threads you need is not an integer multiple of the block size, because the additional threads that come from rounding up the number of blocks would also perform out-of-bounds array accesses. This can be prevented by explicitly disabling the unneeded threads inside the kernel.
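To make the points above concrete, here is a minimal sketch of how the program could look with a rounded-up integer grid size and a bounds check in the kernel. It is only one way to do it, under some assumptions of mine: the helper divUp, the extra kernel parameter n, the block size of 4 (picked on purpose so it does not divide r) and the error checks are not part of the original code.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not in the original post): integer division that
// rounds up, i.e. the number of blocks needed to cover n elements.
__host__ __device__ inline int divUp(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// n is an added parameter (assumption) so the kernel knows the array length.
__global__ void kernel(int *d_A, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)        // surplus threads of the last block do nothing
        d_A[idx] = 7;
}

int main() {
    const int r = 10;         // vector dimension, as in the question
    const int blockSize = 4;  // assumption: deliberately not a divisor of r

    int *V = (int *)malloc(r * sizeof(int));
    for (int j = 0; j < r; j++)
        V[j] = rand();

    int *d_A = nullptr;
    cudaError_t err = cudaMalloc((void **)&d_A, r * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc: %s\n", cudaGetErrorString(err));
        free(V);
        return 1;
    }

    dim3 dimBlock(blockSize);           // threads per block, not bytes
    dim3 dimGrid(divUp(r, blockSize));  // 3 blocks, 12 threads for 10 elements

    kernel<<<dimGrid, dimBlock>>>(d_A, r);
    err = cudaGetLastError();           // reports launch configuration errors
    if (err != cudaSuccess)
        fprintf(stderr, "kernel launch: %s\n", cudaGetErrorString(err));

    // cudaMemcpy waits for the kernel to finish before copying.
    err = cudaMemcpy(V, d_A, r * sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));

    for (int h = 0; h < r; h++)
        printf("%d\n", V[h]);

    cudaFree(d_A);
    free(V);
    return 0;
}
```

With the guard in place, the two extra threads in the last block simply skip the write, so the block size can be chosen freely. Checking the return values is optional for the logic, but it shows immediately whether the launch or the copy is the step that fails.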

Also have a look at the tips in my signature.

Thank you sir, you solved my problem.